barishnamazov 21 hours ago

I like that this relies on generating SQL rather than just being a black-box chat bot. It feels like the right way to use LLMs for research: as a translator from natural language to a rigid query language, rather than as the database itself. Very cool project!

Hopefully your API doesn't get exploited and you are doing timeouts/sandboxing -- it'd be easy to do a massive join on this.

I also have a question mostly stemming from me being not knowledgeable in the area -- have you noticed any semantic bleeding when research is done between your datasets? e.g., "optimization" probably means different things under ArXiv, LessWrong, and HN. Wondering if vector searches account for this given a more specific question.

Xyra 6 hours ago | parent | next [-]

Exactly, people sometimes want precision and control. It's also very hard to beat SQL query planners when you have lots of materialized views and indexes. For most use cases, this is a lot more powerful for exploring these documents than just having them all as JSON on your local machine and writing whatever Python you wanted.

Yeah, I've put a lot of care into rate-limiting and security. We do AST parsing and block certain joins, and Hacker News hasn't bricked or overloaded my machine yet -- there's actually a lot more bandwidth for people to run expensive queries.
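
A toy sketch of that kind of query guard (the real system reportedly parses a proper SQL AST; this naive token check, with a hypothetical join limit, is only illustrative):

```python
import re

MAX_JOINS = 2  # hypothetical limit on joins per query


def is_query_allowed(sql: str) -> bool:
    """Naive guard: allow only SELECT statements, reject CROSS JOINs,
    and cap the number of joins. A real implementation would walk a
    parsed AST rather than matching raw tokens."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())
    if not tokens or tokens[0] != "SELECT":
        return False
    if "CROSS" in tokens:
        return False
    return tokens.count("JOIN") <= MAX_JOINS


print(is_query_allowed("SELECT * FROM posts JOIN users ON posts.user_id = users.id"))  # True
print(is_query_allowed("SELECT * FROM a CROSS JOIN b"))  # False
```

In practice you'd pair a check like this with statement timeouts and row limits on the database side, since token filtering alone can't bound query cost.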

As for getting good semantic queries across different domains, one thing Claude can do, besides using our embed endpoint to turn arbitrary text into a search vector, is compose centroids (averages) of vectors already in our database and search with those. For example, it can effortlessly average every LessWrong chunk embedding over text mentioning "optimization" and search with that. You can actually ask Claude to run an experiment: average the "optimization" vectors from different sources and see what kinds of results you get when querying each source with them. The fun challenge would then be finding legible vectors that bridge the gap between the different platforms' vectors. Maybe the cosine distance halves when you average the LessWrong "optimization" vector with embed("convex/nonconvex optimization, SGD, loss landscapes, constrained optimization.")
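
The centroid composition described above can be sketched in plain Python. Everything here is a stand-in: the random vectors substitute for real chunk embeddings, and no actual embed endpoint is called.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def centroid(vectors):
    """Average unit vectors componentwise, then re-normalize,
    yielding a single composite search vector."""
    dim = len(vectors[0])
    avg = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return normalize(avg)


def cosine_distance(a, b):
    return 1.0 - sum(x * y for x, y in zip(a, b))


random.seed(0)
# Hypothetical stand-ins for "optimization" chunk embeddings from one source
chunks = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(100)]
source_centroid = centroid(chunks)

# Blend the source centroid with an embedded query text (another stand-in)
query_vec = normalize([random.gauss(0, 1) for _ in range(8)])
bridged = centroid([source_centroid, query_vec])
```

The bridging experiment would then just compare `cosine_distance` between, say, another platform's centroid and `source_centroid` versus `bridged`.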

bredren 9 hours ago | parent | prev | next [-]

This is the route I went for making Claude Code and Codex conversation histories local and queryable by the CLIs themselves.

Create the DB and provide the tools and skill.

This blog entry explains how: https://contextify.sh/blog/total-recall-rag-search-claude-co...

It is a macOS client at present, but I have a Linux-ready engine I could use early feedback on if anyone is interested in giving it a go.
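
A minimal sketch of the "create the DB and provide the tools" approach, using SQLite's built-in FTS5 full-text search (the table and column names here are hypothetical, not the linked project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Full-text index over individual conversation turns
conn.execute("CREATE VIRTUAL TABLE turns USING fts5(session, role, content)")
conn.executemany(
    "INSERT INTO turns VALUES (?, ?, ?)",
    [
        ("s1", "user", "how do I profile memory usage in python"),
        ("s1", "assistant", "use tracemalloc to snapshot allocations"),
        ("s2", "user", "refactor the auth middleware"),
    ],
)
# Rank matches with FTS5's built-in BM25 ordering, best first
rows = conn.execute(
    "SELECT session, content FROM turns WHERE turns MATCH ? ORDER BY rank",
    ("memory",),
).fetchall()
print(rows)
```

With the index in place, exposing it to a CLI agent is mostly a matter of wrapping queries like this in a tool definition.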

keeeba 19 hours ago | parent | prev | next [-]

I don’t have the experiments to prove this, but from my experience it’s highly variable between embedding models.

Larger, more capable embedding models are better able to separate the different uses of a given word in the embedding space; smaller models are not.

Xyra 11 hours ago | parent | next [-]

I'm using Voyage-3.5-lite at halfvec(2048), which, from my limited research, seems to be one of the best embedding models. There's semi-sophisticated ~300-token chunking that breaks on paragraphs and sentences.
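
A rough sketch of paragraph-then-sentence chunking of the kind described. This is not the project's pipeline: the ~300-token budget is approximated by word count here, whereas a real pipeline would count with the embedding model's tokenizer.

```python
import re

MAX_TOKENS = 300  # approximate budget, counted in words for this sketch


def chunk(text: str, max_tokens: int = MAX_TOKENS):
    """Split on paragraph breaks first; fall back to sentence splits
    when a single paragraph would overflow the budget."""
    chunks, current, count = [], [], 0
    for para in re.split(r"\n\s*\n", text.strip()):
        if len(para.split()) <= max_tokens:
            pieces = [para]
        else:
            pieces = re.split(r"(?<=[.!?])\s+", para)
        for piece in pieces:
            n = len(piece.split())
            if current and count + n > max_tokens:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(piece)
            count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk would then be embedded and stored alongside its source document for vector search.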

When Claude uses our embed endpoint to embed arbitrary text as a search vector, it should work pretty well across domains. One can also use compositions of centroids (averages) of vectors in our database as search vectors.

A4ET8a8uTh0_v2 19 hours ago | parent | prev [-]

I've been thinking about this a fair bit lately. We have all sorts of benchmarks that describe a lot of factors in detail, but they're very abstract and don't seem to map cleanly onto observed behavior. I think we need a different way of characterizing these models.

llmslave2 8 hours ago | parent | prev [-]

> I like that this relies on generating SQL rather than just being a black-box chat bot.

When people say AI is a bubble but will still be transformational, I think of stuff like this. The number of use cases for natural-language interpretation and translation is enormous even without all the BS vibe-coding nonsense. I reckon once the bubble pops, most investment will go into tools that operate something like this.