Remix.run Logo
oblio 10 days ago

It feels weird that the search index is bigger than the underlying data, weren't search indexes supposed to be efficient formats giving fast access to the underlying data?

andylizf 10 days ago | parent | next [-]

Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.

iezepov 9 days ago | parent | prev | next [-]

Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.

catlifeonmars 9 days ago | parent [-]

Why would the embeddings be higher dimensionally than the data? I imagine the embeddings would contain relatively higher entropy (and thus lower redundancy) than many types of source data.

cm228 9 days ago | parent [-]

depends on the chunk-size used to create the embedding.

yichuan 10 days ago | parent | prev | next [-]

I guess for semantic search(rather than keyword search), the index is larger than the text because we need to embed them into a huge semantic space, which make sense to me

brookst 9 days ago | parent | prev [-]

Nonclustered indexes in RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering it’s still a win.