It feels weird that the search index is bigger than the underlying data. Weren't search indexes supposed to be efficient formats giving fast access to the underlying data?

andylizf 10 days ago

Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.

iezepov 9 days ago

Good point! Maybe "indexing" is the wrong term here, and it's more like feature extraction (and since embeddings are high-dimensional, we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.

catlifeonmars 9 days ago

Why would the embeddings be higher-dimensional than the data? I imagine the embeddings would contain relatively higher entropy (and thus lower redundancy) than many types of source data.

cm228 9 days ago

Depends on the chunk size used to create the embedding.
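
To put rough numbers on the chunk-size point, here is a back-of-the-envelope sketch. The figures are assumptions for illustration, not taken from LEANN or any particular model: a hypothetical 768-dimensional float32 embedding per chunk, plain ASCII text at 1 byte per character, and no extra overhead for graph or index structures.

```python
# Back-of-the-envelope: stored embedding size vs. raw text size per chunk.
# All numbers are illustrative assumptions, not measurements of a real system.

def embedding_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """Size of one dense vector, e.g. float32 = 4 bytes per dimension."""
    return dims * bytes_per_value

def index_to_text_ratio(chunk_chars: int, dims: int,
                        bytes_per_value: int = 4,
                        bytes_per_char: int = 1) -> float:
    """How many bytes of embedding are stored per byte of chunk text."""
    return embedding_bytes(dims, bytes_per_value) / (chunk_chars * bytes_per_char)

if __name__ == "__main__":
    # Hypothetical chunk sizes in characters; 768 dims is an assumed model width.
    for chunk_chars in (256, 1024, 4096, 16384):
        ratio = index_to_text_ratio(chunk_chars, dims=768)
        print(f"chunk={chunk_chars:>6} chars  embedding/text = {ratio:.2f}x")
```

Under these assumptions one vector is 3072 bytes, so the embeddings outweigh the text for chunks shorter than about 3 KB and come in smaller for longer chunks.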

yichuan 10 days ago

I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed the text into a huge semantic space, which makes sense to me.

brookst 9 days ago

Nonclustered indexes in an RDBMS can be larger than the tables they index. It’s usually a sign of poor design or of indexing a very simple schema in a non-trivial way, but the ultimate goal of an index is speed, not size. As long as you can select and use only a subset of the index based on its ordering, it’s still a win.
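
A toy illustration of the "use only a subset of the index based on its ordering" idea. This is a minimal sketch, not how any real RDBMS stores its indexes: the "table" is a hypothetical list of city names, and the "index" is a sorted list of (key, row_id) pairs searched with Python's `bisect` module.

```python
import bisect

# Hypothetical table: row_id is the list position, the value is the indexed column.
cities = ["Oslo", "Berlin", "Paris", "Berlin", "Lima", "Oslo"]

# The "index": (key, row_id) pairs kept in key order, separate from the table.
index = sorted((city, row_id) for row_id, city in enumerate(cities))
keys = [key for key, _ in index]  # parallel list of keys for binary search

def lookup_range(keys, index, low, high):
    """Return row ids whose key is in [low, high], touching only that slice."""
    start = bisect.bisect_left(keys, low)    # first entry with key >= low
    end = bisect.bisect_right(keys, high)    # one past last entry with key <= high
    return [row_id for _, row_id in index[start:end]]

if __name__ == "__main__":
    print(lookup_range(keys, index, "Lima", "Oslo"))
    print(lookup_range(keys, index, "Berlin", "Berlin"))
```

The index duplicates every key and adds a row id, so it takes more space than the column itself, but a lookup binary-searches to one contiguous slice instead of scanning every row.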