oblio 10 days ago
It feels weird that the search index is bigger than the underlying data. Weren't search indexes supposed to be efficient formats giving fast access to the underlying data?
andylizf 10 days ago
Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as large embedding vectors, and LEANN is our solution to that paradoxical inefficiency.
iezepov 9 days ago
Good point! Maybe "indexing" is the wrong term here, and it's more like feature extraction (and since embeddings are high-dimensional, we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.
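A rough back-of-the-envelope sketch of why that happens. The numbers are illustrative assumptions, not measurements from LEANN or any particular embedding model: 1,000-character chunks of mostly ASCII text (about 1 byte per character), 768-dimensional embeddings, each value stored as a 32-bit float.

```python
# Back-of-the-envelope size comparison between a text chunk and its embedding.
# All numbers below are assumptions chosen for illustration only.

def embedding_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Storage for one dense vector: one fixed-width value per dimension."""
    return dim * bytes_per_value


def index_to_text_ratio(chunk_bytes: int, dim: int, bytes_per_value: int = 4) -> float:
    """How many bytes of vector are stored per byte of source text."""
    return embedding_bytes(dim, bytes_per_value) / chunk_bytes


CHUNK_BYTES = 1000   # assumed: ~1,000 ASCII characters per chunk
DIM = 768            # assumed embedding dimensionality
FLOAT32_BYTES = 4    # 32-bit floats

vector_size = embedding_bytes(DIM, FLOAT32_BYTES)            # 3072 bytes
ratio = index_to_text_ratio(CHUNK_BYTES, DIM, FLOAT32_BYTES)  # 3.072
```

Under these assumptions the vectors alone take roughly 3x the space of the text they describe. Shorter chunks or more dimensions push the ratio up; storing fewer bytes per value pushes it down.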
yichuan 10 days ago
I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed the text into a huge semantic space, which makes sense to me.
brookst 9 days ago
Nonclustered indexes in an RDBMS can be larger than the tables they index. That's usually the result of poor design or of indexing a very simple schema in a non-trivial way, but the ultimate goal of an index is speed, not size. As long as you can select and use only a subset of the index based on its ordering, it's still a win.
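A toy sketch of that last point, not how any real RDBMS implements it: the "table" is an unordered Python list, the "nonclustered index" is a separate sorted list of (key, row position) pairs, and a range query binary-searches the index so it only touches the matching slice. The table contents and the `range_lookup` helper are made up for illustration.

```python
import bisect

# Hypothetical table: rows are stored in insertion order, not sorted by age.
rows = [("ann", 31), ("bob", 25), ("cy", 42), ("dee", 31)]

# Toy nonclustered index on age: a separate sorted structure of
# (age, row_position) pairs. It duplicates the key and adds a pointer,
# so it costs extra space on top of the table itself.
age_index = sorted((age, pos) for pos, (_, age) in enumerate(rows))


def range_lookup(index, table, lo, hi):
    """Return rows with lo <= age <= hi, reading only a slice of the index."""
    # (lo, -1) sorts before every real entry with age == lo.
    start = bisect.bisect_left(index, (lo, -1))
    # (hi, len(table)) sorts after every real entry with age == hi.
    end = bisect.bisect_right(index, (hi, len(table)))
    return [table[pos] for _, pos in index[start:end]]


matches = range_lookup(age_index, rows, 30, 35)
```

The index is pure overhead in bytes, but because it is ordered, the lookup costs two binary searches plus the matching entries rather than a scan of every row.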