cluckindan 3 days ago

How is this different from running tuned HNSW vector indices on Elasticsearch?

ashvardanian 2 days ago | parent | next [-]

Lucene is tough to deal with. About 15 hours ago — right when this comment was posted — I was giving a talk at Databricks comparing the world’s most widely used search engines. I’ve never run into as many issues with any other similar tool as I did with Lucene. To be fair, it’s been around for ~26 years and has aged remarkably well... but it’s the last thing I’d choose today.

ab5tract 2 days ago | parent | next [-]

Can I ask you which alternatives exist at the layer Lucene occupies?

I went looking around last year and couldn’t really find many options, but I might have been looking in the wrong places.

ashvardanian 5 hours ago | parent [-]

For vector search, the top two are Meta’s FAISS and (my) Unum’s USearch. Lucene powers Elastic, Solr, MongoDB Atlas, AWS OpenSearch, and Azure Cognitive Search. USearch powers ClickHouse, DuckDB, YugaByte, TiDB, ScyllaDB, MemGraph, KuzuDB, Lantern, and a few big closed-source names that don’t mention it, as far as I know. FAISS has the highest usage among Python developers, but if you are indexing large collections you should consider alternatives.
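
To make the comparison concrete, here is a minimal sketch (not part of the original comment) of building an HNSW index with each library; the package names, dimensions, and parameters are illustrative assumptions, not recommendations.

    # Minimal sketch: comparable HNSW indexes with FAISS and USearch.
    # Assumes `faiss-cpu`, `usearch`, and `numpy` are installed; sizes and
    # parameters below are made up for illustration.
    import numpy as np
    import faiss
    from usearch.index import Index

    dim, n = 256, 10_000
    vectors = np.random.rand(n, dim).astype(np.float32)
    query = vectors[:1]

    # FAISS: HNSW graph with 32 links per node, vectors stored flat.
    faiss_index = faiss.IndexHNSWFlat(dim, 32)
    faiss_index.add(vectors)
    distances, ids = faiss_index.search(query, 10)

    # USearch: HNSW index with cosine similarity, keys assigned explicitly.
    usearch_index = Index(ndim=dim, metric="cos")
    usearch_index.add(np.arange(n), vectors)
    matches = usearch_index.search(vectors[0], 10)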

cluckindan 2 days ago | parent | prev [-]

Interesting, then, that Vectroid would choose to fork it.

Elasticsearch is at least good at hiding the Lucene zoo under the hood.

talipozturk 3 days ago | parent | prev | next [-]

Co-founder of Vectroid here: we forked Lucene. Lucene is awesome for search in general, for filters, and obviously for full-text search. It is very mature and well supported by so many big names and amazing engineers. So we take advantage of that, but we had to change a few things to make it work well for the vector use case. We basically think vectors should be the primary data type, since they are the most difficult one to deal with.

For instance, we modified Lucene to use a configurable number of CPUs/threads to build a single segment index. As a result, if/when needed, we can utilize hundreds of CPUs to index faster and generate fewer segments, which enables lower query latency. We also built a custom file system Directory for Lucene to work off of GCS directly (or S3 later on). It can bypass the kernel, read from the network, and write directly into memory... no SSD, no page cache, no mmap involved. Perhaps I should not say more...
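
As a rough illustration of the "no SSD, no page cache" idea only (not Vectroid's actual Lucene Directory, which is not public), pulling an index segment from GCS straight into process memory might look like the sketch below; the bucket and object names are hypothetical, and it assumes the `google-cloud-storage` Python client.

    # Conceptual sketch only: stream an index segment from GCS directly into
    # RAM, skipping local disk, the page cache, and mmap. Bucket and object
    # names are hypothetical; requires the `google-cloud-storage` package.
    import io
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("vector-index-bucket")   # hypothetical bucket
    blob = bucket.blob("segments/_0.cfs")           # hypothetical segment file

    data = blob.download_as_bytes()                 # network -> process memory
    segment = io.BytesIO(data)                      # random-access reads from RAM

    segment.seek(0)
    header = segment.read(16)                       # e.g. inspect a segment header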

wwdmaxwell 3 days ago | parent | prev [-]

Aside from being serverless, this is like Elasticsearch but with a kind of built-in Redis-like layer, I think.