aubreypc 8 hours ago
Would you mind sharing any lessons learned / which parameters you were experimenting with? I'm working on a Vespa hybrid lexical + HNSW retrieval system at the moment with quite a large corpus (1B+ vectors), so I'd be quite interested to hear what worked well for others.
kgeist 5 hours ago | parent
I experimented with chunking strategies (how to split, what size chunks should be, how much they should overlap, etc.), query rewriting (one query produces several subqueries to explore different possibilities/search paths in parallel), the number of items at each stage (how many documents to retrieve at the embedding stage vs. the reranker stage), what weights to use for BM25 vs. vector search (i.e. what influences the hybrid score more), how to merge subresults from different parallel paths, etc. Rough sketches of a few of these pieces are at the end of this comment. It was tuned for a specific set of open-source models we run ourselves on our own GPUs, so I can't share exact golden numbers (for example, if I replace those small models with Claude Haiku + Cohere Embed, the results get worse).

A proper reranker helped tremendously because it removed noise. BM25 helped a lot too, because in many cases you want exact-match search instead of fuzzy/vector search (so, again, less noise).

For small open-source models (we used them because we wanted speed), prompt engineering mattered too, especially in cross-language benchmarks where the model may get confused about which language it should respond in (the system prompt's language, the user query's language, or the documents' language). Even the order of fields in the output JSON schema (in intermediate steps) mattered, because LLMs are autoregressive: if you order the fields incorrectly, the model may guess or hallucinate during extraction, because the first value in the schema can't be extracted reliably without first considering dependent fields that should have been extracted earlier (we don't use reasoning models, to save on speed).

I used LLM-as-a-judge to quickly figure out what improved scores and what didn't. Humans then tested it manually too and calculated scores, to see whether their scores diverged from the machine's. If I had to do it again, I'd probably use an agent (like autoresearch) to autonomously find the best configuration for the exact set of models via intelligent brute force (dunno if it would work, but it would be interesting to try).

We don't have 1B+ vectors: our system is split into tenants (organizations), and a single tenant usually doesn't have that many vectors. Every document in the system also has a specific hierarchical structure, so your mileage may vary.
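To make the chunking knob concrete, here's a minimal character-level sketch of the kind of overlapping splitter we swept parameters for; the size/overlap values below are placeholders, not our tuned numbers:

    def chunk(text, size=512, overlap=64):
        # split text into overlapping windows; size and overlap are the tunables
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]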
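For the BM25-vs-vector weighting and the merging of parallel paths, a toy version looks like the following (the alpha weight and reciprocal rank fusion here are illustrative choices, not necessarily what we shipped):

    from collections import defaultdict

    def hybrid_score(bm25_score, vector_score, alpha=0.7):
        # alpha decides whether fuzzy vector similarity or exact-match BM25
        # dominates the hybrid score (both assumed normalized to [0, 1])
        return alpha * vector_score + (1 - alpha) * bm25_score

    def fuse_paths(result_lists, k=60, top_n=50):
        # merge ranked doc-id lists from parallel subquery paths using
        # reciprocal rank fusion, then keep top_n candidates for the reranker
        fused = defaultdict(float)
        for results in result_lists:          # each list: doc ids, best first
            for rank, doc_id in enumerate(results):
                fused[doc_id] += 1.0 / (k + rank + 1)
        return sorted(fused, key=fused.get, reverse=True)[:top_n]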
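Query rewriting is conceptually just fan-out/fan-in; here's a hypothetical sketch where the llm() and search() callables and the prompt text are made up for illustration:

    import json
    from concurrent.futures import ThreadPoolExecutor

    REWRITE_PROMPT = (
        "Rewrite the user query into 3 diverse search subqueries that explore "
        "different interpretations. Respond with a JSON array of strings.\n\n"
        "Query: {q}"
    )

    def rewrite_query(q, llm):
        # ask the model for several subqueries exploring different search paths
        return json.loads(llm(REWRITE_PROMPT.format(q=q)))

    def parallel_retrieve(q, llm, search):
        # fan each subquery out to the hybrid retriever in parallel; the ranked
        # lists can then be merged, e.g. with fuse_paths from the sketch above
        subqueries = rewrite_query(q, llm)
        with ThreadPoolExecutor() as pool:
            return list(pool.map(search, subqueries))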
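And on the JSON field order point: an autoregressive model fills the schema top to bottom, so fields that a value depends on should come first. A toy illustration (the field names are invented, not from our actual schema):

    # bad: the model has to commit to "answer_language" before it has looked
    # at any evidence, so it tends to guess (e.g. the system prompt's language)
    bad_field_order = ["answer_language", "relevant_quotes", "answer"]

    # better: extract supporting quotes first, then the language they imply,
    # then the final answer; each field conditions on the ones already generated
    good_field_order = ["relevant_quotes", "answer_language", "answer"]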