lsorber 2 days ago:
You don’t have to reduce a long context to a single embedding vector. Instead, you can compute the token embeddings of a long context and then pool those into, say, sentence embeddings. The benefit is that each sentence’s embedding is informed by all of the other sentences in the context. So when a sentence refers to “The company”, for example, its embedding will have captured which company that is based on the other sentences in the context.

This technique is called ‘late chunking’ [1], and it is based on another technique called ‘late interaction’ [2]. You can combine late chunking (to pool token embeddings) with semantic chunking (to partition the document) for even better retrieval results. For an example implementation that applies both techniques, check out RAGLite [3].

[1] https://weaviate.io/blog/late-chunking

[2] https://jina.ai/news/what-is-colbert-and-late-interaction-an...
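Roughly, the mechanics look like this. The sketch below is a minimal illustration, not RAGLite’s actual implementation: it assumes a BERT-style Hugging Face encoder (“sentence-transformers/all-MiniLM-L6-v2”) and hard-coded sentence boundaries for brevity; in practice you’d use a long-context embedding model and a real (e.g. semantic) chunker.

```python
# Minimal late-chunking sketch: embed the whole document once, then pool
# token embeddings per sentence. Model choice and sentence splitting are
# illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "Acme Corp reported record revenue in Q3.",
    "The company attributed the growth to its new product line.",
]
document = " ".join(sentences)

# 1) Embed the whole document once, so every token embedding is conditioned
#    on the full context (the "late" part: chunking happens afterwards).
encoded = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
offsets = encoded.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, dim)

# 2) Pool the token embeddings per sentence using character offsets.
chunk_embeddings = []
start = 0
for sentence in sentences:
    end = start + len(sentence)
    idx = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]
    chunk_embeddings.append(token_embeddings[idx].mean(dim=0))
    start = end + 1  # skip the space between sentences

# "The company" in the second chunk now carries information about Acme Corp,
# unlike embedding each sentence in isolation.
```

The chunk embeddings are then indexed and searched like any other sentence embeddings; only the way they were produced differs.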
visarga 2 days ago (reply):
You can achieve a similar effect by using an LLM to do question answering prior to embedding. It’s much more flexible but slower, and you can use CoT or even graph RAG. Late chunking is a faster, implicit alternative.
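A minimal sketch of that explicit approach, assuming the OpenAI Python client and an illustrative prompt (neither is prescribed above): each chunk is rewritten by an LLM with the full document as context, and the rewritten text is what gets embedded.

```python
# Explicit alternative to late chunking: use an LLM to make each chunk
# self-contained before embedding it. Model name and prompt are assumptions.
# Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def contextualize_chunk(chunk: str, document: str) -> str:
    """Ask the LLM to resolve references like 'the company' using the full document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works
        messages=[
            {
                "role": "system",
                "content": "Rewrite the chunk so it is understandable without the "
                           "rest of the document. Resolve pronouns and vague references.",
            },
            {"role": "user", "content": f"Document:\n{document}\n\nChunk:\n{chunk}"},
        ],
    )
    return response.choices[0].message.content

# The rewritten chunks are embedded and indexed as usual. This costs an LLM call
# per chunk, which is why late chunking is the faster, implicit alternative.
```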
voiper1 2 days ago (reply):
I read both those articles, but I still don’t get how to do it. It seems the idea is that more of the embedding is informed by context, but how do I actually _do_ late chunking? My best guess so far is that I embed a long text and then somehow break up the returned embedding into multiple parts and search each separately, but that doesn’t sound right.