voiper1 2 days ago

I read both those articles, but I still don't get how to do it. It seems the idea is that more of the embedding is informed by context, but how do I _do_ late chunking?

My best guess so far is that somehow I embed a long text and then I break up the returned embedding into multiple parts and search each separately? But that doesn't sound right.

lsorber 2 days ago | parent | next

The name ‘late chunking’ is indeed somewhat of a misnomer in the sense that the technique does not partition documents into document chunks. What it actually does is pool token embeddings (of a large context) into, say, sentence embeddings. The result is that your document is now represented as a sequence of sentence embeddings, each of which is informed by the other sentences in the document.
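
In code, a minimal sketch looks something like this (using a small Hugging Face model as a stand-in for a proper long-context embedding model, and recovering the sentence spans from the tokenizer's offset mapping):

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Placeholder model: in practice you'd use a long-context embedding model.
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    sentences = [
        "Berlin is the capital of Germany.",
        "It has a population of about 3.7 million.",
    ]
    document = " ".join(sentences)

    # 1. Embed the *whole* document once so every token sees the full context.
    encoded = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
    offsets = encoded.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        token_embeddings = model(**encoded).last_hidden_state[0]  # (num_tokens, dim)

    # 2. Map each sentence to the token positions it covers via character offsets.
    spans, start = [], 0
    for sentence in sentences:
        end = start + len(sentence)
        spans.append([i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s])
        start = end + 1  # skip the joining space

    # 3. 'Late chunking': mean-pool the contextualized token embeddings per sentence.
    sentence_embeddings = [token_embeddings[ids].mean(dim=0) for ids in spans]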

Then, you want to partition the document into chunks. Late chunking pairs really well with semantic chunking because it can use late chunking's improved sentence embeddings to find semantically more cohesive chunks. In fact, you can cast this as a binary integer programming problem and find the ‘best’ chunks this way. See RAGLite [1] for an implementation of both techniques, including the formulation of semantic chunking as an optimization problem.
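
A rough illustration of the pairing (a greedy simplification, not the binary integer programming formulation RAGLite actually uses): split wherever adjacent late-chunked sentence embeddings stop being similar.

    import torch.nn.functional as F

    def greedy_semantic_chunks(sentences, sentence_embeddings, threshold=0.6, max_sentences=8):
        # Greedy stand-in for semantic chunking: start a new chunk whenever the
        # cosine similarity between neighbouring sentence embeddings drops.
        chunks, current = [], [0]
        for i in range(1, len(sentences)):
            similarity = F.cosine_similarity(sentence_embeddings[i - 1], sentence_embeddings[i], dim=0)
            if similarity < threshold or len(current) >= max_sentences:
                chunks.append(current)
                current = []
            current.append(i)
        chunks.append(current)
        return [" ".join(sentences[i] for i in chunk) for chunk in chunks]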

Finally, you have a sequence of document chunks, each represented as a multi-vector sequence of sentence embeddings. You could choose to pool these sentence embeddings into a single embedding vector per chunk. Or, you could leave the multi-vector chunk embeddings as-is and apply a more advanced querying technique like ColBERT's MaxSim [2].

[1] https://github.com/superlinear-ai/raglite

[2] https://huggingface.co/blog/fsommers/document-similarity-col...
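
To make the MaxSim option above concrete, here's a rough sketch of the scoring (not ColBERT's full pipeline): each query vector is matched with its most similar chunk vector and the similarities are summed.

    import torch
    import torch.nn.functional as F

    def maxsim(query_embeddings: torch.Tensor, chunk_embeddings: torch.Tensor) -> float:
        # query_embeddings: (num_query_vectors, dim); chunk_embeddings: (num_chunk_vectors, dim)
        q = F.normalize(query_embeddings, dim=-1)
        c = F.normalize(chunk_embeddings, dim=-1)
        return (q @ c.T).max(dim=1).values.sum().item()

    # Rank chunks by their MaxSim score against the query (names here are illustrative):
    # scores = [maxsim(query_vectors, torch.stack(chunk)) for chunk in chunk_sentence_embeddings]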

causal a day ago | parent

What does it mean to "pool" embeddings? The first article seems to assume the reader is familiar

deepsquirrelnet a day ago | parent

“Pooling” just refers to aggregation methods. It could mean taking max or average values, or something more exotic like attention pooling. It’s meant to reduce the many per-token vectors down to one per passage or document.
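
For example, given one vector per token:

    import torch

    token_embeddings = torch.rand(12, 384)            # 12 tokens, 384 dims each

    mean_pooled = token_embeddings.mean(dim=0)        # average pooling -> one 384-dim vector
    max_pooled = token_embeddings.max(dim=0).values   # max pooling     -> one 384-dim vector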

_hl_ 2 days ago | parent | prev

You’d need to go a level below the API that most embedding services expose.

A transformer-based embedding model doesn’t just give you a vector for the entire input string; it gives you a vector for each token. These per-token vectors are then “pooled” together (e.g. averaged, max-pooled, or combined by some other strategy) to reduce the many vectors down into a single vector.

Late chunking means changing this reduction to yield many vectors instead of just one.
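
A minimal sketch of what that looks like below the API (the model name is just an example, and the spans are split arbitrarily here purely for illustration):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    inputs = tokenizer("Late chunking pools the token vectors per span.", return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state[0]      # one vector per token

    single_vector = token_vectors.mean(dim=0)                     # the usual API output: one vector
    span_vectors = [span.mean(dim=0)                              # late chunking: one vector per span
                    for span in token_vectors.chunk(3, dim=0)]    # spans chosen arbitrarily here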