minimaxir 4 days ago
It's likely because the definition of "similar" varies, and it doesn't necessarily mean semantic similarity. Depending on how the embedding model was trained, texts with merely a similar format/syntax can indeed be "similar" along that axis. The absolute value of the cosine similarity isn't critical (only the relative order when comparing multiple candidates), but if you finetune an embedding model for a specific domain, it will produce a wider range of cosine similarities, since it can learn which attributes specifically make texts similar or dissimilar.
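A quick sketch of the point about ordering vs. absolute values, using made-up toy vectors rather than real model embeddings: the raw scores all cluster in a narrow high band, but the ranking is still meaningful.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the L2-normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (invented for illustration, not from any real model).
query = np.array([0.9, 0.8, 0.7, 0.6])
candidates = {
    "doc_a": np.array([0.9, 0.8, 0.6, 0.7]),
    "doc_b": np.array([0.7, 0.9, 0.8, 0.5]),
    "doc_c": np.array([0.6, 0.7, 0.9, 0.9]),
}

scores = {name: cosine_similarity(query, v) for name, v in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
# All similarities land in a compressed band (roughly 0.95 to 1.0),
# yet the relative order still separates the candidates cleanly.
print(ranked)   # best-to-worst candidate order
print(scores)
```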
teepo 4 days ago | parent
Thanks - that helped it click a bit more. If the relative ordering is correct, it doesn't matter that the scores look so compressed.