yorwba 3 days ago

I don't think Mixture of Logits from the paper you link circumvents the theoretical limitations pointed out here, since the dataset sizes they evaluate on mostly stay well below the limit.

In the end they still rely on Maximum Inner Product Search, just with several lookups over smaller partitions of the full embedding. The largest dataset is Books, where this paper suggests you'd need more than 512 embedding dimensions, and there MoL with a 256-dimensional embedding split into 8 parts of 32 dimensions each has an abysmal hit rate.
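
To be concrete about the lookup pattern I mean: one MIPS-style top-k query per partition, then the merged candidates get re-scored with the full similarity. A rough sketch (brute-force torch.topk standing in for a real MIPS/ANN index; sizes and names are made up, not taken from the paper):

    # Illustrative only: per-partition top-k lookups, then merge candidates
    # for re-scoring with the full similarity (e.g. the MoL score).
    import torch

    def candidate_lookup(q, item_emb, num_parts=8, k=100):
        """q: (d,) query embedding; item_emb: (N, d) item embedding table."""
        d = q.shape[0]
        d_p = d // num_parts
        q_parts = q.view(num_parts, d_p)                 # (P, d_p)
        item_parts = item_emb.view(-1, num_parts, d_p)   # (N, P, d_p)
        # Inner product of each query partition with the same partition of
        # every item: shape (N, P).
        part_scores = torch.einsum('pd,npd->np', q_parts, item_parts)
        # One top-k lookup per partition, then union the candidate ids.
        topk_ids = part_scores.topk(k, dim=0).indices    # (k, P)
        return torch.unique(topk_ids.flatten())
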

So that's hardly a demonstration that arbitrary high-rank distributions can be approximated well. MoL does seem to approximate them better than the other approaches, but all of them are clearly hampered by the small embedding size.

lunarmony 18 minutes ago | parent

Mixture of Logits has actually already been deployed on datasets at the 100M+ scale at Meta and at LinkedIn (https://arxiv.org/abs/2306.04039, https://arxiv.org/abs/2407.13218, etc.). The crucial departure from traditional embedding/multi-embedding approaches is in learning a query- and item-dependent gating function, which enables MoL to become a universal high-rank approximator (assuming we care about recall@1) even when the input embeddings are low-rank.
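
For readers who haven't looked at the papers, the scoring side looks roughly like this. This is only a minimal sketch in PyTorch: the MLP gating network, dimensions, and names here are illustrative, not the exact architectures used in the papers.

    # Minimal sketch of a Mixture-of-Logits (MoL) scorer (illustrative).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoLScorer(nn.Module):
        def __init__(self, d_model=256, num_components=8):
            super().__init__()
            assert d_model % num_components == 0
            self.P = num_components
            self.d_p = d_model // num_components   # e.g. 8 slices of 32 dims
            # Gating network over the P component logits; it sees both the
            # query and the item, so the mixture weights are pair-dependent.
            self.gate = nn.Sequential(
                nn.Linear(2 * d_model, 128), nn.ReLU(),
                nn.Linear(128, num_components),
            )

        def forward(self, q, x):
            # q: (B, d_model) query embeddings, x: (B, d_model) item embeddings.
            B = q.shape[0]
            q_p = q.view(B, self.P, self.d_p)       # split into P low-rank parts
            x_p = x.view(B, self.P, self.d_p)
            logits = (q_p * x_p).sum(-1)            # (B, P) per-component inner products
            pi = F.softmax(self.gate(torch.cat([q, x], dim=-1)), dim=-1)
            return (pi * logits).sum(-1)            # (B,) MoL similarity scores

Because the mixture weights depend on the (query, item) pair rather than on the query alone, the resulting similarity is no longer a single inner product in a fixed d_model-dimensional space, which is what the high-rank approximation argument rests on.
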