Remix.run Logo
stared a day ago

If vectors life in an effectively lower space that they could, they don't live up to their n-dimensional potential.

Sometimes these things are patched with cosine distance (or even - Pearson correlation), vide https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally when we don't need to and vectors occupy the space.

I am kind of surprised that the original article does not mention batch normalization and similar operations - these are pretty much created to automatically de-bias and de-correlate values at each layer.