aspenmartin a day ago

I don’t think that’s the case for this approach. It sounds like this is related to the Large Concept Model (https://arxiv.org/abs/2412.08821), where the latent space is SONAR, which is very much designed for this purpose. SONAR embeddings are learned so that every sentence with the same semantic meaning gets mapped to the same latent representation. So you can have e.g. a French SONAR encoder and a Finnish SONAR encoder, trained separately on large-scale corpora of paired sentences with the same meaning (basically the same data you would use to train translation models directly, except that with SONAR you don’t need to train a separate model per language pair). The LCM then works in this language-agnostic SONAR space, which means it does (in principle) learn concepts from text or speech in all supported languages.
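
To make the "same meaning, same latent point" property concrete, here’s a minimal sketch using the facebookresearch/SONAR pip package (sonar-space). The pipeline and model names follow that repo’s README, but treat the exact API as an assumption that may differ across versions; also note the released basic text encoder is a single multilingual model selected via a FLORES-200 language code, rather than literally separate per-language encoders (those exist for the speech side).

    # Sketch: same-meaning sentences in different languages should land
    # close together in the shared SONAR embedding space.
    # Assumes `pip install sonar-space` and the model names from the
    # SONAR README; both are assumptions, not guaranteed stable.
    import torch
    from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

    encoder = TextToEmbeddingModelPipeline(
        encoder="text_sonar_basic_encoder",
        tokenizer="text_sonar_basic_encoder",
    )

    # Two translations of "The cat sleeps on the sofa.", each encoded
    # with its FLORES-200 language code.
    french = encoder.predict(["Le chat dort sur le canapé."], source_lang="fra_Latn")
    finnish = encoder.predict(["Kissa nukkuu sohvalla."], source_lang="fin_Latn")

    # Each embedding is a 1024-d vector; if the space is truly
    # language-agnostic, the cosine similarity should be close to 1.
    similarity = torch.nn.functional.cosine_similarity(french, finnish)
    print(similarity.item())

An LCM built on top of this never sees tokens in a particular language: it predicts the next sentence embedding in that shared space, and a SONAR decoder for whichever language you want turns the predicted embedding back into text.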