andrewrn 5 hours ago

Something that annoys me about the title: the LLMs aren't taking in the raw data (LLMs are for text, after all). The raw data is fed through audio and motion models that produce natural-language descriptions, which are then fed to the LLM.

Unrelated: yeah, this article is a little creepy, but damn is it interesting technically.