KoolKat23 3 days ago
1. You do. You probably have a different version of that definition and are saying I'm wrong merely for not holding yours.

2. That directly addresses your point. In the abstract, it shows they're basically no different from multimodal models: train them on different data types and they still work, perhaps even better. LLMs are already trained on images, video, sound, and nowadays even robot sensor feedback, with no fundamental changes to the architecture (see Gemini 2.5).

3. That's merely an additional input. Give it sensors, or have a human relay that data. Your toe is relaying its sensor information to your brain.