LLMs are trained on text. Why would we expect them to understand a visual and tactile 3D world?
Because they’re no longer text-only: many are multimodal vision-language models (VLMs), trained on images as well as text.