avaer 7 hours ago
Gemini 3 is the only model I've found that can reason spatially, and the results here match my experiments with putting LLM NPCs in simulated worlds. I was surprised that most VLMs cannot reliably tell whether a character is facing left or right; they will confidently lie no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.

That said, Qwen3VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene and where they are (which you can project to 3D space if you are running a sim). But they are not good reasoning models, so don't ask them to think.

That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning, as sketched below. This basically turns 3D sims into 1D text sims operating on labels -- which is something LLMs are good at.
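A minimal sketch of that prepass-then-reason pipeline, under assumptions not in the original comment: `detect()` and `reason()` are hypothetical stubs (the first would wrap a grounded VLM like Qwen3VL, the second whatever text model picks actions), depth per detection comes from the sim's depth buffer, and the camera is a standard pinhole model:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # e.g. "npc_guard", "door" -- whatever the detector names it
    u: float      # pixel x of the box center
    v: float      # pixel y of the box center
    depth: float  # meters; assumed to come from the sim's depth buffer

@dataclass
class Intrinsics:
    fx: float  # focal lengths in pixels
    fy: float
    cx: float  # principal point
    cy: float

def unproject(det: Detection, K: Intrinsics) -> tuple[float, float, float]:
    """Standard pinhole unprojection: pixel center + depth -> camera-space 3D point."""
    x = (det.u - K.cx) * det.depth / K.fx
    y = (det.v - K.cy) * det.depth / K.fy
    return (x, y, det.depth)

def scene_to_text(dets: list[Detection], K: Intrinsics) -> str:
    """Collapse the 3D scene into the 1D label stream the reasoning model sees."""
    lines = []
    for d in dets:
        x, _, z = unproject(d, K)
        # x > 0 is to the camera's right under the usual convention
        lines.append(f"{d.label}: {z:.1f}m ahead, {x:+.1f}m to the side")
    return "\n".join(lines)

# Hypothetical wiring -- detect() and reason() are stand-ins, not real APIs:
# dets = detect(frame)                      # dumb grounded-VLM prepass
# scene = scene_to_text(dets, K)            # 3D sim -> 1D text sim
# action = reason(f"You are an NPC. Visible objects:\n{scene}\nWhat do you do?")
```

The point of the split is that each model only does the part it's good at: the grounded VLM never has to reason, and the reasoning LLM never sees pixels, only labels and distances.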
storystarling an hour ago | parent
I suspect the latency on Gemini 3 makes it non-viable for a real-time control loop, though. Even if the reasoning works, the input token costs would destroy the unit economics pretty quickly. I'd be worried about relying on that kind of API overhead in the critical path.
| ||||||||
Krutonium 7 hours ago | parent
Neuro-sama, the V-Tuber AI, actually does a decent job of it. Vedal seems to have cooked and figured out how to make an LLM move reasonably well in VRChat. Not perfectly -- there's a lot of abuse of gravity, or the lack thereof -- but yeah. Neuro has also piloted a robot dog in the past.