spwa4 5 hours ago
They're called "VLA" (vision-language-action) models: https://huggingface.co/models?pipeline_tag=robotics

VLA models essentially take a camera image plus some text (think "put the red block in the right box") and output motor control commands to carry out that instruction. Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too: "Gemini Robotics".
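
To make the "image + text in, motor commands out" shape concrete, here is a rough Python sketch of that interface. Everything in it is an assumption made up for illustration: `Observation`, `Action`, `VLAPolicy`, `DummyPolicy` and `run_episode` are hypothetical names, not the API of any real VLA model or library, and the dummy policy just returns a fixed "do nothing" action where a real model would run inference.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Observation:
    image: bytes       # one camera frame (raw bytes here; real models take pixel arrays)
    instruction: str   # natural-language task description


@dataclass
class Action:
    joint_deltas: List[float]  # hypothetical: per-joint position changes
    gripper: float             # hypothetical: 0.0 = open, 1.0 = closed


class VLAPolicy(Protocol):
    """Anything that maps (image, text) to a motor command."""

    def act(self, obs: Observation) -> Action: ...


class DummyPolicy:
    """Stand-in for a real model: ignores its input and returns a fixed action."""

    def __init__(self, num_joints: int = 6) -> None:
        self.num_joints = num_joints

    def act(self, obs: Observation) -> Action:
        return Action(joint_deltas=[0.0] * self.num_joints, gripper=0.0)


def run_episode(policy: VLAPolicy, frames: List[bytes], instruction: str) -> List[Action]:
    """Query the policy once per camera frame, keeping the instruction fixed."""
    return [policy.act(Observation(image=frame, instruction=instruction)) for frame in frames]
```

On a real robot the loop would grab a fresh frame after each action is executed instead of iterating over a prepared list, but the input/output shape is the same.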