Remix clone Hacker News

	▲	spwa4 5 hours ago
		These are called "VLA" (vision-language-action) models: https://huggingface.co/models?pipeline_tag=robotics

		A VLA model essentially takes a webcam frame plus some text (think "put the red block in the right box") and outputs motor control commands to carry that instruction out.

		Note: "Gemini Robotics-ER" is not a VLA, though Gemini does have a VLA model too, "Gemini Robotics". A demo: https://www.youtube.com/watch?v=DeBLc2D6bvg
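
To make the "image + text in, motor commands out" shape concrete, here is a rough Python sketch of the interface. Everything in it is an assumption for illustration: `Observation`, `StubVLAPolicy`, `control_loop`, and the 7-number action vector are hypothetical names and sizes, not the API of any model on that list. The stub always returns a zero action, since there is no real model behind it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[Sequence[Sequence[int]]]  # H x W x 3 RGB pixels (assumed layout)


@dataclass
class Observation:
    image: Frame       # one camera frame
    instruction: str   # e.g. "put the red block in the right box"


class StubVLAPolicy:
    """Hypothetical stand-in for a VLA model; it does no inference at all."""

    def __init__(self, action_dim: int = 7) -> None:
        # 7 is an assumed size (for example 6 arm deltas plus a gripper value).
        self.action_dim = action_dim

    def predict(self, obs: Observation) -> List[float]:
        if not obs.instruction.strip():
            raise ValueError("instruction must be non-empty")
        # A real model would encode the image and text here and decode an action.
        # This stub returns all zeros, i.e. "do not move".
        return [0.0] * self.action_dim


def control_loop(
    policy: StubVLAPolicy,
    get_frame: Callable[[], Frame],
    send_action: Callable[[List[float]], None],
    instruction: str,
    steps: int,
) -> None:
    """Closed loop: grab a frame, ask the policy for an action, send it, repeat."""
    for _ in range(steps):
        obs = Observation(image=get_frame(), instruction=instruction)
        send_action(policy.predict(obs))


if __name__ == "__main__":
    sent: List[List[float]] = []
    one_black_pixel = [[[0, 0, 0]]]  # placeholder frame instead of a real camera
    control_loop(
        StubVLAPolicy(),
        lambda: one_black_pixel,
        sent.append,
        "put the red block in the right box",
        steps=3,
    )
    print(len(sent), "actions sent")
```

The point of the loop is that the text instruction stays fixed while the camera frame is refreshed on every step, so the model keeps re-deciding what to do from what it currently sees.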