Cool project! That validation loss curve screams memorization of the training set without any ability to generalize.

Likely causes: too little training data, and/or data of insufficient quality. Maybe let the robot run autonomously with an (expensive) VLM operating it, to bootstrap a larger training dataset without needing to annotate it yourself.

Or maybe the problem itself is poorly specified, or intractable with your chosen network architecture. But if you see that a vision LLM can pilot the bot, at least you know you have a fighting chance.

indraneelpatil 4 hours ago

Thanks! It's probably both: too little training data and insufficient quality.

That's a cool idea. Is there any VLM you would suggest? Gemini, maybe? Or would any do?

	isoprophlex 3 hours ago
		My gut feeling says a cheap Gemini model will be fine. Try the cheapest you can find, and go more expensive if at first you don't succeed. Invest in a good prompt describing the setup, your goals, and when to move. Use typed (structured) output; don't go parsing move commands out of unstructured chat output. And maybe validate first on the data you already collected: does the VLM take the same actions as your existing training set? Then just let it run and collect data for as long as you can afford. Maybe 0.2 fps (sample and take an action every 5 seconds) is already good enough. Good luck!
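To make the "typed output, validate, then collect" advice concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the `Move` action set, the JSON reply shape `{"move": ..., "reason": ...}`, and the `query_vlm`, `get_frame`, and `execute` callables are hypothetical stand-ins for whatever VLM client and robot interface you actually use. It is not tied to any specific Gemini API.

```python
import json
import time
from dataclasses import dataclass
from enum import Enum


class Move(str, Enum):
    # Hypothetical action set; replace with the robot's real commands.
    FORWARD = "forward"
    LEFT = "left"
    RIGHT = "right"
    STOP = "stop"


@dataclass(frozen=True)
class Action:
    move: Move
    reason: str


def parse_action(raw: str) -> Action:
    """Turn the model's JSON reply into a typed Action, or raise ValueError."""
    try:
        payload = json.loads(raw)
        return Action(move=Move(payload["move"]), reason=str(payload.get("reason", "")))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unusable VLM reply: {raw!r}") from exc


def agreement_rate(frames_and_labels, query_vlm) -> float:
    """Fraction of already-labelled frames where the VLM picks the same move."""
    total = matches = 0
    for frame, label in frames_and_labels:
        total += 1
        try:
            if parse_action(query_vlm(frame)).move == label:
                matches += 1
        except ValueError:
            pass  # a malformed reply counts as a disagreement
    return matches / total if total else 0.0


def collect(get_frame, query_vlm, execute, steps, period_s=5.0, sleep=time.sleep):
    """Let the VLM drive for `steps` samples, one every `period_s` seconds (5 s = 0.2 fps)."""
    dataset = []
    for _ in range(steps):
        frame = get_frame()
        try:
            action = parse_action(query_vlm(frame))
        except ValueError:
            execute(Move.STOP)  # fail safe, and do not record a made-up label
        else:
            execute(action.move)
            dataset.append((frame, action.move))
        sleep(period_s)
    return dataset
```

A reasonable order of use would be to run `agreement_rate` on the frames you already labelled, and only start `collect` once the agreement looks acceptable for your task. If your VLM provider offers a native structured-output or schema mode, that would replace the hand-rolled JSON parsing here.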