porphyra | 2 days ago
It seems that end-to-end neural networks for robotics are really taking off. Can someone point me towards where to learn about these, and what the state-of-the-art architectures look like? Do they just convert the video into a stream of tokens, run it through a transformer, and output a stream of tokens?
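For what it's worth, the pattern the question describes (video patches in, a transformer in the middle, discretized action tokens out) is roughly the shape of recent vision-language-action models. A minimal sketch of that pattern, where every name and dimension is illustrative and not any particular lab's actual model:

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy version of the 'video tokens -> transformer -> action tokens' idea."""
    def __init__(self, patch_dim=768, n_action_bins=256, n_action_dims=7):
        super().__init__()
        # Project flattened 16x16 RGB image patches into the transformer width.
        self.patch_proj = nn.Linear(16 * 16 * 3, patch_dim)
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=6)
        # Each continuous action value (e.g. a joint delta) is discretized
        # into n_action_bins buckets, so actions become tokens too.
        self.action_head = nn.Linear(patch_dim, n_action_dims * n_action_bins)
        self.n_action_dims, self.n_action_bins = n_action_dims, n_action_bins

    def forward(self, patches):
        # patches: (batch, n_patches, 16*16*3) flattened video patches
        x = self.transformer(self.patch_proj(patches))
        # Pool over tokens, then predict one bin per action dimension.
        logits = self.action_head(x.mean(dim=1))
        return logits.view(-1, self.n_action_dims, self.n_action_bins)
```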
vessenes | 2 days ago | parent
I was reading their site, and I too have some questions about this architecture. I'd be very interested to see what the output of their 'big model' is that feeds into the small model. I presume the small model gets a bunch of environmental input plus some input from the big model, and we know the big model's output only updates every 30 or 40 small-model frames. Like, do they just have the big model output arbitrary control tokens, embed those in the small model, and let gradient descent find a good control 'language' (roughly the setup sketched below)? Do they train the small model on English tokens and have the big model output those? Custom coordinate tokens? (Probably.) Lots of interesting possibilities here.

By the way, the dataset they describe was generated by a large (much larger, presumably) vision model tasked with writing task descriptions from successful videos. So the pipeline, sketched in the second snippet below, is:

* Video of the robot doing something
* (o1 or some other high-end model) "describe very precisely the task the robot was given"
* o1 output -> 7B model -> small model -> loss
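Here's a hedged sketch of the two-rate cascade being speculated about: a big model emits a latent "plan" vector every N control steps, and a small fast policy consumes that latent plus fresh observations on every step. If the whole stack is trained end to end, the latent interface is exactly the learned control 'language' wondered about above. All names and sizes below are invented:

```python
import torch
import torch.nn as nn

BIG_EVERY = 30   # big model refreshes its latent every ~30 small-model steps

class BigPlanner(nn.Module):
    def __init__(self, obs_dim=1024, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.GELU(),
                                 nn.Linear(512, latent_dim))
    def forward(self, obs):          # slow, heavy model (stand-in)
        return self.net(obs)

class SmallPolicy(nn.Module):
    def __init__(self, obs_dim=128, latent_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + latent_dim, 256),
                                 nn.GELU(), nn.Linear(256, act_dim))
    def forward(self, obs, latent):  # fast model, runs every control tick
        return self.net(torch.cat([obs, latent], dim=-1))

big, small = BigPlanner(), SmallPolicy()
latent = torch.zeros(1, 64)
for t in range(90):
    rich_obs = torch.randn(1, 1024)   # e.g. full camera features
    fast_obs = torch.randn(1, 128)    # e.g. proprioception
    if t % BIG_EVERY == 0:
        latent = big(rich_obs)        # refresh the plan at the slow rate
    action = small(fast_obs, latent)  # act at the fast rate
```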
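And a loose sketch of the data pipeline as described: a frontier vision model captions successful robot videos after the fact, a 7B model embeds the caption, and the small policy is trained against the actions actually recorded (behavior cloning). Everything below is a stubbed stand-in just to show the shape of the loss chain:

```python
import torch
import torch.nn as nn

text_encoder = nn.EmbeddingBag(10_000, 64)  # stand-in for the 7B model
policy = nn.Linear(64 + 128, 7)             # stand-in small model

def caption_episode(video_features):
    # Stand-in for "o1 or some other high end model" describing the task;
    # in reality this would return text, here just fake token ids.
    return torch.randint(0, 10_000, (1, 12))

video_feats = torch.randn(1, 128)            # per-step visual features
recorded_action = torch.randn(1, 7)          # what the robot actually did

tokens = caption_episode(video_feats)                  # video -> task description
task_emb = text_encoder(tokens)                        # description -> embedding
pred = policy(torch.cat([task_emb, video_feats], -1))  # embedding + obs -> action
loss = nn.functional.mse_loss(pred, recorded_action)   # behavior-cloning loss
loss.backward()
```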