tim333 7 days ago
There are similarities with that one. From their website:

> It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model.

My point is more that people can try different models and algorithms rather than having to stick to LLMs.