WithinReason 6 days ago
> We are provided visual cues that encode depth. The ideal world model would at least have this.

None of these world models have explicit concepts of depth or 3D structure, and adding it would go against the principle of the Bitter Lesson. Even with 2 stereo captures there is no explicit 3D structure.
soulofmischief 6 days ago
Increasing the fidelity and richness of training data does not go against the Bitter Lesson. The model can learn a 3D representation on its own from stereo captures, but stereo captures still give it richer, more connected data to learn from than monocular captures do. This is unarguable. By restricting it to monocular images, you're needlessly making things harder: you force the model to also learn to estimate depth from a single view, and you rob it of a channel for error correction in the case of faulty real-world data.
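To make the "richer data" point concrete, here is a minimal sketch of the depth signal that a stereo pair carries implicitly. It assumes an idealized, already rectified pinhole camera pair; the function name `depth_from_disparity` and all the numbers are hypothetical and chosen only for illustration. Nothing here would be handed to the model explicitly. It only shows that the horizontal pixel shift between the two views determines metric depth, which a single monocular frame does not provide.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (ideal pinhole model).

    focal_px     -- focal length in pixels (assumed identical for both cameras)
    baseline_m   -- distance between the two camera centers, in meters
    disparity_px -- horizontal shift of the same point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 800 px focal length, 12.5 cm baseline.
focal_px, baseline_m = 800.0, 0.125

near = depth_from_disparity(focal_px, baseline_m, 50.0)  # large shift, close point
far = depth_from_disparity(focal_px, baseline_m, 10.0)   # small shift, distant point
print(near, far)
```

The same relation also suggests where the error-correction channel comes from: a depth estimate from monocular cues that disagrees with the observed disparity can be flagged as inconsistent.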