> We are provided visual cues that encode depth. The ideal world model would at least have this.

None of these world models have explicit concepts of depth or 3D structure, and adding it would go against the principle of the Bitter Lesson. Even with 2 stereo captures there is no explicit 3D structure.

▲

soulofmischief 6 days ago | parent [-]

Increasing the fidelity and richness of training data does not go against the bitter lesson.

The model can learn 3D representation on its own from stereo captures, but there is still richer, more connected data to learn from with stereo captures vs monocular captures. This is unarguable.

You're needlessly making things harder by forcing the model to also learn to estimate depth from monocular images, and robbing it of a channel for error-correction in the case of faulty real-world data.

▲

WithinReason 6 days ago | parent [-]

Stereo images have no explicit 3D information and are just 2D sensor data. But even if you wanted to use stereo data, you would restrict yourself to stereo datasets and wouldn't be able to use 99.9% of video data out there to train on which wasn't captured in stereo, that's the part that's against the Bitter Lesson.

	▲	soulofmischief 6 days ago \| parent \| next [-]
		You don't have to restrict yourself to that, you can create synthetic data or just train on both kinds of data. I still don't understand what the bitter lesson has to do with this. First of all, it's only a piece of writing, not dogma, and second of all it concerns itself with algorithms and model structure itself, increasing the amount of data available to train on does not conflict with it.
	▲	6 days ago \| parent \| prev [-]
		[deleted]