> The model needs to "understand" geometry and physics to output a video.
No, it doesn't. It merely needs to mimic.