I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

heliosAtwork 25 minutes ago | parent | next [-]

It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots into people's homes.

swiftcoder 2 hours ago | parent | prev | next [-]

As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples purely analyses an existing video; the other predicts a video from a static image (i.e. video gen).

derac 2 hours ago | parent | prev | next [-]

Look at the table of supported modalities. It can take image/video/text/actions as input and produce image/video/text/actions as output.

	causal an hour ago | parent [-]

	That just raises more questions. What kind of "observation or action" does an image input generate? What is an action output if it's not text?

ainch an hour ago | parent | prev [-]

You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.