derac 2 hours ago
Look at the table of supported modalities. It can take image/video/text/actions as input and produce image/video/text/actions as output.
causal an hour ago
That just raises more questions. What kind of "observation or action" image does input generate? What is an action output if it's not text?