That just raises more questions. What kind of "observation or action" does an image input generate? What is an action output if it's not text?