I'd figured it performed image recognition on the scene in front of it, then told the language model it could see various ingredients, including some combined in a bowl.
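The guessed-at architecture is a two-stage pipeline: a vision model turns the camera image into a text description, which is then spliced into the language model's prompt. A minimal sketch of that idea, with the function names and caption text entirely hypothetical stand-ins (a real system would call an actual image-recognition model and an actual LLM):

```python
def describe_scene(image) -> str:
    # Hypothetical vision stage: a real system would run an
    # image-recognition model on the camera frame. We stub its
    # output here purely for illustration.
    return "various ingredients, some of them combined in a bowl"

def build_llm_prompt(image, user_question: str) -> str:
    # The key point of the two-stage design: the language model
    # never sees pixels, only the vision stage's text description,
    # injected into its prompt as ordinary words.
    caption = describe_scene(image)
    return (
        f"The camera currently shows: {caption}.\n"
        f"User: {user_question}"
    )

prompt = build_llm_prompt(None, "What should I do next?")
print(prompt)
```

The appeal of this design is that the two models stay loosely coupled: the language model needs no vision capability at all, at the cost of losing any visual detail the caption omits.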