Vision is still much weaker than text for LLMs. So you could argue we already have AGI for text but not for vision inputs, or you could argue AGI requires being human-level at text, vision, and sound.