Now I'm wondering whether it would make any difference if this were asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.