I often ran into a problem where multimodal models would refuse to operate in transcription mode because of some system prompt.