I don’t know about open models, but ElevenLabs has had this for a while: their speech-to-speech feature maps the intonation, emotion, and inflections of an input recording onto a designated TTS voice.
https://elevenlabs.io/blog/speech-to-speech