| ▲ | sosodev 3 days ago | |
Huh, you're right. I tried your test and it clearly can't tell heteronyms apart. That seems to imply they're running some sort of speech-to-text (STT) step in front of the model, which is really weird because Qwen3-Omni claims to support direct audio input into the model. Maybe it's a cost-saving measure? | ||
| ▲ | sosodev 3 days ago | parent | next [-] | |
Weirdly, I just tried it again and it seems to understand the difference between record (the noun) and record (the verb) just fine. Perhaps when there's heavy demand for voice chat, like after a new release, they shed load by passing speech-to-text transcripts to a smaller model. However, it still doesn't seem capable of producing any of the sounds, like laughter, that I would expect from a native voice model. | ||
| ▲ | potatoman22 2 days ago | parent | prev [-] | |
To be fair, discerning heteronyms might just be a gap in its training. | ||