dust42 8 hours ago

Good quality but unfortunately it is single language English only.

phoronixrly 8 hours ago | parent [-]

I echo this. For a TTS system to be in any way useful outside the tiny population of the world that speaks exclusively English, it must be multilingual and dynamically switch between languages pretty much per word.

Cool tech demo though!

bingaweek 5 hours ago | parent | next [-]

This is a great illustration that nothing you ever do will be good enough without people whining.

kamranjon 8 hours ago | parent | prev | next [-]

That's a pretty crazy requirement for something to be "useful", especially something that runs so efficiently on CPU. Many content creators from non-English-speaking countries can benefit from this type of release by translating transcripts of their content to English and then running them through a model like this to dub their videos in a language that can reach many more people.

ethin 6 hours ago | parent | next [-]

Uh, no? This is not at all an absurd requirement? Screen readers literally do this all the time, with voices built the classic way of making a speech synthesizer, no AI required. eSpeak is an example, or MS OneCore. The NVDA screen reader has an option for automatic language switching, as does pretty much every other modern screen reader in existence. And absolutely none of these use AI models to do that switching, either.

kube-system 3 hours ago | parent [-]

They didn’t say it was a crazy requirement. They said it was crazy to consider it useless without meeting that requirement.

ethin 2 hours ago | parent [-]

That doesn't really change what I said though. It isn't crazy to call it useless without some form of automatic language switching (ALS) either, given that old-school synthesis has been able to do it for 20 years or so.

phoronixrly 7 hours ago | parent | prev [-]

You mean YouTubers? Who then have to (manually) synchronise the text to their video, when YouTube apparently already offers voice-to-voice translation out of the box, to my and many others' annoyance?

littlestymaar 13 minutes ago | parent [-]

YouTube's voice to voice is absolutely horrible though. Having the ability for the youtubers to clone their own voice would make it much, much more appealing.

numpad0 2 hours ago | parent | prev | next [-]

> it must be multilingual and dynamically switch between languages pretty much per word

This isn't abundantly obvious satire, so I'll interject: humans, including professional "simultaneous" interpreters, can't do this. This is not how languages work.

koakuma-chan an hour ago | parent [-]

You can speak one language, switch to another language for one word, and continue speaking in the previous language.

Levitz 7 hours ago | parent | prev | next [-]

But it wouldn't be for those who "speak exclusively English", rather, for those who speak English. Not only that but it's also common to have system language set to English, even if one's language is different.

There are about 1.5B English speakers on the planet.

phoronixrly 7 hours ago | parent [-]

Let's indeed limit the use case to the system language, let's say of a mobile phone.

You pull up a map and start navigation. All the street names are in the local language, and no, transliterating the local names to the English alphabet does not make them understandable when spoken by TTS. And not to mention localised foreign names which then are completely mangled by transliterating them to English.

You pull up a browser and open a news article in your local language to read during your commute. You now have to reach for a translation model first before passing the data to the English-only TTS software.

You're driving, one of your friends Signals you. Your phone UI is in English, you get a notification (interrupting your Spotify) saying 'Signal message', followed by 5 minutes of gibberish.

But let's say you have a TTS model that supports your local language natively. Well, because '1.5B English speakers' apparently exist on the planet, many texts in other languages include English or Latin names and words. Now you have the opposite issue -- your TTS software needs to switch to English to pronounce these correctly...

And mind you, these are just very simple use cases for TTS. If you delve into use cases for people with limited sight that experience the entire Internet, and all mobile and desktop applications (often having poor localisation) via TTS you see how mono-lingual TTS is mostly useless and would be switched for a robotic old-school TTS in a flash...

> only that but it's also common to have system language set to English

Ask a German whether their system language is English. Ask a French person. I can go on.

numpad0 an hour ago | parent [-]

If you don't speak the local language, you can't decode spoken local-language names anyway. Your speech sub-systems can't lock and sync to an audio track containing languages you don't speak, let alone transliterate or pronounce them.

Multilingual doesn't mean language agnostic. We humans are always monolingual, just multi-language hot-swappable if trained. It's more like you can make;make install docker, after which you can attach/detach into/out of alternate environments while on terminal to do things or take in/out notes.

People sometimes picture multilingualism as owning a single joined-together super-language in the brain. That usually doesn't happen. Attempting this, especially at a young age, could lead a person into a "semi-lingual" or "double-limited" state where they are not so fluent or intelligent in any particular language.

And so demanding an omnilingual TTS, or criticizing someone for not devoting significant resources to one, doesn't make much sense.

echelon 7 hours ago | parent | prev | next [-]

English has more users than all but a few products.

knowitnone3 6 hours ago | parent | prev [-]

I'm Martian so everything you create better support my language on day 1