| ▲ | simonw 6 hours ago |
| If you want to try out the voice cloning yourself you can do that at this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a reading of that new text in your voice. I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/ |
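If you'd rather drive the Space from a script than click through the UI, the gradio_client library can call it. This is only a sketch: the endpoint name and argument order below are assumptions, so run view_api() first to see what the Space actually exposes.

    # Sketch: call the Qwen3-TTS Space programmatically via gradio_client.
    # The api_name and argument order are assumptions; client.view_api()
    # prints the real endpoints and signatures exposed by the Space.
    from gradio_client import Client, handle_file

    client = Client("Qwen/Qwen3-TTS")
    client.view_api()  # prints the actual endpoints and their parameters

    # Hypothetical voice-clone call: reference recording, its transcript,
    # then the new text to render in the cloned voice.
    result = client.predict(
        handle_file("me_reading_prompt.wav"),
        "Text that the reference recording contains.",
        "New text to be read back in the cloned voice.",
        api_name="/voice_clone",  # assumed endpoint name
    )
    print(result)  # path to the generated audio file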
|
| ▲ | KolmogorovComp 3 minutes ago | parent | next [-] |
| Hello, the recording you posted doesn't tell us much about the cloning capability without a sample of your real voice to compare against. |
|
| ▲ | javier123454321 5 hours ago | parent | prev | next [-] |
| This is terrifying. With this and z-image-turbo, we've crossed a chasm. And a very deep one. We are currently protected by screens: we can, and should, assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this, and not enough people know about it. |
| |
| ▲ | rdtsc 5 hours ago | parent | next [-] | | That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability, since anything digital can be brushed off as “that’s not evidence, it could be AI generated”. | | |
| ▲ | rpdillon 40 minutes ago | parent | next [-] | | Only if you focus on the form instead of the content. For a long time my family has had secret words and phrases we use to identify ourselves to each other over secure, but unauthenticated, channels (i.e. the channel is encrypted, but the source is unknown). The military has had to deal with this for some time, and developed various forms of IFF that allies could use to identify themselves. E.g. for returning aircraft, a sequence of wing movements that identified you as a friend. I think for a small group (in this case, loved ones), this could be one mitigation of that risk. My parents did this with me as a kid, ostensibly as a defense against some other adult saying "My mom sent me to pick you up...". I never did hear of that happening, though. | |
| ▲ | neevans 5 hours ago | parent | prev [-] | | This was already possible with Chatterbox for a long while. | | |
| ▲ | freedomben 4 hours ago | parent | next [-] | | Yep, this has been the reality for years now. Scammers have already had access to it. I remember an article from years ago about a grandma who wired her life savings to a scammer who claimed to have her granddaughter held hostage in a foreign country. It turned out they had cloned the granddaughter's voice from Facebook data and knew her schedule, so they timed the call for when she would be unreachable by phone. | |
| ▲ | DANmode 4 hours ago | parent | prev [-] | | or anyone who refuses to use hearing aids. |
|
| |
| ▲ | u8080 an hour ago | parent | prev | next [-] | | https://www.youtube.com/watch?v=diboERFAjkE pretty much this | | | |
| ▲ | fridder 2 hours ago | parent | prev | next [-] | | Admittedly I have not dug into it much, but I wonder if we might finally have a use case for NFTs and web3? We need some sort of way to denote that items are human-generated, not AI. It would certainly be easier than trying to determine whether something is AI generated | | |
| ▲ | simonw an hour ago | parent | next [-] | | How would NFTs/web3 help differentiate between something created by a human and something that a human created with AI and then tagged with their signature using those tools? | |
| ▲ | grumbel 2 hours ago | parent | prev [-] | | That's the idea behind C2PA[1]: your camera and your editing tools put a signature on the media to prove its provenance. That doesn't make manipulation impossible (e.g. you could photograph an AI image displayed on a screen), but it does give you a trail of where a photo came from, and thus an easier way to filter it or look up the original. [1] https://c2pa.org/ |
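As a rough illustration of the provenance idea (this is not the actual C2PA manifest format, which also carries edit history and certificate chains; it's just the bare sign/verify step, using Python's cryptography library):

    # Toy provenance sketch: sign media bytes at capture time, verify later.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # The "camera" holds a private key and signs the bytes it captures.
    camera_key = Ed25519PrivateKey.generate()
    image_bytes = b"raw sensor bytes of the photo"  # stand-in for a real JPEG
    signature = camera_key.sign(image_bytes)

    # Anyone holding the camera's public key can later check the file is untouched.
    public_key = camera_key.public_key()
    try:
        public_key.verify(signature, image_bytes)
        print("Signature valid: bytes match what the camera signed.")
    except InvalidSignature:
        print("Signature invalid: the file was altered or signed by another key.")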
| |
| ▲ | oceanplexian 3 hours ago | parent | prev | next [-] | | > This is terrifying. Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact, it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, rather than the scraps we've seen from Google, Meta, OpenAI, etc. | | |
| ▲ | javier123454321 2 hours ago | parent [-] | | I do strongly agree. Though the societal impact is only mitigated by open models, not curtailed at all. |
| |
| ▲ | echelon 4 hours ago | parent | prev [-] | | We're going to be okay. There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music. Nothing was more scary than the invention of the nuclear weapon. And we're all still here. Life will go on. And there will be incredible benefits that come out of this. | | |
| ▲ | javier123454321 3 hours ago | parent | next [-] | | I'm not denigrating the tech; all I'm saying is that we've crossed into new territory and there will be consequences from this that we don't understand, the same way that social media has been particularly detrimental to young people (especially women) in a way we were not ready for. This __smells__ like it could be worse, alongside (or regardless of) the benefits of both. I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now). | |
| ▲ | supern0va 4 hours ago | parent | prev | next [-] | | We'll be okay eventually, when society adapts to this and becomes fully aware of the capabilities and the use cases for abuse. But, that may take some time. The parent is right to be concerned about the interim, at the very least. That said, I am likewise looking forward to the cool things to come out of this. | |
| ▲ | DANmode 4 hours ago | parent | prev [-] | | > People that couldn't sing will make music. I was with you until that one. But, yeah. Life will go on. | | |
| ▲ | echelon 4 hours ago | parent [-] | | There are plenty of electronic artists who can't sing. Right now they have to hire someone else to do the singing for them, but I'd wager a lot of them would like to own their music end-to-end. I would. I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control. | | |
| ▲ | javier123454321 3 hours ago | parent | next [-] | | Yes, the flipside of this is that we're eroding the last bit of ability for people to make a living through their art. We are capturing the markets that let people live off of making illustrations, background music, jingles, promotional videos, photographs, and graphic design, and funnelling those earnings to NVIDIA. The question I keep asking is whether we, as a society, care about people being able to make a living through their art. I think there is a reason to care. It's not so much an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work. | | |
| ▲ | volkercraig 2 hours ago | parent [-] | | This feels like one of those tropes that keeps showing up whenever new tech comes out. At the advent of recorded music, I'm sure buskers and performers were complaining that live music was dead forever. Stage actors were probably complaining that film killed plays. Heck, I bet someone even complained that video itself killed the radio star. Yet here we are, hundreds of years later: live music is still desirable, plays still happen, and faceless voices are still around, they're just called v-tubers and podcasters. | | |
| ▲ | javier123454321 2 hours ago | parent [-] | | Umm, I don't know if you've seen the current state of trying to make a living with music, but it's widely accepted as dire. Touring is a loss leader, putting out music for free doesn't pay, per-stream payouts are abysmally low, and no one buys songs. All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists. > Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around... Yes, all those things still happen, but it's increasingly untenable to make a living through them. |
|
| |
| ▲ | DANmode an hour ago | parent | prev [-] | | What happens to lyricless electronica if suddenly every electronic artist has quality vocal-backing? Oh no. Maybe we did frig this up. |
|
|
|
|
|
| ▲ | magicalhippo 4 hours ago | parent | prev | next [-] |
| The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text. I presume this is due to using the base model rather than the one tuned for more expressiveness. edit: Or, more likely, the demo not exposing the expressiveness controls. The 1.7B model was much better at ignoring slight background noise in the reference audio than the 0.6B model, though. The 0.6B would inject some of that noise into the generated audio, whereas the 1.7B model would not. Also, without FlashAttention it was dog slow on my 5090, running at 0.3x realtime with just 30% GPU usage, though I guess that's to be expected. There was no significant difference in generation speed between the two models. Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but I've tried a fair number, and this is certainly one of the better ones I've heard in terms of voice cloning quality. |
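For anyone wanting to reproduce a local run: one straightforward route is to pull the Space repository down with huggingface_hub and run its Gradio script. This is a sketch, not necessarily how the parent did it; the file names below are just the usual Space conventions and may differ here.

    # Sketch: download the demo Space and run it locally.
    # Assumes the Space follows the usual Gradio layout (requirements.txt, app.py);
    # check the downloaded files for the actual entry point.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="Qwen/Qwen3-TTS",
        repo_type="space",
        local_dir="qwen3-tts-demo",
    )
    print(f"Space files downloaded to {local_dir}")
    # Then, inside that directory:
    #   pip install -r requirements.txt   (plus flash-attn if you want speed)
    #   python app.py                     (assumed entry point; may differ)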
| |
| ▲ | dsrtslnd23 7 minutes ago | parent | next [-] | | Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices. | |
| ▲ | thedangler 2 hours ago | parent | prev [-] | | How did you do this locally?
Tools? Language? |
|
|
| ▲ | pseudosavant 4 hours ago | parent | prev | next [-] |
| Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will range from good to bad and everywhere in between: a deceased grandmother reading "Goodnight Moon" to grandkids, scamming people, the ability to create podcasts in your own voice from just prompts. |
|
| ▲ | kingstnap 2 hours ago | parent | prev | next [-] |
| It was fun to try out. I wonder if, given a few minutes of me talking, I could at some point have it read an entire book back to me in my own voice. |
|
| ▲ | mohsen1 4 hours ago | parent | prev [-] |
| > The requested GPU duration (180s) is larger than the maximum allowed What am I doing wrong? |
| |