Removing 'um' from a recording is harder than it sounds

It’s a nice engineering approach, but I’m interested in the motivation. An um or an ah is distracting in a transcript, where you can naturally pause to take in information; in speech however it can serve as a focusing point to indicate the next part is important. See https://medium.com/better-humans/dont-worry-about-saying-um-... for example. The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

Disfluencies aren’t necessarily bad even if the word starts with “dis”!

bluebarbet 8 minutes ago

The most popular academic theory (IIRC) is that "um" and "uh" are conversational placeholders that say, "don't talk, I'm not finished speaking yet". Which obviously serves no purpose in a monologue.

To me they just indicate lack of confidence on the part of the speaker.

toast0 4 hours ago

Having heard radio interviews with and without 'internal editing' to remove ums and ahs, most of the time I'd rather the edited version. It's more concise and focused, and I find it easier to comprehend. Too many ums and ahs and my mind wanders, and if it's radio, I can't easily go back to try again. When I've listened to podcasts or audiobooks, I could never easily go back a little to try again either, and I gave up on them (even though I have some content I really want to listen to, it's too frustrating, so it's not happening). But I'm sure other people have different preferences.

I also don't care for writing that could have been made a lot more concise. It's a lot of work to make things shorter, but I think it's worthwhile.

	venzaspa an hour ago
		It just goes to show that people have very different views. I think when I hear people thinking out loud (ums and ahs) it's a marker that they are actually engaging with the question, thinking through an answer and not bullshitting without thinking.

NooneAtAll3 2 hours ago

> in speech however it can serve as a focusing point to indicate the next part is important

it's... exact opposite?

the main (attempted) use for ummms is to keep continuation of speech despite the pause. And the main complaint is exactly that it ruins the focus and doesn't give respite

amelius 2 hours ago

As with all things ... Don't be opinionated and make it an option for the user.

siriaan 5 hours ago

Occasional ums and ahs are fine but when every other phrase starts with a long aaaaah it can be pretty unpleasant to listen to.

	sans_souse 5 hours ago
		So, if this project's source Audio were Beavis and Butthead, you would be enthused?

mrob 3 hours ago

>The weirdly obsessive zeal that orgs like Toastmasters have about eliminating them is weird.

If you speak with disfluencies, you probably didn't sufficiently rehearse your speech. If you didn't rehearse enough, you probably didn't put much effort into writing it either, so why should I put much effort into listening? It's the same principle as AI slop.

	kaashif an hour ago
		Not necessarily true, more rehearsal isn't the key to fluent oratory. Many people can speak off the cuff fluently and confidently, avoiding "like", "um", and other filler words. And even if you're not speaking fluently, leaving silences as punctuation is more effective, IMO. Many impressive speakers I've met actually cite Toastmasters! So their obsessive zeal actually does work. More rehearsal does work too sometimes, but it does sometimes lead to speeches "sounding too rehearsed".

rbbydotdev 35 minutes ago

I wonder if, with enough input data and transcription, you could “fingerprint” where a given speaker habitually interjects “ums”, leading to more robust analysis. Novel approach, and it gets me thinking.

chrismorgan an hour ago

I think the “What it won’t touch” section shows why the entire concept is unsound. Here it is with a different first sentence, and (other than the third sentence no longer matching erm’s reality) it’s perfectly coherent:

> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone. Those sound like fillers but they’re doing real work in the sentence, and cutting them automatically would change what someone said. The rule erm follows: only remove things that are sound, not language.

> It also doesn’t touch repeated words, false starts, or long thinking pauses. Those aren’t noise on top of the speech; they are the speech, just messier than the speaker would like. Cleaning them up is an editorial decision about which take to keep, and erm doesn’t have an opinion about that.

Think about it. Cleaning these things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing up is an editorial decision. At the very least, you need to judge based on the surrounding content whether the removal of an um would change the meaning at all; and I don’t think text alone is adequate for that.

thaumasiotes 44 minutes ago

>> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.

Something's already gone wrong here. Uh and er refer to the same sound. Uh is the American spelling. Er is British; to them a following "r" like that is just a kind of vowel.

chrismorgan 29 minutes ago

Um… no. Quite different vowel sounds.

(Also, in case it wasn’t clear: I was quoting from the start of the article in that sentence.)

	thaumasiotes 19 minutes ago
		They're quite different vowel sounds in the same sense that "back" and "back" use "quite different vowel sounds" when pronounced by American vs British speakers. But not in any other sense.

		> in case it wasn’t clear: I was quoting from the start of the article in that sentence.

		You don't seem to be quoting from the article at all, actually. You've combined two different sentences in a way that grossly misrepresents what the article says. But that's not really relevant to the point here.

heroprotagonist 7 hours ago

Not to promote something, but Wispr Flow does that for me automatically if I enable a setting for it.

While it's a commercial product with a subscription, I spent a long time on the free tier not even hitting their limits until I started using it so extensively that I wanted to pay for it.

And I've used Whisper in the past, mostly for tinkering. I tried it for a couple of use cases but haven't touched the base project in a while. But I do regularly use Faster-Whisper-XXL, an open source project based on Whisper, for subtitle generation.

Though, for subtitle generation, I decided to support the project and mainly use the non-public build of Faster-Whisper-XXL Pro built for donators to the open source project.

The extra features smooth out the subtitle editing process very substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16 --best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3" to the cli parameters (and sometimes --realign) and you have much less work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.

	iib an hour ago
		Surprisingly, it's the whisper model itself that does that. I find that it's also good with false starts, often correcting something like: "uhm, we could...we can go there" to just "we can go there", if spoken rapidly enough.
	dotancohen 3 hours ago
		I'd love to hear more about subtitle generation. Specifically, can you label different speakers? I'd be using this for meeting transcription. Thank you.

supernes 5 hours ago

This approach seems kind of backwards to me. Why try to detect everything except the thing you're trying to remove instead of either sampling a few uhs and ums and treating them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow) or finetuning a model to detect them specifically for full automation?
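For readers wondering what "silencing with a crossfade" could look like in practice, here is a minimal sketch in plain Python. It is not taken from erm or from any tool mentioned in this thread; the function name `duck_segment` and the parameters `fade` and `floor_gain` are hypothetical, and the segment boundaries are assumed to be already known. It pulls a detected filler down to a floor gain with short linear ramps at both edges instead of cutting it out.

```python
def duck_segment(samples, start, end, fade=4, floor_gain=0.0):
    """Attenuate samples[start:end] to floor_gain, ramping at both edges.

    Hypothetical helper for illustration: `start` and `end` are sample
    indices of a filler that some other step has already located.
    """
    if not 0 <= start <= end <= len(samples):
        raise ValueError("segment must lie inside the signal")
    # A ramp can never be longer than half the segment.
    fade = min(fade, (end - start) // 2)
    out = list(samples)
    for i in range(start, end):
        # Distance (in samples) to the nearer edge of the segment.
        edge = min(i - start, end - 1 - i)
        if edge < fade:
            # Linear ramp from full gain towards the floor gain.
            ramp = (edge + 1) / (fade + 1)
            gain = 1.0 + (floor_gain - 1.0) * ramp
        else:
            gain = floor_gain
        out[i] = samples[i] * gain
    return out
```

A nonzero `floor_gain` would leave some room tone in place of the filler; ducking to true digital silence tends to be audible in its own right. Note that this keeps the original timing, so the pause stays as long as the filler was.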

monster_truck 30 minutes ago

It takes about 30 seconds in Audacity and will give an infinitely better result. Also works on any other sound

	HeavyStorm 25 minutes ago
		Doesn't sound true. Unless Audacity already has a tool for exactly this... How would you do it in 30 seconds or less?

HeavyStorm 27 minutes ago

What a very cool utility.

rindalir 8 hours ago

This is fascinating! I'm going to try this on a certain clip from Jurassic Park.

lavaman131 4 hours ago

This is great. I've tried out automated podcast editing tools before, and in my experience they cut too aggressively. What are you thinking about doing next, now that you've gotten the alignment snapping working cleanly for 'um' and 'ah'? Are you thinking of expanding the tool?

alok-g 6 hours ago

I would love to see support for videos and removal of custom filler words (I say 'basically' and 'like' a lot and have so far failed to improve myself on this).

cadamsdotcom 7 hours ago

What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.

Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!

cyberax an hour ago

BTW, any recommendations for AI tools that remove the laugh track? I don't even mind the awkward acting without the laughter.

sciencesama 7 hours ago

There is an ah-counter at Toastmasters!! This is the software that helps!!

npodbielski 5 hours ago

I think it is harder to remove those from your own speech. I have been doing that for a few months now and I still slip back into it when I am in a hurry or stressed.

cryptoz 7 hours ago

Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.

sublinear 7 hours ago

Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.

A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!

dougcalobrisi 9 hours ago

This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.

https://github.com/dougcalobrisi/erm
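
To make the "snap the cuts to silence" step concrete, here is a rough sketch in plain Python. It is an illustration only and is not taken from the erm source: the function names, the fixed RMS threshold, the 0.3 s search window and the non-overlapping framing are all assumptions. It takes a rough timestamp (such as one from Whisper), looks for the nearest low-energy frame within the window, and then derives the segments to keep, which a splicing tool such as ffmpeg could concatenate.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of each non-overlapping frame of `frame_len` samples."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def snap_to_silence(t, rms, sample_rate, frame_len,
                    search_sec=0.3, threshold=0.01):
    """Move time `t` (seconds) to the centre of the nearest quiet frame.

    Searches within +/- search_sec of t. Returns t unchanged if no frame
    in that window is below `threshold` (an assumed, fixed RMS level).
    """
    if not rms:
        return t
    centre = min(len(rms) - 1, max(0, round(t * sample_rate) // frame_len))
    radius = round(search_sec * sample_rate) // frame_len
    lo = max(0, centre - radius)
    hi = min(len(rms) - 1, centre + radius)
    quiet = [i for i in range(lo, hi + 1) if rms[i] < threshold]
    if not quiet:
        return t
    best = min(quiet, key=lambda i: abs(i - centre))
    return (best * frame_len + frame_len / 2) / sample_rate


def keep_segments(duration, cuts):
    """Turn (start, end) cut ranges into the list of ranges to keep."""
    kept, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept
```

Snapping both ends of a filler this way means each cut lands in a quiet frame, which is what avoids the chopped syllables and clicks described above. A real implementation would probably want an adaptive threshold based on the recording's noise floor rather than a fixed one.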