londons_explore a day ago

Does this have the ability to edit historic words as more info becomes available?

E.g. if I say "I scream", it sounds phonetically identical to "Ice cream".

Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

Doing this seems necessary to get both low latency and high accuracy. Things like Android's transcription do it, and you can see the guesses adjust as you talk.

yvdriess a day ago | parent | next [-]

A good opportunity to point people to the paper with my favorite title of all time:

"How to wreck a nice beach you sing calm incense"

https://dl.acm.org/doi/10.1145/1040830.1040898

abound 21 hours ago | parent | next [-]

For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"

strken 20 hours ago | parent | next [-]

Thank you! "Calm incense" makes very little sense when said in an accent where calm isn't pronounced like com.

solardev 12 hours ago | parent [-]

How is calm pronounced in those accents?

strken 11 hours ago | parent | next [-]

In Australian English, calm rhymes with farm and uses a long vowel, while com uses a short vowel and would rhyme with prom. (I know this doesn't help much because some American accents also rhyme prom with farm).

Consider the way "Commonwealth Bank" is pronounced in this news story: https://youtube.com/watch?v=MhkuHGRAAbg. An Australian English speaker would consider (most) Americans to be saying something like "Carmenwealth" rather than "Commonwealth". See also the pronunciation of dog vs father in https://www.goalsenglish.com/lessons/2020/5/4/australian-eng....

It really ruins some poetry.

drited 11 hours ago | parent | prev | next [-]

Cahm

solardev 11 hours ago | parent [-]

Like the "cam" in "camera"?

yokljo 11 hours ago | parent [-]

I've been thinking about this for a minute, and I think if an American were to say "why", and take only the most open vowel sound from that word and put it between "k" and "m", you get a pretty decent Australian pronunciation. I am an Australian so I could be entirely wrong about how one pronounces "why".

Macha 8 hours ago | parent | prev [-]

call-mm

wdaher 17 hours ago | parent | prev | next [-]

This is the correct parsing of it. (I can't take credit for coming up with the title, but I worked on the project.)

codedokode 18 hours ago | parent | prev | next [-]

I only got the "How to recognize" part. Also I think "using" should sound more like "you zinc" than "you sing".

efilife 20 hours ago | parent | prev | next [-]

Thanks. Now I know that I'm not that stupid and this actually makes no sense

chipsrafferty 19 hours ago | parent [-]

It actually does make sense. Not saying you're stupid, but in standard English, if you say it quickly, the two sentences are nearly identical.

mjw_byrne 19 hours ago | parent | next [-]

They're pretty different in British English; I struggled to figure it out until I started thinking about how it would sound with an American accent.

codedokode 18 hours ago | parent | prev [-]

But in "you sing", "s" is pronounced as "s", not as "z" from "using", right?

squeaky-clean 13 hours ago | parent [-]

I pronounce "using" with an S unless I'm saying it very slowly.

fiatjaf 21 hours ago | parent | prev [-]

Thank you very much!

fmx 21 hours ago | parent | prev | next [-]

The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...

(Agree that the title is awesome, by the way!)

xyse53 18 hours ago | parent | prev | next [-]

My favorite is:

"Threesomes, with and without blame"

https://dl.acm.org/doi/10.1145/1570506.1570511

(From a professor I worked with a bit in grad school)

ThinkingGuy 16 hours ago | parent | prev | next [-]

Also relevant: The Two Ronnies - "Four Candles"

https://www.youtube.com/watch?v=gi_6SaqVQSw

brcmthrowaway 19 hours ago | parent | prev [-]

Does AI voice recognition still use Markov models for this?

sva_ 18 hours ago | parent [-]

Whisper uses an encoder-decoder transformer.

Fluorescence 21 hours ago | parent | prev | next [-]

It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.

Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon with their abstractions, with their claims of a beauty like music... hmm!

0cf8612b2e1e 19 hours ago | parent | next [-]

The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.

smallpipe 19 hours ago | parent [-]

A good subtitle isn't a perfect copy of what was said.

kstrauser 17 hours ago | parent | next [-]

Hard disagree. When I'm reading a transcript, I want word-for-word what the people said, not a creative edit. I want the speakers' voice, not the transcriptionist's.

And when I'm watching subtitles in my own language (say because I want the volume low so I'm not disturbing others), I hate when the words I see don't match the words I hear. It's the quickest way I can imagine to get sucked out of the content and into awareness of the delivery of the content.

crazygringo 16 hours ago | parent | next [-]

I mean, subtitles are mostly the same.

Sometimes they're edited down simply for space, because there wouldn't be time to easily read all the dialog otherwise. And sometimes repetition of words or phrases is removed, because it's clearer, and the emphasis is obvious from watching the moving image. And filler words like "uh" or "um" generally aren't included unless they were in the original script.

Most interestingly, swearing is sometimes toned down, just by skipping it -- removing an f-word in a sentence or similar. Not out of any kind of puritanism, but because swear words genuinely come across as more powerful in print than they do in speech. What sounds right when spoken can sometimes look like too much in print.

Subtitles are an art. Determining when to best time them, how to split up long sentences, how to handle different speakers, how to handle repetition, how to handle limited space. I used to want subtitles that were perfectly faithful to what was spoken. Then I actually got involved in making subtitles at one point, and was very surprised to discover that perfectly faithful subtitles didn't actually do the best job of communicating meaning.

Fictional subtitles aren't court transcripts. They serve the purpose of storytelling, which is the combination of a visible moving image full of emotion and action, and the subtitles. Their interplay is complex.

nomdep 7 hours ago | parent [-]

Hard and vehemently disagree. Subtitles are not commentary tracks.

The artists are the writers, voice actors, and everyone else involved in creating the original media. Never, ever should a random stranger contaminate it with his/her opinions or points of view.

Subtitles should be perfect transcriptions or the most accurate translations, never reinterpretations.

creesch 16 hours ago | parent | prev | next [-]

> When I'm reading a transcript

That's the thing though, subtitles aren't intended as full transcripts. They are intended to allow a wide variety of people to follow the content.

A lot of people read slower than they would hear speech. So subtitles often need to condense or rephrase speech to keep pace with the video. The goal is usually to convey meaning clearly within the time available on screen. Not to capture every single word.

If they tried to be fully verbatim, you'd either have subtitles disappearing before most viewers could finish reading them or large blocks of text covering the screen. Subtitlers also have to account for things like overlapping dialogue, filler words, and false starts, which can make exact transcriptions harder to read and more distracting in a visual medium.

I mean, yeah, in your own native language I agree it sort of sucks if you can still hear the spoken words as well. But, to be frank, you are also in the minority here as far as subtitle target audiences go.

And to be honest, if they were fully verbatim, I'd wager you'd quickly be annoyed as well, simply because you'd notice how much attention they draw, making you less able to actually view the content.

iczero 15 hours ago | parent [-]

I regularly enable YouTube subtitles. Almost always, they are a 100% verbatim transcription, excluding errors from auto-transcription. I am not annoyed in the slightest, and in fact I very much prefer that they are verbatim.

If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

ben_w 26 minutes ago | parent | next [-]

> If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

And what are deaf people supposed to do in a cinema, or with broadcast TV?

(And I'm ignoring other uses, e.g. learning a foreign language; for that, sometimes you want the exact words, sometimes the gist, but it's highly situational; but even once you've learned the language itself, regional accents even without vocabulary changes can be tough).

creesch 13 hours ago | parent | prev [-]

> If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

That's just tone deaf, plain and simple. I was not talking about myself, or just YouTube. You are not everyone else; your use case is not everyone else's use case. It really isn't that difficult.

stavros 16 hours ago | parent | prev [-]

But then what about deliberate mishearings and ambiguous speech, like the GP said?

numpad0 15 hours ago | parent | prev | next [-]

Aren't same-language subtitles supposed to be perfect literal transcripts, while cross-language subtitles are supposed to be compressed creative interpretations?

herbcso 18 hours ago | parent | prev [-]

Tom Scott would agree with you. https://m.youtube.com/watch?v=pU9sHwNKc2c

dylan604 19 hours ago | parent | prev [-]

I had similar thoughts when reading Huck Finn. It's not just phonetically spelled; it's much different. Almost like Twain came up with a list of words and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point you just get good at bad spelling?

spauldo 16 hours ago | parent [-]

Writing in the vernacular, I believe it's called. I do something like that if I'm texting.

The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.

dylan604 16 hours ago | parent [-]

Except it forces me to slow down to decipher the text and makes the reading labored. I understand the point, as it is part of the character, but it is easier to understand someone speaking in that vernacular than to read the forced misspellings. I definitely don't want to get to the point of being good at reading it, though. I wonder if this is how second grade teachers feel reading the class's schoolwork?

spauldo 14 hours ago | parent [-]

That's true. I'm sure Twain and Banks were aware of this, though. Apparently they considered the immersion to be worth a little extra work on the part of the reader. Whether the reader agrees is a different story.

I try to limit my use of it to just enough for my accent and way of talking to bleed through. I don't go for full-on phonetics, but I'm often "droppin' my g's and usin' lotsa regional sayin's." It probably helps that the people I text have the same accent I do, though.

ph4evers a day ago | parent | prev | next [-]

Whisper works on 30-second chunks. So yes, it can do that, and that's also why it can hallucinate quite a bit.
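
A minimal sketch of that fixed 30-second window with the openai-whisper package ("speech.wav" is a placeholder):

    import whisper

    model = whisper.load_model("base")
    audio = whisper.load_audio("speech.wav")    # 16 kHz mono samples
    audio = whisper.pad_or_trim(audio)          # pad/trim to exactly 30 s
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)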

jeroenhd a day ago | parent | next [-]

The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

    queue
    
         The maximum size that will be queued into the filter before
         processing the audio with whisper. Using a small value the audio
         stream will be processed more often, but the transcription quality
         will be lower and the required processing power will be higher.
         Using a large value (e.g. 10-20s) will produce more accurate results
         using less CPU (as using the whisper-cli tool), but the transcription
         latency will be higher, thus not useful to process real-time streams.
         Consider using the vad_model option associated with a large queue
         value. Default value: "3"
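
For instance, a sketch of invoking the filter (option names taken from the filter documentation; the model path is a placeholder):

    ffmpeg -i input.wav \
      -af "whisper=model=ggml-base.en.bin:language=en:queue=10:destination=out.srt:format=srt" \
      -f null -
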
londons_explore a day ago | parent [-]

So if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.
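
As an illustrative toy (not any particular service's implementation), the commit rule might look like:

    def commit_stable_prefix(hypotheses):
        """Return the prefix of words on which all N hypotheses agree;
        only those words are 'fixed' in the output so far."""
        stable = []
        for words in zip(*hypotheses):
            if all(w == words[0] for w in words):
                stable.append(words[0])
            else:
                break
        return stable

    hyps = [
        "I scream is the best dessert".split(),
        "Ice cream is the best dessert".split(),
        "Ice cream is the best desert".split(),
    ]
    print(commit_stable_prefix(hyps))  # [] -- the ambiguity keeps everything editable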

miki123211 a day ago | parent | next [-]

The right way to do this would be to use longer, overlapping chunks.

E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).

This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
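
A rough sketch of that sliding window (transcribe_fn stands in for any chunk transcriber):

    SAMPLE_RATE = 16000
    STEP_S, WINDOW_S = 3, 15

    def live_transcribe(chunks, transcribe_fn):
        """Re-transcribe the most recent WINDOW_S seconds every STEP_S
        seconds, so earlier words can still be revised by later context."""
        buffer = []
        for chunk in chunks:                       # each chunk is STEP_S seconds of samples
            buffer.extend(chunk)
            window = buffer[-WINDOW_S * SAMPLE_RATE:]
            yield transcribe_fn(window)            # the latest guess supersedes the old one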

superluserdo a day ago | parent | next [-]

I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription.

https://tomwh.uk/git/whisper-chunk.git/

I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.

dylan604 19 hours ago | parent | prev [-]

If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get real-time being the ultimate goal, but we're not there yet. So, working within the current limitations: is piss-poor transcription in real time really more desirable than better transcription with a 2-3 second delay?

jeroenhd 16 hours ago | parent | prev | next [-]

I don't know of an LLM that does context-based rewriting of interpreted text.

That said, I haven't run into the ice cream problem with Whisper. Plenty of other systems fail, but Whisper just seems to get lucky and guess the right words more than anything else.

The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.

llarsson a day ago | parent | prev | next [-]

Attention is all you need, as the transformative paper (pun definitely intended) put it.

Unfortunately, you're only getting attention in 3 second chunks.

abdullahkhalids 17 hours ago | parent | prev | next [-]

Which other streaming transcription services are you referring to?

londons_explore 13 hours ago | parent [-]

Google's speech-to-text API: https://cloud.google.com/speech-to-text/docs/speech-to-text-...

The "alternatives" and "confidence" field is the result of the N-best decodings described elsewhere in the thread.

no_wizard 21 hours ago | parent | prev [-]

That’s because, at the end of the day, this technology doesn’t “think”. It simply holds context until the next thing, without regard for the previous information.

anonymousiam a day ago | parent | prev | next [-]

Whisper is excellent, but not perfect.

I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

JohnKemeny 21 hours ago | parent | next [-]

Whisper supports adding a context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would probably transcribe more correctly.
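
With the openai-whisper package, that hint goes in via initial_prompt (the file name is a placeholder):

    import whisper

    model = whisper.load_model("base")
    result = model.transcribe(
        "call.wav",
        initial_prompt="Transcript of a phone call with Gem.",
    )
    print(result["text"])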

ctxc 20 hours ago | parent [-]

Thanks John Key Many!

t-3 19 hours ago | parent | prev [-]

That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.

anonymousiam 16 hours ago | parent [-]

When she told me her name, I didn't ask her to repeat it, and I got it right through the rest of the call. Whisper didn't, so how is this "at least as good as a human"?

t-3 16 hours ago | parent [-]

I wouldn't expect any transcriber to know that the correct spelling in your case used a G rather than a J - the J is far more common in my experience. "Jim" would be an aberration that could be improved, but substituting "Jem" for "Gem" without any context to suggest the latter would be just fine IMO.

0points a day ago | parent | prev [-]

So, yes, and also no.

lgessler a day ago | parent | prev | next [-]

I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".
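
A toy illustration of that LM guidance (shallow fusion with made-up scores): the two candidates are acoustically identical, so the language model breaks the tie.

    import math

    LM_WEIGHT = 0.5
    candidates = {
        "I like ice cream": {"acoustic": math.log(0.5), "lm": math.log(0.02)},
        "I like I scream":  {"acoustic": math.log(0.5), "lm": math.log(0.0001)},
    }
    best = max(candidates, key=lambda s: candidates[s]["acoustic"]
               + LM_WEIGHT * candidates[s]["lm"])
    print(best)  # "I like ice cream"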

shaunpud a day ago | parent | prev | next [-]

I Scream in the Sun https://carmageddon.fandom.com/wiki/I_Scream_in_the_Sun

DiogenesKynikos a day ago | parent | prev | next [-]

This is what your brain does when it processes language.

I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

mockingloris a day ago | parent [-]

A slight segue to this; I was made aware of the phenomenon that the language you think in sets the constraints on how expansively your brain can think and parse information.

I think in English, fortunately, and it's an ever-evolving language, expanding as the world does. That is compared to the majority of people where I'm from; English was a second language they had to learn, and the people that taught them weren't well equipped with the resources to do a good job.

└── Dey well; Be well

cyphar 20 hours ago | parent [-]

This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.

A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.

Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to express what they are thinking in words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to think about it (at least, since I was a teenager).

(This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)

sigbottle 15 hours ago | parent | next [-]

I think it's more like, you have a thought X, that has so many dimensions to it, but the way you serialize it to something that's actually discussable and comparable to other thoughts is language. And sometimes that language naturally loves slicing one part of that thought one way or the other.

(then there's also a feedback loop type of argument, that always happens when discussing any sort of perception-reality distinction, but let's ignore that for now)

At least for me, my brain is so bad and it's hard for me to truly hold a single thought in my head for a long time. Maybe it eventually settles into my subconscious but I don't really have a way to verify that.

numpad0 17 hours ago | parent | prev | next [-]

> if you truly did think in the language you speak, how could this situation happen?

As far as how it happens to me is concerned, either something closer to speech than raw thoughts reports back that the data in shared memory is invalid for the selected language, or I find there's no text representation for what I am trying to say.

The "raw" thoughts work with the currently active language, for me, so at least for me, I just know the strong Sapir-Whorf hypothesis is not even a hypothesis, but a reasonable verbalization closely matching my own observations.

I don't get why people can't accept it, even in the age of LLMs. It is what it is, and that old guy is just never correct, even once.

codedokode 18 hours ago | parent | prev [-]

My experience is that sometimes, for example when I watch a lecture in a foreign language, there are terms for which I don't know the correct translation, so I cannot think about or mention them in my native language, even though I understand what they mean.

cyphar 2 hours ago | parent [-]

I was more focused on the experience of monolinguals (where this kind of explanation is impossible), but yes I also experience this fairly often as someone who speaks more than one language.

ec109685 11 hours ago | parent | prev | next [-]

The "I" is emphasized more in "I scream" than in "ice cream", I think.

But it’s a great point that you need context to be sure.

didacusc a day ago | parent | prev [-]

what would it make of this? https://www.youtube.com/watch?v=zyvZUxnIC3k