Those transcriptions are already done by LLMs in the first place - in fact, audio transcription was one of the very first large-scale commercial uses of the technology in its current iteration.

This is just like playing a game of Markov telephone, where the step in OP's solution likely has a higher compute cost than the step YT uses, because YT is interested in minimizing costs.

albertzeyer a year ago | parent [-]

Probably just "regular" LMs, not large LMs. I'd guess some LM with 10-100M params or so, which is cheap to use (and very standard for ASR).

		devmor a year ago | parent [-]
		Could be. I ran through some offline LMs for voice-assisted home automation a couple of years ago and they were subpar compared to even the pathetic offering that YouTube provides - but Google of course has much more focused resources to fine-tune a small dataset model.