You can also download the audio only with yt-dlp and then remake subs with whisper or whatever other model you want. GPU compute wise it will probably be less than asking an llm to try to correct a garbled transcript.

▲

ldenoue 2 years ago | parent | next [-]

The current Flash-8B model I use costs $1 per 500 hours of transcript.

	▲	andai 2 years ago \| parent [-]
		If I read OpenAI's pricing right, then Google's thing is 200 times cheaper?

▲

HPsquared 2 years ago | parent | prev [-]

I suppose the gold standard would be a multimodal model that also looks at the screen (maybe only if the captions aren't making much sense).