▲ | novok 3 months ago | |||||||
You can also download the audio only with yt-dlp and then remake subs with whisper or whatever other model you want. GPU compute wise it will probably be less than asking an llm to try to correct a garbled transcript. | ||||||||
▲ | ldenoue 3 months ago | parent | next [-] | |||||||
The current Flash-8B model I use costs $1 per 500 hours of transcript. | ||||||||
| ||||||||
▲ | HPsquared 3 months ago | parent | prev [-] | |||||||
I suppose the gold standard would be a multimodal model that also looks at the screen (maybe only if the captions aren't making much sense). |