▲ | waltbosz 20 hours ago | |
I wonder how the results of an AI Japanese-audio-to-English-subtitles would compare to a fansub-ed anime. I'm guessing it would be a more literal translation vs. contextual or cultural. I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flare. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma... Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio. | ||
▲ | chazeon 20 hours ago | parent | next [-] | |
I do japanese transcription + gemini translations. It’s worse than fansub, but its much much better than nothing. First thing that could struggle is actually the vad, then is special names and places, prompting can help but not always. Finally it’s uniformity (or style). I still feel that I can’t control the punctuation well. | ||
▲ | numpad0 18 hours ago | parent | prev [-] | |
I was recently just playing around with Google Cloud ASR as well as smaller Whisper models, and I can say it hasn't gotten to that point: Japanese ASRs/STTs all generate final kanji-kana mixed text, and since kanji:pronunciation is n:n maps, it's non-trivial enough that it currently need hands from human native speakers to fix misheard texts in a lot of cases. LLMs should be theoretically good at this type of tasks, but they're somehow clueless about how Japanese pronunciation works, and they just rubber-stamp inputs as written. The conversion process from pronunciation to intended text is not deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as ASR/STT, or a novel dual input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seemed like a semi-open question. |