epitrochoid413 3 hours ago
I built a context-aware furigana converter for Japanese text, files, and web pages.

The main problem I wanted to solve was that simple dictionary-based furigana works well for common cases, but breaks on words where the reading depends on context:

* 市場: いちば or しじょう
* 大分: おおいた or だいぶ
* 人気: にんき or ひとけ
* 最中: さいちゅう or さなか or もなか
* 方: かた or ほう

The engine is a hybrid system:

* Sudachi for tokenization, base forms, POS, and candidate readings
* Expanded dictionary coverage for compounds and fixed expressions
* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides
* ModernBERT fallback for 144 especially context-dependent target words

I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens. I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.

The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms. I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.
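For anyone curious how the layers described above might fit together, here is a minimal sketch, not the actual engine. It assumes tokens already carry candidate readings (the role Sudachi plays above), and everything named here (`Token`, `PHRASE_OVERRIDES`, `AMBIGUOUS`, `resolve`, the `classifier` callback) is hypothetical. The classifier stands in for the model fallback and is just a plain function in this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    surface: str
    candidates: list[str]  # candidate readings, dictionary default first


# Hypothetical phrase overrides keyed by (previous surface, surface).
PHRASE_OVERRIDES = {
    ("魚", "市場"): "いちば",
    ("株式", "市場"): "しじょう",
}

# Hypothetical subset of the context-dependent target words.
AMBIGUOUS = {"市場", "大分", "人気", "最中", "方"}

Classifier = Callable[["list[Token]", int], Optional[str]]


def resolve(tokens: list[Token], classifier: Classifier | None = None) -> list[str]:
    """Pick one reading per token: override, then classifier, then default."""
    readings = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].surface if i > 0 else ""
        # Layer 1: fixed phrase overrides win outright.
        reading = PHRASE_OVERRIDES.get((prev, tok.surface))
        # Layer 2: ask the classifier only for known ambiguous words.
        if reading is None and classifier is not None and tok.surface in AMBIGUOUS:
            guess = classifier(tokens, i)
            # Ignore guesses that are not among the dictionary candidates.
            if guess in tok.candidates:
                reading = guess
        # Layer 3: fall back to the dictionary default.
        if reading is None:
            reading = tok.candidates[0]
        readings.append(reading)
    return readings
```

Restricting the classifier to a fixed word list and to dictionary candidates keeps the expensive, less predictable layer from overriding readings that the cheaper layers already get right.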
fenomas 25 minutes ago
Nice work, just gave a quick pass but seems to work well! (Also: vouched, your comment was dead FYI)