dmurray 4 days ago

For the 800 names that were missing declension data in the database, it seems like the most straightforward thing to do would be to assign their declensions by hand. It shouldn't take a native speaker more than a couple of hours (if some name they haven't seen before is ambiguous, then whatever they guess at least won't sound obviously wrong to other native speakers). Alternatively, it would be very cheap to ask an LLM to do it.

Encoding them into a trie like this would still be a good way to distribute the result, but you don't have to rely on the trie also being a good way to guess the declensions.
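A minimal sketch of that split, assuming a rule format of "strip N trailing characters, then append one suffix per case". The names `SuffixTrie`, `KNOWN`, and `decline` are hypothetical, and the two rules are illustrative only and have not been checked against DIM. Hand-assigned names are looked up first; the trie is consulted only as a fallback, and the result says which path produced it.

```python
from typing import Optional

# Rule: (characters to strip from the nominative,
#        suffixes for nominative, accusative, dative, genitive).
Rule = tuple[int, tuple[str, str, str, str]]

# Illustrative rules only (assumption, not verified against DIM).
UR_RULE: Rule = (2, ("ur", "", "i", "ar"))
UN_RULE: Rule = (0, ("", "u", "u", "ar"))


class SuffixTrie:
    """Trie keyed on name endings, walked from the last character backwards."""

    def __init__(self) -> None:
        self.children: dict[str, "SuffixTrie"] = {}
        self.rule: Optional[Rule] = None

    def insert(self, suffix: str, rule: Rule) -> None:
        node = self
        for ch in reversed(suffix):
            node = node.children.setdefault(ch, SuffixTrie())
        node.rule = rule

    def lookup(self, name: str) -> Optional[Rule]:
        """Return the rule attached to the longest matching ending, if any."""
        node: Optional[SuffixTrie] = self
        best: Optional[Rule] = None
        for ch in reversed(name):
            node = node.children.get(ch)
            if node is None:
                break
            if node.rule is not None:
                best = node.rule
        return best


def apply_rule(name: str, rule: Rule) -> tuple[str, ...]:
    cut, suffixes = rule
    stem = name[: len(name) - cut]
    return tuple(stem + s for s in suffixes)


# Hand-assigned entries (hypothetical example data).
KNOWN: dict[str, Rule] = {"Sigurður": UR_RULE}

# Fallback guesser built from name endings (hypothetical example data).
TRIE = SuffixTrie()
TRIE.insert("ur", UR_RULE)
TRIE.insert("ún", UN_RULE)


def decline(name: str) -> tuple[Optional[tuple[str, ...]], str]:
    """Return (forms, source), where source is 'known', 'guessed', or 'unknown'."""
    if name in KNOWN:
        return apply_rule(name, KNOWN[name]), "known"
    rule = TRIE.lookup(name)
    if rule is not None:
        return apply_rule(name, rule), "guessed"
    return None, "unknown"
```

Keeping the source label next to the forms means a caller can always tell a curated entry from a guess.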

alexharri 4 days ago

It would be good to cover more names for sure -- that's an ongoing process at DIM. Names are frequently added to the approved list of Icelandic names, so there's always going to be some lag.

I would not be confident enough to add the data myself, since I'd probably be wrong a lot of the time. When reviewing the results for the top 100 unknown names I frequently got results that I thought _might_ be wrong, but I wasn't sure. For those, I looked up similar names in DIM to verify, and often thought "huh, I would not have declined those names like this". For that reason, I rely on the DIM data as the source of truth since it's maintained by experts on the language.

perching_aix 4 days ago

Yeah, that'd be a good idea. That said, it still wouldn't resolve the issue for names that are in use despite not being approved (or foreign names).

I also live in a country with a centrally governed personal name list, but you can request exceptions, and there are people who were born before the list existed, so their names won't necessarily be on the list either. Immigrants can also retain their names during naturalization, I believe, and there can be lots of other complications still. So the ability to sorta-kinda predict the proper declension is still useful.

thaumasiotes 4 days ago

Related: https://en.wikipedia.org/wiki/Naming_laws_in_China#Ma_Cheng

wizzwizz4 4 days ago

I see no reason that an LLM should be better at guessing than a trie (unless the actual example was in its training data, in which case a web search would be more appropriate).

dmurray 4 days ago

I agree. I just like having the guessing done at compile time on principle. It allows you to change a guess, if you find that it's wrong, and convince yourself that you haven't broken any of the other cases where you were previously accidentally right.
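One way to get that reassurance, sketched under assumptions: freeze every guess into a table at build time, then diff the new table against the previous snapshot whenever the guesser changes. The helpers `build_table` and `diff_tables` are hypothetical, and the two guessers are toy rules for illustration, not real Icelandic grammar.

```python
from typing import Callable, Iterable, Optional

Forms = tuple[str, ...]
Guesser = Callable[[str], Forms]


def build_table(names: Iterable[str], guess: Guesser) -> dict[str, Forms]:
    """Run the guesser once per name and freeze the results."""
    return {name: guess(name) for name in names}


def diff_tables(
    old: dict[str, Forms], new: dict[str, Forms]
) -> dict[str, tuple[Optional[Forms], Optional[Forms]]]:
    """Return every name whose frozen forms differ between two snapshots."""
    changed: dict[str, tuple[Optional[Forms], Optional[Forms]]] = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Toy guessers (assumption: placeholder rules, not real declension logic).
def guess_v1(name: str) -> Forms:
    return (name, name, name, name + "s")


def guess_v2(name: str) -> Forms:
    # New special case for names ending in "a"; everything else as before.
    if name.endswith("a"):
        stem = name[:-1]
        return (name, stem + "u", stem + "u", stem + "u")
    return guess_v1(name)


NAMES = ["Anna", "Jón", "Karl"]
old_table = build_table(NAMES, guess_v1)
new_table = build_table(NAMES, guess_v2)
changes = diff_tables(old_table, new_table)
```

If `changes` contains only the names you meant to fix, the cases that were already right have not moved.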

wizzwizz4 3 days ago

My main objection is the temptation to mix real and fabricated data. Your entire dataset becomes much less useful if it's got nonsense mixed in with it, and if historical examples are anything to go by, it can be hundreds of years before someone identifies and untangles the nonsense from the fact. Any minor benefit is not worth this risk imo.

esafak 4 days ago

I wonder if existing LLMs already know these patterns?

jer0me 4 days ago

The Icelandic government has been proactive about helping OpenAI train its models on the language to stave off extinction: https://openai.com/index/government-of-iceland/

xigoi 4 days ago

If only they’d support open-source models instead, so the future of the language is not in the hands of a single foreign corporation…

thaumasiotes 4 days ago

Yes, this is an example of a problem that an LLM is ideally suited to solve.