▲ | dmurray 4 days ago
For the 800 names that were missing declension data in the database, the most straightforward fix would be to assign their declensions by hand. It shouldn't take a native speaker more than a couple of hours (if a name they haven't seen before is ambiguous, whatever they guess at least won't sound obviously wrong to other native speakers). Alternatively, it would be very cheap to ask an LLM to do it. Encoding the results into a trie like this would still be a good way to distribute them, but you wouldn't have to rely on the trie also being a good way to guess the declensions.
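To make the distinction concrete, here is a rough sketch of a suffix trie that keeps "this exact name is in the data" separate from "this is only a guess from a shared ending". It is not the article's implementation: the class and method names are made up for illustration, and the pattern IDs `p1` and `p2` are opaque placeholders, not real declension data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    patterns: Set[str] = field(default_factory=set)  # every pattern stored at or below this node
    exact: Optional[str] = None  # set only where a complete name ends


class SuffixTrie:
    """Hypothetical suffix trie: names are stored reversed, so shared endings share a path."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, name: str, pattern: str) -> None:
        node = self.root
        node.patterns.add(pattern)
        for ch in reversed(name.lower()):
            node = node.children.setdefault(ch, Node())
            node.patterns.add(pattern)
        node.exact = pattern

    def lookup(self, name: str) -> Tuple[Optional[str], str]:
        key = name.lower()
        node = self.root
        matched = 0
        for ch in reversed(key):
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node = nxt
            matched += 1
        # The whole name was found and it ends on a stored entry: trusted data.
        if matched == len(key) and node.exact is not None:
            return node.exact, "exact"
        # Only a suffix matched: guess, but only if everything below agrees.
        if matched > 0 and len(node.patterns) == 1:
            return next(iter(node.patterns)), "guess"
        return None, "unknown"


# Placeholder entries: real names, but the pattern IDs are invented labels.
trie = SuffixTrie()
trie.insert("Jón", "p1")
trie.insert("Guðrún", "p2")
```

With something like this, hand-assigned entries always come back tagged `"exact"`, and a caller can decide separately whether to accept a `"guess"` for a name that isn't in the data at all.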
▲ | alexharri 4 days ago
It would be good to cover more names for sure -- that's an ongoing process at DIM. Names are frequently added to the approved list of Icelandic names, so there's always going to be some lag. I would not be confident enough to add the data myself, since I'd probably be wrong a lot of the time. When reviewing the results for the top 100 unknown names, I frequently got results that I thought _might_ be wrong, but I wasn't sure. For those, I looked up similar names in DIM to verify, and often thought "huh, I would not have declined those names like this". For that reason, I rely on the DIM data as the source of truth, since it's maintained by experts on the language.
▲ | perching_aix 4 days ago
Yeah, that'd be a good idea. That said, it still wouldn't resolve the issue for names that are in use despite not being approved (or for foreign names). I also live in a country with a centrally governed personal name list, but you can request exceptions, and there are people who were born before the list existed, so their names won't necessarily be on the list either. I believe immigrants can also retain their names during naturalization, and there can be lots of other complications. So the ability to sorta-kinda predict the proper declension is still useful.
▲ | wizzwizz4 4 days ago
I see no reason that an LLM should be better at guessing than a trie (unless the actual example was in its training data, in which case a web search would be more appropriate).
▲ | esafak 4 days ago
I wonder if existing LLMs already know these patterns?