Show HN: SOTA NLP Models (huggingface.co)
1 point by ChadNauseam 6 hours ago

Hi everyone, I needed to break sentences into their individual words and figure out what part of speech each word is. Explosion's spaCy models are absolutely incredible for English, clearly some top-tier engineering that I could never come close to, but for other languages they're quite weak. I created my own by taking spaCy outputs, cleaning them up with an LLM, and then fine-tuning a Gemma model on the cleaned data. The resulting models give extremely good and consistent results for 7 languages. They are also much cheaper and more consistent than would be possible with ChatGPT. (For example, should "don't" be treated as "don't" or as "do" + "n't"? ChatGPT will pick one at random.)

It sounds simple, and I'm not going to say it was the most complicated thing ever, but there were quite a few steps involved in getting it right. Getting LLMs to do the cleanup task consistently is very hard. You wouldn't think it, but there are often multiple valid ways to break down a sentence.
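
To make the "don't" ambiguity concrete, here is a minimal Python sketch. It is not the actual cleanup code: the function names are made up for illustration, and the rule of always splitting off "n't" is just one possible convention, picked here as an assumption. The point is that both breakdowns are valid, so the training data has to commit to one.

```python
def is_valid_segmentation(text: str, pieces: list[str]) -> bool:
    """True if the pieces, glued together, give back the text (spaces ignored)."""
    return "".join(pieces).replace(" ", "") == text.replace(" ", "")


def split_negation(pieces: list[str]) -> list[str]:
    """Apply one fixed convention (an assumption): always split off a trailing "n't"."""
    out: list[str] = []
    for piece in pieces:
        if piece.lower().endswith("n't") and len(piece) > 3:
            out.extend([piece[:-3], piece[-3:]])
        else:
            out.append(piece)
    return out


sentence = "I don't know"
option_a = ["I", "don't", "know"]      # contraction kept whole
option_b = ["I", "do", "n't", "know"]  # contraction split

# Both breakdowns cover the sentence, so neither is "wrong" on its own.
print(is_valid_segmentation(sentence, option_a))
print(is_valid_segmentation(sentence, option_b))

# Normalizing to a single convention removes the ambiguity.
print(split_negation(option_a))
```

Whichever convention you choose matters less than applying it to every training example, so the fine-tuned model never sees both forms.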

An interesting part was structuring the model output so it could use the exact same tokens as the input. Most tokens are prefixed by a space, so you want the model's "desired output" to also contain the words prefixed by a space. That makes the task much easier because the model doesn't have to learn the mapping between prefixed and unprefixed tokens. Doing that instantly made my models perform much better.
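
Here is a rough Python sketch of the space-prefix idea. The output layout (one piece per line, a tab, then the tag) and the function names are assumptions for illustration, not the format my models actually use. What matters is that each word in the target keeps the leading space it had in the input, so it is likely to map to the same tokens.

```python
import re


def split_keep_space(sentence: str) -> list[str]:
    """Split into words and punctuation, keeping any leading space on each piece."""
    return re.findall(r"\s*(?:\w+|[^\w\s])", sentence)


def format_target(pieces: list[str], tags: list[str]) -> str:
    """Build a training target with one 'piece<TAB>tag' line per piece.

    The pieces are written unchanged, including the leading space, so the
    target text reuses the same surface strings as the input sentence.
    """
    if len(pieces) != len(tags):
        raise ValueError("need exactly one tag per piece")
    return "\n".join(f"{piece}\t{tag}" for piece, tag in zip(pieces, tags))


sentence = "The cat sat."
pieces = split_keep_space(sentence)
print(pieces)

# Tags are written by hand here; in practice they would come from the cleaned data.
print(format_target(pieces, ["DET", "NOUN", "VERB", "PUNCT"]))
```

A handy sanity check is that joining the pieces gives back the original sentence, which only works because the spaces were never stripped.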