sdesol 6 days ago
Something that I'm working on is making it easy to fix spelling and grammatical errors in documents, since those errors can affect BM25 and embeddings. So in addition to generating keywords/metadata with an LLM, you could also ask it to clean the document; however, based on what I've learned so far, fixing spelling and grammatical errors should involve humans in the process, so you really can't fully automate this.
andai 6 days ago
Fascinating. I think the process could be automated, though I don't know if it's been invented yet. You would want to use the existing autocomplete tech (probabilistic models based on Levenshtein distance and letter proximity on the keyboard?) in combination with actually understanding the context of the article, and use that context to select the right correction.

Actually, it sounds fairly trivial to slap those two together, and the 2nd half sounds like something a humble BERT could handle? (I've heard of people getting great results with BERTs in the current year, though they usually fine-tune them on their particular domain.)

I actually think even BERT could be overkill here -- I have a half-baked prototype of a keyword expansion system that should do the trick. The idea is to construct a data structure of keywords ahead of time (e.g. by data-mining some portion of Common Crawl), where each keyword has "neighbors" -- words that often appear together and (sometimes, but not always) signal relatedness.

I haven't taken the concept very far yet, but I give it better than even odds! (Especially if the resulting data structure is pruned by a half-decent LLM -- my initial attempts resulted in a lot of questionable "neighbors" -- though I had a fairly small dataset, so it's likely I was largely looking at noise.)
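A minimal sketch of how those two halves might fit together, not the prototype described above: candidates come from plain Levenshtein distance (keyboard proximity is left out), and the surrounding words pick among them using a co-occurrence "neighbors" table. The function names (`edit_distance`, `build_neighbors`, `correct`), the window size, and the two-sentence toy corpus are all assumptions made up for illustration; a real table would be mined from something far larger.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def build_neighbors(corpus, window=3):
    """Map each word to a Counter of words seen within `window` positions."""
    neighbors = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for other in words[max(0, i - window): i + window + 1]:
                if other != word:
                    neighbors[word][other] += 1
    return neighbors


def correct(word, context, neighbors, max_dist=1):
    """Pick the in-vocabulary candidate that best matches the context.

    Known words are returned unchanged. Otherwise, candidates within
    `max_dist` edits are ranked by co-occurrence with the context words,
    with smaller edit distance as the tie-breaker.
    """
    if word in neighbors:
        return word
    candidates = [v for v in neighbors if edit_distance(word, v) <= max_dist]
    if not candidates:
        return word  # nothing plausible; leave it for a human

    def score(candidate):
        overlap = sum(neighbors[candidate][w] for w in context)
        return (overlap, -edit_distance(word, candidate))

    return max(candidates, key=score)


# Toy corpus (hypothetical); "lon" is one edit away from both "loan" and "lion".
table = build_neighbors([
    "the bank approved the loan",
    "the lion slept in the zoo",
])
print(correct("lon", ["bank", "approved"], table))  # loan
print(correct("lon", ["zoo", "slept"], table))      # lion
```

With an empty or uninformative context the ranking falls back to edit distance alone, which is roughly the case where a human (or a stronger model) would still need to look at the correction.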
firejake308 6 days ago
> fixing spelling and grammatical errors should involve humans in the process, so you really can't automate this

This is an interesting observation to me. I would have expected that, since LLMs evolved from autocomplete/autocorrect algorithms, correcting spelling mistakes would be one of their strong suits. Do you have examples of cases where they fail?
|