thmpp 2 days ago

While 'this analysis would not have been possible without LLM', I am not sure the LLM analysis was well reviewed after it was done. From the obscure/familiar word list, some of the n-grams, e.g. "is resource", "seq size", "db xref", surely occur in the wild (as we well know), but I doubt we can argue they are missing from the dictionary. Knowing the realm, I would argue none of them are words, not even collocations. If "is resource" is, why not "has resource"? So while the path is surely interesting, this analysis lacks the scrutiny you would expect from a high-level LLM analysis.

michaeld123 2 days ago | parent [-]

The very bottom of the slider is there to illustrate where LLM artifacts and Wiktionary noise live — it's not presented as legitimate vocabulary. The slider lets you see the full quality gradient, including where it breaks down.

exmadscientist 3 hours ago | parent [-]

That's not really mentioned in the article, though. As far as the article is concerned, the right side of that slider is valid-but-possibly-too-rare-to-be-interesting, when in fact it's just garbage. This does not sell the concept well.

michaeld123 an hour ago | parent [-]

You were right — it is now. Thanks