magicalhippo 11 hours ago

> Or, if LLMs are so smart, why doesn't it say "Hmmm, would you like to use a different model for this?"

That is indeed an area where LLMs don't shine.

That is, not only are they trained to always respond with an answer, they also have no reliable way to report how confident they are in that answer. So you can't simply filter out low-confidence answers.

mathewsanders 11 hours ago | parent [-]

Something I think would be interesting for model APIs and consumer apps to expose is the probability of each individual token generated.

I’m presuming that one class of junk/low-quality output occurs when the model has no high-probability next token and works with whatever poor options it has.

Maybe low-probability tokens that fall below some threshold could get a visual treatment, the same way word processors give feedback on a spelling or grammatical error.
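The idea above could be sketched roughly like this (everything here is illustrative: the token list, the probabilities, and the 0.3 threshold are made up; a real implementation would pull per-token log-probabilities from an API that exposes them, e.g. the `logprobs` option some providers offer):

```python
import math

def softmax(logits):
    # Convert raw logits to a probability distribution (numerically stable).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flag_low_confidence(tokens, token_probs, threshold=0.3):
    """Mark tokens whose generation probability fell below the threshold,
    analogous to a spell-checker underlining a dubious word."""
    return [(tok, p, p < threshold) for tok, p in zip(tokens, token_probs)]

# Toy example: pretend these were the chosen tokens and their probabilities.
tokens = ["The", "capital", "of", "Australia", "is", "Sydney"]
probs = [0.95, 0.90, 0.98, 0.85, 0.97, 0.12]  # model was unsure at the end

for tok, p, flagged in flag_low_confidence(tokens, probs):
    marker = " <-- low confidence" if flagged else ""
    print(f"{tok}: {p:.2f}{marker}")
```

A UI could render the flagged tokens with a squiggly underline instead of printing a marker.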

But maybe I’m making a mistake thinking that token probability is related to the accuracy of output?

StilesCrisis 10 hours ago | parent [-]

Lots of research has been done here. e.g. https://aclanthology.org/2024.findings-acl.558.pdf