wongarsu 4 hours ago

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize for that is to confidently state something in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
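
To make the incentive concrete, here is a rough sketch in Python. The function names, the 10% confidence figure, and the `wrong_penalty` parameter are all made up for illustration and are not taken from any real benchmark. It compares the expected score of answering versus abstaining on a single question, first under plain accuracy scoring and then under a hypothetical scheme that subtracts points for wrong answers.

```python
def expected_score(p_correct: float, answer: bool, wrong_penalty: float = 0.0) -> float:
    """Expected score on one question.

    p_correct     -- the model's chance of being right if it answers (assumed known)
    answer        -- True to answer, False to abstain ("I don't know")
    wrong_penalty -- points subtracted for a wrong answer (hypothetical knob;
                     0.0 reproduces plain accuracy scoring)
    """
    if not answer:
        return 0.0  # abstaining earns nothing and costs nothing
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_policy(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Pick whichever action has the higher expected score."""
    if expected_score(p_correct, True, wrong_penalty) > 0.0:
        return "answer"
    return "abstain"


if __name__ == "__main__":
    # A model that is only 10% sure of its answer.
    print(best_policy(0.1))                     # accuracy-only scoring
    print(best_policy(0.1, wrong_penalty=1.0))  # wrong answers cost a point
```

Under accuracy-only scoring, answering beats abstaining for any nonzero chance of being right, so a confident guess is always the score-maximizing move. Once wrong answers carry a cost, abstaining wins whenever the chance of being right is low enough.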

Imustaskforhelp 4 hours ago | parent [-]

If this is the case, then the GLM 5.2 model seems better than GPT 5.5, or maybe even "Fable", depending on what you are trying to achieve.

Fable is the model being removed from Anthropic because of security concerns raised by the US government (or, well, also partly because of the personal vendetta between the US government and Anthropic).