cube2222 8 hours ago
Relatedly, I think it's worth noting that Anthropic models have consistently been top-scoring in BullshitBench[0], in a league of their own, really. I'm not affiliated with the bench in any way, but I think it surfaces important differences in behavior between models from different labs.

TLDR: The benchmark measures whether a model pushes back on nonsensical requests and questions, as opposed to going along with them and hallucinating a nonsensical answer.

[0]: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
mcintyre1994 7 hours ago
TBH this is the main thing that made me start trusting Claude enough to actually find it useful, and I'm surprised other models haven't caught up. I assumed they had and I just wasn't aware because I'm not using them in the same way.
Supermancho 5 hours ago
> I found my interactions with Fable to be extremely impressive; it made other models, including GPT 5.5 and Opus 4.8, feel small and dumb.

> Anthropic models have consistently been top-scoring in BullshitBench[0]

eyeroll

I find that Anthropic models feel big and dumber. https://www.endorlabs.com/research/ai-code-security-benchmar... puts Fable 5th, which seems about right to me. I'm interested in code utility and correctness, even if the majority of AI use is not focused on that.