margalabargala 5 hours ago

And therefore it scores worse on benchmarks?

▲

XCSme 5 hours ago | parent | next [-]

Also, Claude/Fable models are quite bad at instruction following: https://artificialanalysis.ai/evaluations/ifbench

▲

XCSme 5 hours ago | parent | prev [-]

On some it does, yes, and also in real usage.

It avoided answering 2 of the 21 tests in this specific benchmark, so its maximum possible score is already capped at 19/21, roughly 90%.

▲

margalabargala 4 hours ago | parent [-]

I'm glad those tests apparently work out for you, but a benchmark where three of the top 5 models are different flavors of Gemini Flash and none are from Anthropic is so wildly divergent from my personal experience with the models that it's not useful to me.

Whatever it is you're measuring, it's not anything related to what I use models for.

▲

XCSme 4 hours ago | parent [-]

Thanks for the feedback!

What are you using Claude models for? Coding only? Computer use? Which harness?

▲

margalabargala 4 hours ago | parent [-]

Not only coding but also general knowledge work: anything from learning how things work (e.g. walking me through PNP vs NPN transistors) to summarizing texts, doing web research, and occasionally some OCR. I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two. But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very, very close.