Remix.run Logo
stingraycharles 2 days ago

> Assuming it is almost as good as Opus 4.6 (which benchmarks seem to give evidence for)

That’s a big if. It’s my experience that models that perform very well on benchmarks do not necessarily perform well in real life.

I’ve mostly started ignoring the benchmarks and run my own evals.

ting0 2 days ago | parent | next [-]

> It’s my experience that models that perform very well on benchmarks do not necessarily perform well in real life

Well, yeah... Like Opus 4.5, 4.6, 4.7. Top of the benchmarks and yet it's a pile of crap at the moment and has been for months.

jatora 2 days ago | parent | prev [-]

If benchmarks are all to be believed then gemini 3.1 and grok 4.2 are still in the lead pack. A laughable notion to anyone who has actually tried to use them and compared.