nextaccountic 5 hours ago | parent | next [-]

This puts Sonnet 4.6 above Opus 4.6 in the coding index... kinda hard to trust those numbers.

(Also, it puts Opus 4.7 universally above Opus 4.6, and I may be wrong, but this doesn't seem to match the experience of most/many/some people. I think it's widely recognized that Anthropic is severely lacking compute and Opus 4.7 is a cost-saving measure.)

conception 2 hours ago | parent | next [-]

What I’ve usually seen is 4.7 -> 4.5 -> 4.6 in terms of quality. Though 4.7 seems to hallucinate more than before.

manmal 4 hours ago | parent | prev [-]

Anthropic themselves have (had?) this thing where Opus is used for planning and Sonnet for coding.

nextaccountic 3 hours ago | parent [-]

I thought this was a cost-saving measure: we plan with the frontier/SOTA model, then code with something cheaper. But then, Anthropic employees don't have rate limits, right?

Alifatisk 5 hours ago | parent | prev | next [-]

Those numbers don't look exciting at all. I may have gotten spoiled by releases from Qwen, Kimi and Z.ai, who keep closing the gap between closed-weight SOTA models and open-weight ones. In my experience, Grok is only useful for one thing: looking things up for you and gathering a consensus on topics. That's it.

Update: I noticed that Grok 4.3 is in the "Most attractive quadrant", that's cool! It is also in the top 5 on the "AA-Omniscience Index", good! Really good.

progbits 5 hours ago | parent | prev | next [-]

What's with the charts and numbers?

It says #1 for speed but then in the chart it's #2. Also says #10 for intelligence but then it's #7 in the chart.

BoorishBears 5 hours ago | parent | prev [-]

What an exciting game we're playing, where the most popular leaderboard is completely made up and the stakes are in the trillions.