jkelleyrtp 2 hours ago

On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code

▲

amluto 2 hours ago | parent | next [-]

That blog post really makes it look like it's graded from an LLM's estimation of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

▲

swyx 41 minutes ago | parent | prev | next [-]

jump in chart form https://x.com/swyx/status/2064414823748886591/photo/1

▲

zzleeper 2 hours ago | parent | prev | next [-]

How credible is this benchmark? Does it correlate with others' real-world experience?

▲

bfeynman 2 hours ago | parent | next [-]

Given it was made by Cognition (the team behind the Devin flop), who now just get to wait it out until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that probably change a lot with a better code model.

▲

vanuatu 2 hours ago | parent [-]

The subjective framework is exactly why it's good. Prior benchmarks relied mostly on unit tests or synthetic judges, which are easily benchmaxxed, and that leads to nobody trusting benchmarks. We need people manually checking the data for good code quality.

▲

vanuatu 2 hours ago | parent | prev | next [-]

i worked on one of the benchmarks typically found in new model releases

this benchmark looks very good from the methodology. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)

▲

Catloafdev 2 hours ago | parent | prev | next [-]

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

▲

schipperai an hour ago | parent | prev | next [-]

Cognition did well in documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

▲

emp17344 2 hours ago | parent | prev [-]

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

▲

swyx an hour ago | parent | next [-]

team member here - we had been working on frontiercode for ~6-7 months. timing just lined up

▲

osti an hour ago | parent | prev | next [-]

And notable absence of DeepSWE benchmark where they do badly, but somehow a benchmark that was published yesterday is in this announcement.

▲

vanuatu 2 hours ago | parent | prev | next [-]

i doubt it, cog wants coding agents to be better because it directly improves their product

they aren't married to a particular lab, most of their usage is their in-house model i believe

▲

anthonypasq 2 hours ago | parent | prev [-]

what incentive does Cognition have for doing this? seems like complete nonsense speculation on your part.

▲

bel8 2 hours ago | parent [-]

With billions/trillions of dollars floating around, is it hard to imagine benchmarks could be biased?

I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.

▲

camdenreslink an hour ago | parent | next [-]

People game benchmarks for fake internet points to get their favorite web framework to the top of the list. I'm pretty sure they will do it for billions of dollars.

▲

anthonypasq 30 minutes ago | parent | prev [-]

You didn't answer my question. Why would Cognition be biased towards making Anthropic look good?

▲

hydra-f 2 hours ago | parent | prev | next [-]

Yes, and the price reflects that

▲

leecommamichael 2 hours ago | parent [-]

I'm not familiar with model pricing trends, did they clearly state how the new pricing compares? (Note that I'm actually asking a question, and am not arguing)

EDIT: Oh I see, this is the best link for pricing https://platform.claude.com/docs/en/about-claude/pricing

So the price is double across the board...

▲

bhelkey 2 hours ago | parent | next [-]

>Fable 5 and Mythos 5 are being offered at $10 per million input tokens and $50 per million output tokens

From their pricing page, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens [1].

[1] https://platform.claude.com/docs/en/about-claude/models/over...

▲

wongarsu 2 hours ago | parent [-]

Still cheaper than Opus 4.0 and 4.1 (which were and still are $15/MTok input and $75/MTok output). I would have expected Mythos to be much more expensive than just 2x current Opus (which is clearly cheaper to run than original Opus).

▲

hydra-f 2 hours ago | parent | prev [-]

As per OpenRouter:

Input Price $10/M tokens

Output Price $50/M tokens

Cache Read $1/M tokens

Cache Write $12.50/M tokens

2x Claude Opus 4.8, same as Claude Opus 4.8 (Fast)

Frankly, not even Opus 4.8 would be enough of an incentive to use at that price range (enterprise-wise; would not even bat an eye as a consumer)
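
For anyone who wants to sanity-check the "2x" figure, here is a minimal sketch that uses only the per-million-token prices quoted in this thread. The function names and the example token counts are hypothetical and chosen for illustration; cache pricing is left out.

```python
# Prices in USD per million tokens, as quoted in this thread.
PRICES = {
    "Opus 4.8": {"input": 5.00, "output": 25.00},
    "Fable 5": {"input": 10.00, "output": 50.00},
    "Opus 4.1": {"input": 15.00, "output": 75.00},
}


def request_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request, ignoring cache reads and writes."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def price_ratio(model, baseline):
    """Return how many times more expensive `model` is than `baseline`."""
    return {
        kind: PRICES[model][kind] / PRICES[baseline][kind]
        for kind in ("input", "output")
    }


if __name__ == "__main__":
    # Hypothetical workload: 100k input tokens and 20k output tokens.
    for name in PRICES:
        print(name, request_cost(name, 100_000, 20_000))
    print(price_ratio("Fable 5", "Opus 4.8"))
```

With these numbers the ratio is 2.0 for both input and output, so any mix of input and output tokens costs exactly twice as much on Fable 5 as on Opus 4.8, before caching is taken into account.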

▲

OtomotO 33 minutes ago | parent | prev | next [-]

Bummer! When can I finally and confidently get slopcode into Zig?

▲

m3kw9 2 hours ago | parent | prev [-]

FrontierCode is likely paid for by Anthropic.

▲

lanthissa 2 hours ago | parent | next [-]

did they not pay them enough to get good ratings on the other 3 models?

What's the logic in claiming it's a borked metric when everything listed is an Anthropic model?

▲

Narretz 2 hours ago | parent [-]

There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge, and the competition is also way below it.

▲

reasonableklout 2 hours ago | parent | prev [-]

Huh? It's a benchmark by Cognition which (1) is building their own models and (2) offers all providers and thus has an incentive to avoid hyping up any one too much.

▲

jstummbillig 2 hours ago | parent [-]

But you can just say shit now. Tokens might not be too cheap to meter but saying shit increasingly is.