| ▲ | KaoruAoiShiho a day ago |
| Is this really worthy of a Claude 4 label? Was there a new pre-training run? Because this feels like 3.8... only SWE went up significantly, and as we all understand by now, that is done by cramming on specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, which says to me that it's not really smarter. |
|
| ▲ | minimaxir a day ago | parent | next [-] |
| So I decided to try Claude 4 Sonnet against my "Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30" benchmark, which I previously ran against Claude 3.5 Sonnet: https://news.ycombinator.com/item?id=42584400 The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem with the relevant micro-optimizations commented in that HN discussion (oddly, in the second pass it a) regresses from a vectorized approach to a linear approach and b) generates and iterates on three different implementations instead of one final one), although it's possible Claude 4 was trained on that discussion lol. EDIT: "utterly crushed" may have been hyperbole. |
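(For reference, a plain-Python baseline for this benchmark prompt might look like the sketch below — my own illustration for readers following along, not Claude's output.)

```python
import random

def digit_sum(n: int) -> int:
    # sum of decimal digits by repeated divmod
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
matches = [n for n in nums if digit_sum(n) == 30]
print(max(matches) - min(matches))
```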
| |
| ▲ | diggan a day ago | parent | next [-] | | > although it's possible Claude 4 was trained on that discussion lol Almost guaranteed, especially since HN tends to be popular in tech circles, and it's also trivial to scrape the entire thing in a couple of hours via the Algolia API. Recommendation for the future: keep your benchmarks/evaluations private, as otherwise they become basically useless once more models are published that were trained on your data. This is what I do, and I usually don't see the "huge improvements" that other public benchmarks seem to indicate when new models appear. | | |
| ▲ | dr_kiszonka a day ago | parent [-] | | >> although it's possible Claude 4 was trained on that discussion lol > Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing via the Algolia API. I am wondering if this could be cleverly exploited. <twirls mustache> |
| |
| ▲ | bsamuels a day ago | parent | prev | next [-] | | as soon as you publish a benchmark like this, it becomes worthless because it can be included in the training corpus | | |
| ▲ | rbjorklin a day ago | parent [-] | | While I agree with you in principle, give Claude 4 a try on something like https://open.kattis.com/problems/low. I would expect this to have been included in the training material, along with solutions found on GitHub. I've tried providing the problem description and asking Claude Sonnet 4 to solve it, and so far it hasn't been successful. |
| |
| ▲ | thethirdone a day ago | parent | prev | next [-] | | The first iteration, vectorized with numpy, is the best solution imho. The only additional optimization is using modulo 9 to get the digit sum mod 9; since a number is congruent to its digit sum mod 9, that filter keeps only about 1/9th of the numbers. The digit summing is the slow part, so reducing the number of values there results in a large speedup. Numpy can do that filter pretty fast as `arr = arr[arr % 9 == 3]`. With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to manually iterate for speed. | |
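A minimal sketch of the numpy approach described above, with the mod-9 pre-filter (illustrative code, not the model's actual output; the function name is made up):

```python
import numpy as np

def digit_sum_30_range(arr: np.ndarray) -> int:
    # digit sum ≡ n (mod 9), so any value whose digits sum to 30 satisfies n % 9 == 3
    candidates = arr[arr % 9 == 3]
    digits = candidates.copy()
    sums = np.zeros_like(candidates)
    for _ in range(6):              # values are at most 100,000, i.e. 6 digits
        sums += digits % 10
        digits //= 10
    matches = candidates[sums == 30]
    return int(matches.max() - matches.min())

rng = np.random.default_rng(0)
arr = rng.integers(1, 100_001, size=1_000_000)
print(digit_sum_30_range(arr))
```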
| ▲ | isahers32 a day ago | parent | prev | next [-] | | Might just be missing something, but isn't 9+9+9+9+3 = 39, not 30? The largest qualifying number, I believe, is 99930. Also, it could further optimize by terminating the digit-sum calculation early if the running sum goes above 30 or can no longer reach 30 (i.e., the number of digits remaining * 9 is less than 30 - current_sum). imo this is pretty far from "crushing it" | |
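A sketch of the early-exit check described above (my own illustration; `digits_sum_to` is a hypothetical helper name, not from the thread):

```python
def digits_sum_to(n: int, target: int = 30) -> bool:
    remaining = len(str(n))   # digits left to inspect
    s = 0
    while n:
        s += n % 10
        n //= 10
        remaining -= 1
        # bail out once the target is exceeded or can no longer be reached
        if s > target or s + 9 * remaining < target:
            return False
    return s == target
```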
| ▲ | Epa095 a day ago | parent | prev | next [-] | | I find it weird that it does an inner check on 'num > 99999', which pretty much only filters out 100,000 itself. It could check against 99930 (the largest possible value with digit sum 30), but I doubt even that check makes it much faster. Have you tried with some other target than 30? Does it screw up the upper and lower bounds? | |
| ▲ | losvedir a day ago | parent | prev | next [-] | | Same for me, with this past year's Advent of Code. All the models until now have been stumped by Day 17 part 2. But Opus 4 finally got it! Good chance some of that is in its training data, though. | |
| ▲ | jonny_eh a day ago | parent | prev | next [-] | | > although it's possible Claude 4 was trained on that discussion lol This is why we can't have consistent benchmarks | | |
| ▲ | teekert a day ago | parent [-] | | Yeah, I agree. Also, what is the use of that benchmark? Who cares? How does it relate to stuff that does matter? |
| |
| ▲ | kevindamm a day ago | parent | prev [-] | | I did a quick review of its final answer and it looks like there are logic errors. All three of them get the incorrect max-value bound (even with comments saying 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solutions, but that's an optimization detail. The first version would, however, early terminate on the first occurrence of 3999 and take whatever the max value was up to that point. So, for many inputs the first one (via solve_digit_sum_difference) is just wrong. The second implementation (solve_optimized, not a great name either) and the third implementation at least appear to be correct... but the docstrings and the comments in general are atrocious. In a review I would ask for these to be reworded, and would only expect juniors to include anything similar in a pull request. I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem solution before, and is possibly overfitting. Not bad, but I wouldn't say it crushed it, and I wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run. Have you tried the same question with other sums besides 30? | |
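A minimal timing harness of the kind asked for here could look like the sketch below (my own illustration; the `candidates` dict is a placeholder where the model-generated functions would be dropped in):

```python
import random
import timeit

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]

def baseline(values):
    # straightforward reference implementation to compare against
    matches = [n for n in values if sum(int(d) for d in str(n)) == 30]
    return max(matches) - min(matches)

# add the model-generated functions here to compare them against the baseline
candidates = {"baseline": baseline}

for name, fn in candidates.items():
    total = timeit.timeit(lambda: fn(nums), number=3)
    print(f"{name}: {total / 3:.3f}s per run")
```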
|
|
| ▲ | wrsh07 a day ago | parent | prev | next [-] |
My understanding of the original OpenAI and Anthropic labels was essentially: GPT-2 was 100x more compute than GPT-1. Same for 2 to 3. Same for 3 to 4. Thus, GPT-4.5 was 10x more compute.^ If Anthropic is doing the same thing, then 3.5 would be 10x more compute vs 3, 3.7 might be 3x more than 3.5, and 4 might be another ~3x. ^ I think this maybe involves words like "effective compute", so it might not be a full pretrain, but it might be! If you used 10x more compute, that could mean doubling the amount used on pretraining and then using 8x compute on post-training, or some other distribution. |
| |
| ▲ | swyx a day ago | parent [-] | | Beyond 4 that's no longer true - marketing took over from the research | | |
| ▲ | wrsh07 a day ago | parent [-] | | Oh shoot, I thought that still applied to 4.5, just in a more "effective compute" way (not 100x more parameters, but 100x more compute in training). But alas, it's not like a "3nm" fab node means the literal thing either. Marketing always dominates (and not necessarily in a way that adds clarity). |
|
|
|
| ▲ | Bjorkbat a day ago | parent | prev | next [-] |
| I was about to comment on a past remark from Anthropic that the whole reason for the convoluted naming scheme was that they wanted to wait until they had a model worthy of the "Claude 4" title. But because of all the incremental improvements since then, the irony is that this merely feels like another incremental improvement. It obviously is a huge leap when you consider that the best Claude 3 ever got on SWE-bench Verified was just under 20% (combined with SWE-agent), but compared to Claude 3.7 it doesn't feel like that big of a deal, at least when it comes to SWE-bench results. Is it worthy? Sure, why not, compared to the original Claude 3 at any rate, but this habit of incremental improvement means that a major new release feels kind of ordinary. |
|
| ▲ | causal a day ago | parent | prev | next [-] |
| There's even a slight decrease from Sonnet 3.7 in a few areas. As always, benchmarks say one thing; I'll need some hands-on time with it to form a subjective opinion. |
|
| ▲ | jacob019 a day ago | parent | prev | next [-] |
| Hey, at least they incremented the version number. I'll take it. |
|
| ▲ | mike_hearn a day ago | parent | prev | next [-] |
| They say in the blog post that tool use has improved dramatically: parallel tool use, the ability to use tools during thinking, and more. |
|
| ▲ | oofbaroomf a day ago | parent | prev | next [-] |
| The improvement from Claude 3.7 wasn't particularly huge. The improvement from Claude 3, however, was. |
|
| ▲ | drusepth a day ago | parent | prev | next [-] |
| To be fair, a lot of people said 3.7 should have just been called 4. Maybe they're just bridging the gap. |
|
| ▲ | greenfish6 a day ago | parent | prev | next [-] |
| Benchmarks don't tell you as much as the actual coding vibe though |
|
| ▲ | zamadatix a day ago | parent | prev | next [-] |
| It feels like the days of Claude 2 -> 3 or GPT 2 -> 3 level changes for the leading models are over, so you're either going to end up with really awkward version numbers or just embrace it and increment the number. Nobody cares that a Chrome update gives a major version change of 136 -> 137 instead of 12.4.2.33 -> 12.4.3.0, for similar kinds of "the version number doesn't always have to represent the amount of work/improvement compared to the previous release" reasoning. |
| |
| ▲ | saubeidl a day ago | parent | next [-] | | It feels like LLM progress in general has kinda stalled and we're only getting small incremental improvements from here. I think we've reached peak LLM - if AGI is a thing, it won't be through this architecture. | | |
| ▲ | nico a day ago | parent | next [-] | | Diffusion LLMs seem like they could be a huge change. Check this out from yesterday (watch the short video here): https://simonwillison.net/2025/May/21/gemini-diffusion/ From: https://news.ycombinator.com/item?id=44057820 | | | |
| ▲ | bcrosby95 a day ago | parent | prev | next [-] | | Even if LLMs never reach AGI, they're good enough that a lot of very useful tooling can be built on top of and around them. I think of it more like the introduction of computing or the internet. That said, whether being a provider of these services is a profitable endeavor is still unknown. There's a lot of subsidizing going on, and some of the lower-value uses might fall by the wayside as companies eventually need to make money off this stuff. | |
| ▲ | michaelbrave a day ago | parent [-] | | This was my take as well, though after a while I've started thinking of it as closer to the introduction of electricity, which in a lot of ways is considered the second stage of the industrial revolution; the internet and AI might be considered the second stage of the computing revolution (or so I expect history books to label it). But just like electricity, it doesn't seem to be very profitable for the providers, while being highly profitable for everything that uses it. | |
| |
| ▲ | jenny91 a day ago | parent | prev | next [-] | | I think it's a bit early to say. At least in my domain, the models released this year (Gemini 2.5 Pro, etc.) are crushing models from last year. I would therefore not, by any means, be ready to call the situation a stall. | |
| ▲ | goatlover a day ago | parent | prev [-] | | Which brings up the question of why AGI is a thing at all. Shouldn't LLMs just be tools to make humans more productive? | | |
| ▲ | amarcheschi a day ago | parent [-] | | Think of the poor VCs who are selling AGI as the second coming of Christ |
|
| |
| ▲ | onlyrealcuzzo a day ago | parent | prev [-] | | Did you see Gemini 1.5 Pro vs 2.5 Pro? | | |
| ▲ | zamadatix a day ago | parent [-] | | Sure, but despite there being a 2.0 release in between (which they didn't even feel the need to release a Pro for), it still isn't the kind of GPT 2 -> 3 improvement we were hoping would continue for a bit longer. Companies will continue to release these incremental improvements, which are all neck-and-neck with each other. That's fine and good; just don't expect the versioning to represent the same relative difference rather than the relative release increment. | |
| ▲ | cma a day ago | parent [-] | | I'd say 1.5 Pro to 2.5 Pro was a 3 -> 4 level improvement, but the problem is 1.5 Pro wasn't state of the art when released, except for context length, and 2.5 wasn't that kind of improvement compared to the best OpenAI or Claude stuff available when it released. 1.5 Pro was worse than the original GPT-4 on several coding things I tried head to head. |
|
|
|
|