▲ | minimaxir a day ago
So I decided to try Claude 4 Sonnet against the benchmark I previously ran against Claude 3.5 Sonnet: "Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30." (https://news.ycombinator.com/item?id=42584400)

The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem, applying the relevant micro-optimizations mentioned in that HN discussion. Oddly, in the second pass it a) regresses from a vectorized approach to a linear one and b) generates and iterates on three separate solutions instead of converging on one final version. It's also possible Claude 4 was trained on that discussion, lol.

EDIT: "utterly crushed" may have been hyperbole.
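
For anyone who wants to follow along without opening the gist, the prompt boils down to something like this naive baseline (a rough sketch of mine, not one of Claude's actual solutions):

    # Minimal, unoptimized baseline of the prompt.
    import random

    def digit_sum(n: int) -> int:
        s = 0
        while n:
            s += n % 10
            n //= 10
        return s

    nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
    hits = [n for n in nums if digit_sum(n) == 30]
    print(max(hits) - min(hits))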
▲ | diggan a day ago
> although it's possible Claude 4 was trained on that discussion lol

Almost guaranteed, especially since HN is popular in tech circles and trivial to scrape in its entirety in a couple of hours via the Algolia API.

Recommendation for the future: keep your benchmarks/evaluations private, otherwise they're basically useless once newer models have been trained on your data. This is what I do, and I usually don't see the "huge improvements" that public benchmarks seem to indicate when new models appear.
| ||||||||
▲ | bsamuels a day ago
as soon as you publish a benchmark like this, it becomes worthless because it can be included in the training corpus
| ||||||||
▲ | thethirdone a day ago
The first iteration, vectorized with numpy, is the best solution imho. The only additional optimization is using modulo 9: a number's digit sum is congruent to the number mod 9, and 30 mod 9 is 3, so the filter keeps only about 1/9th of the numbers. The digit summing is the slow part, so reducing the number of values there results in a large speedup, and numpy can do that filter pretty fast as `arr = arr[arr % 9 == 3]`.

With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to manually iterate for speed.
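
Concretely, the numpy version with that prefilter looks something like this (my rough sketch of the idea, not the exact code from the gist):

    import numpy as np

    arr = np.random.randint(1, 100_001, size=1_000_000)

    # Digit sum is congruent to the number mod 9, and 30 % 9 == 3,
    # so this keeps only ~1/9th of the values before the expensive step.
    arr = arr[arr % 9 == 3]

    # Vectorized digit sum (values here are at most 6 digits).
    digits = (arr[:, None] // 10 ** np.arange(6)) % 10
    hits = arr[digits.sum(axis=1) == 30]
    print(hits.max() - hits.min())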
▲ | isahers32 a day ago
Might just be missing something, but isn't 9+9+9+9+3 = 39? I believe the largest possible number is 99930. Also, it could further optimize by terminating the digit-sum calculation early once the sum exceeds 30, or once it can no longer reach 30 (i.e. digits remaining * 9 is less than 30 - current_sum). imo this is pretty far from "crushing it".
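
By early termination I mean something like this (rough sketch, not code from the gist):

    def digits_sum_to_30(n: int) -> bool:
        total = 0
        remaining = len(str(n))  # digits left to inspect
        while n:
            total += n % 10
            remaining -= 1
            if total > 30:                    # overshot, stop early
                return False
            if total + remaining * 9 < 30:    # can no longer reach 30
                return False
            n //= 10
        return total == 30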
▲ | Epa095 a day ago
I find it weird that it does an inner check on `num > 99999`, which in this range only ever excludes 100,000 itself. It could check against 99930 (the largest value whose digits sum to 30), though I doubt even that makes it much faster.

But have you checked with some target other than 30? Does it screw up the upper and lower bounds?
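
(Quick brute-force check of that bound, for reference:)

    # largest value in 1..100,000 whose digits sum to 30
    print(max(n for n in range(1, 100_001)
              if sum(int(d) for d in str(n)) == 30))  # -> 99930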
▲ | losvedir a day ago
Same for me, with this past year's Advent of Code. All the models until now have been stumped by Day 17 part 2. But Opus 4 finally got it! Good chance some of that is in its training data, though.
▲ | jonny_eh a day ago
> although it's possible Claude 4 was trained on that discussion lol

This is why we can't have consistent benchmarks
| ||||||||
▲ | kevindamm a day ago
I did a quick review of its final answer and it looks like there are logic errors. All three solutions get the max-value bound wrong (even with comments claiming 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solutions, but that's an optimization detail. The first version would, however, terminate early on the first occurrence of 3999 and take whatever the max value was up to that point, so for many inputs the first one (via solve_digit_sum_difference) is just wrong.

The second implementation (solve_optimized, not a great name either) and the third at least appear to be correct... but the pydoc and the comments in general are atrocious. In a review I would ask for these to be reworded, and would only expect juniors to even include anything similar in a pull request.

I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem's solution before and is possibly overfitting.

Not bad, but I wouldn't say it crushed it, and I wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run. Have you tried the same question with other sums besides 30?
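
A tiny brute-force reference would make that kind of spot check easy (a sketch of mine; solve_* meaning the functions from the gist):

    def reference(nums, target):
        hits = [n for n in nums if sum(int(d) for d in str(n)) == target]
        return max(hits) - min(hits) if hits else None

    # e.g. compare each solve_* function against reference(nums, 30) on the
    # same input, then change the target (and the matching constants in the
    # generated code) to see whether the hard-coded bounds break.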
|