▲ | minimaxir a day ago
So I decided to try Claude 4 Sonnet against the benchmark I previously ran against Claude 3.5 Sonnet: "Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30." (https://news.ycombinator.com/item?id=42584400)

The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem, applying the relevant micro-optimizations mentioned in that HN discussion. Oddly, in the second pass it a) regresses from a vectorized approach to a linear one and b) generates and iterates on three separate solutions instead of converging on one final version. It's also possible Claude 4 was trained on that discussion, lol.

EDIT: "utterly crushed" may have been hyperbole.
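
For anyone who wants to follow along without opening the gist, the prompt boils down to something like this naive baseline (a rough sketch of mine, not one of Claude's actual solutions):

    # Minimal, unoptimized baseline of the prompt.
    import random

    def digit_sum(n: int) -> int:
        s = 0
        while n:
            s += n % 10
            n //= 10
        return s

    nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
    hits = [n for n in nums if digit_sum(n) == 30]
    print(max(hits) - min(hits))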
▲ | diggan a day ago
> although it's possible Claude 4 was trained on that discussion lol

Almost guaranteed, especially since HN is popular in tech circles and trivial to scrape in its entirety in a couple of hours via the Algolia API.

Recommendation for the future: keep your benchmarks/evaluations private, otherwise they're basically useless once newer models have been trained on your data. This is what I do, and I usually don't see the "huge improvements" that public benchmarks seem to indicate when new models appear.
| ||||||||
▲ | bsamuels a day ago
as soon as you publish a benchmark like this, it becomes worthless because it can be included in the training corpus
| ||||||||
▲ | thethirdone a day ago
The first iteration, vectorized with numpy, is the best solution imho. The only additional optimization is using modulo 9: a number's digit sum is congruent to the number mod 9, and 30 mod 9 is 3, so the filter keeps only about 1/9th of the numbers. The digit summing is the slow part, so reducing the number of values there results in a large speedup, and numpy can do that filter pretty fast as `arr = arr[arr % 9 == 3]`.

With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to manually iterate for speed.
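
Concretely, the numpy version with that prefilter looks something like this (my rough sketch of the idea, not the exact code from the gist):

    import numpy as np

    arr = np.random.randint(1, 100_001, size=1_000_000)

    # Digit sum is congruent to the number mod 9, and 30 % 9 == 3,
    # so this keeps only ~1/9th of the values before the expensive step.
    arr = arr[arr % 9 == 3]

    # Vectorized digit sum (values here are at most 6 digits).
    digits = (arr[:, None] // 10 ** np.arange(6)) % 10
    hits = arr[digits.sum(axis=1) == 30]
    print(hits.max() - hits.min())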
▲ | isahers32 a day ago
Might just be missing something, but isn't 9+9+9+9+3 = 39? I believe the largest possible number is 99930. Also, it could further optimize by terminating the digit-sum calculation early once the sum exceeds 30, or once it can no longer reach 30 (i.e. digits remaining * 9 is less than 30 - current_sum). imo this is pretty far from "crushing it".
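
By early termination I mean something like this (rough sketch, not code from the gist):

    def digits_sum_to_30(n: int) -> bool:
        total = 0
        remaining = len(str(n))  # digits left to inspect
        while n:
            total += n % 10
            remaining -= 1
            if total > 30:                    # overshot, stop early
                return False
            if total + remaining * 9 < 30:    # can no longer reach 30
                return False
            n //= 10
        return total == 30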
▲ | Epa095 a day ago
I find it weird that it does an inner check on `num > 99999`, which in this range only ever excludes 100,000 itself. It could check against 99930 (the largest value whose digits sum to 30), though I doubt even that makes it much faster.

But have you checked with some target other than 30? Does it screw up the upper and lower bounds?
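
(Quick brute-force check of that bound, for reference:)

    # largest value in 1..100,000 whose digits sum to 30
    print(max(n for n in range(1, 100_001)
              if sum(int(d) for d in str(n)) == 30))  # -> 99930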
▲ | losvedir a day ago
Same for me, with this past year's Advent of Code. All the models until now have been stumped by Day 17 part 2. But Opus 4 finally got it! Good chance some of that is in its training data, though.
▲ | jonny_eh a day ago
> although it's possible Claude 4 was trained on that discussion lol

This is why we can't have consistent benchmarks
| ||||||||
▲ | kevindamm a day ago
I did a quick review of its final answer and it looks like there are logic errors. All three solutions get the max-value bound wrong (even with comments claiming 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solutions, but that's an optimization detail. The first version would, however, terminate early on the first occurrence of 3999 and take whatever the max value was up to that point, so for many inputs the first one (via solve_digit_sum_difference) is just wrong.

The second implementation (solve_optimized, not a great name either) and the third at least appear to be correct... but the pydoc and the comments in general are atrocious. In a review I would ask for these to be reworded, and would only expect juniors to even include anything similar in a pull request.

I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem's solution before and is possibly overfitting.

Not bad, but I wouldn't say it crushed it, and I wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run. Have you tried the same question with other sums besides 30?
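
A tiny brute-force reference would make that kind of spot check easy (a sketch of mine; solve_* meaning the functions from the gist):

    def reference(nums, target):
        hits = [n for n in nums if sum(int(d) for d in str(n)) == target]
        return max(hits) - min(hits) if hits else None

    # e.g. compare each solve_* function against reference(nums, 30) on the
    # same input, then change the target (and the matching constants in the
    # generated code) to see whether the hard-coded bounds break.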
|