| ▲ | KaoruAoiShiho a day ago |
| Is this really worthy of a Claude 4 label? Was there a new pre-training run? Because this feels like 3.8... only SWE went up significantly, and as we all understand by now, that is done by cramming on specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, which says to me that it's not really smarter. |
|
| ▲ | minimaxir a day ago | parent | next [-] |
| So I decided to try Claude 4 Sonnet against my "Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30" benchmark, which I previously ran against Claude 3.5 Sonnet: https://news.ycombinator.com/item?id=42584400 The results are here (https://gist.github.com/minimaxir/1bad26f0f000562b1418754d67... ) and it utterly crushed the problem with the relevant micro-optimizations commented in that HN discussion (oddly, in the second pass it a) regresses from a vectorized approach to a linear approach and b) generates and iterates on three different implementations instead of one final one), although it's possible Claude 4 was trained on that discussion lol. EDIT: "utterly crushed" may have been hyperbole. |
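(For reference, a plain-Python baseline for this benchmark prompt might look like the sketch below — my own illustration for readers following along, not Claude's output.)

```python
import random

def digit_sum(n: int) -> int:
    # sum of decimal digits by repeated divmod
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
matches = [n for n in nums if digit_sum(n) == 30]
print(max(matches) - min(matches))
```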
| |
| ▲ | diggan a day ago | parent | next [-] | | > although it's possible Claude 4 was trained on that discussion lol Almost guaranteed, especially since HN tends to be popular in tech circles, and it's also trivial to scrape the entire thing in a couple of hours via the Algolia API. Recommendation for the future: keep your benchmarks/evaluations private, as otherwise they become basically useless once more models are published that were trained on your data. This is what I do, and I usually don't see the "huge improvements" that other public benchmarks seem to indicate when new models appear. | | |
| ▲ | dr_kiszonka a day ago | parent [-] | | >> although it's possible Claude 4 was trained on that discussion lol > Almost guaranteed, especially since HN tends to be popular in tech circles, and also trivial to scrape the entire thing via the Algolia API. I am wondering if this could be cleverly exploited. <twirls mustache> |
| |
| ▲ | bsamuels a day ago | parent | prev | next [-] | | as soon as you publish a benchmark like this, it becomes worthless because it can be included in the training corpus | | |
| ▲ | rbjorklin a day ago | parent [-] | | While I agree with you in principle, give Claude 4 a try on something like https://open.kattis.com/problems/low. I would expect this to have been included in the training material, along with solutions found on GitHub. I've tried providing the problem description and asking Claude Sonnet 4 to solve it, and so far it hasn't been successful. |
| |
| ▲ | thethirdone a day ago | parent | prev | next [-] | | The first iteration, vectorized with numpy, is the best solution imho. The only additional optimization is using modulo 9 to get the digit sum mod 9; since a number is congruent to its digit sum mod 9, that filter keeps only about 1/9th of the numbers. The digit summing is the slow part, so reducing the number of values there results in a large speedup. Numpy can do that filter pretty fast as `arr = arr[arr % 9 == 3]`. With that optimization it's about 3 times faster, and all of the non-numpy solutions are slower than the numpy one. In Python it almost never makes sense to manually iterate for speed. | |
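A minimal sketch of the numpy approach described above, with the mod-9 pre-filter (illustrative code, not the model's actual output; the function name is made up):

```python
import numpy as np

def digit_sum_30_range(arr: np.ndarray) -> int:
    # digit sum ≡ n (mod 9), so any value whose digits sum to 30 satisfies n % 9 == 3
    candidates = arr[arr % 9 == 3]
    digits = candidates.copy()
    sums = np.zeros_like(candidates)
    for _ in range(6):              # values are at most 100,000, i.e. 6 digits
        sums += digits % 10
        digits //= 10
    matches = candidates[sums == 30]
    return int(matches.max() - matches.min())

rng = np.random.default_rng(0)
arr = rng.integers(1, 100_001, size=1_000_000)
print(digit_sum_30_range(arr))
```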
| ▲ | isahers32 a day ago | parent | prev | next [-] | | Might just be missing something, but isn't 9+9+9+9+3 = 39, not 30? The largest qualifying number, I believe, is 99930. Also, it could further optimize by terminating the digit-sum calculation early if the running sum goes above 30 or can no longer reach 30 (i.e., the number of digits remaining * 9 is less than 30 - current_sum). imo this is pretty far from "crushing it" | |
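A sketch of the early-exit check described above (my own illustration; `digits_sum_to` is a hypothetical helper name, not from the thread):

```python
def digits_sum_to(n: int, target: int = 30) -> bool:
    remaining = len(str(n))   # digits left to inspect
    s = 0
    while n:
        s += n % 10
        n //= 10
        remaining -= 1
        # bail out once the target is exceeded or can no longer be reached
        if s > target or s + 9 * remaining < target:
            return False
    return s == target
```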
| ▲ | Epa095 a day ago | parent | prev | next [-] | | I find it weird that it does an inner check on 'num > 99999', which pretty much only filters out 100,000 itself. It could check against 99930 (the largest possible value with digit sum 30), but I doubt even that check makes it much faster. Have you tried with some other target than 30? Does it screw up the upper and lower bounds? | |
| ▲ | losvedir a day ago | parent | prev | next [-] | | Same for me, with this past year's Advent of Code. All the models until now have been stumped by Day 17 part 2. But Opus 4 finally got it! Good chance some of that is in its training data, though. | |
| ▲ | jonny_eh a day ago | parent | prev | next [-] | | > although it's possible Claude 4 was trained on that discussion lol This is why we can't have consistent benchmarks | | |
| ▲ | teekert a day ago | parent [-] | | Yeah, I agree. Also, what is the use of that benchmark? Who cares? How does it relate to stuff that does matter? |
| |
| ▲ | kevindamm a day ago | parent | prev [-] | | I did a quick review of its final answer and it looks like there are logic errors. All three of them get the incorrect max-value bound (even with comments saying 9+9+9+9+3 = 30), so early termination wouldn't happen in the second and third solutions, but that's an optimization detail. The first version would, however, early terminate on the first occurrence of 3999 and take whatever the max value was up to that point. So, for many inputs the first one (via solve_digit_sum_difference) is just wrong. The second implementation (solve_optimized, not a great name either) and the third implementation at least appear to be correct... but the docstrings and the comments in general are atrocious. In a review I would ask for these to be reworded, and would only expect juniors to include anything similar in a pull request. I'm impressed that it's able to pick a good line of reasoning, and even if it's wrong about the optimizations it did give a working answer... but in the body of the response and in the code comments it clearly doesn't understand digit extraction per se, despite parroting code about it. I suspect you're right that the model has seen the problem solution before, and is possibly overfitting. Not bad, but I wouldn't say it crushed it, and I wouldn't accept any of its micro-optimizations without benchmark results, or at least a benchmark test that I could then run. Have you tried the same question with other sums besides 30? | |
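A minimal timing harness of the kind asked for here could look like the sketch below (my own illustration; the `candidates` dict is a placeholder where the model-generated functions would be dropped in):

```python
import random
import timeit

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]

def baseline(values):
    # straightforward reference implementation to compare against
    matches = [n for n in values if sum(int(d) for d in str(n)) == 30]
    return max(matches) - min(matches)

# add the model-generated functions here to compare them against the baseline
candidates = {"baseline": baseline}

for name, fn in candidates.items():
    total = timeit.timeit(lambda: fn(nums), number=3)
    print(f"{name}: {total / 3:.3f}s per run")
```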
|
|
| ▲ | wrsh07 a day ago | parent | prev | next [-] |
My understanding of the original OpenAI and Anthropic labels was essentially: GPT-2 was 100x more compute than GPT-1. Same for 2 to 3. Same for 3 to 4. Thus, GPT-4.5 was 10x more compute.^ If Anthropic is doing the same thing, then 3.5 would be 10x more compute vs 3, 3.7 might be 3x more than 3.5, and 4 might be another ~3x. ^ I think this maybe involves words like "effective compute", so it might not be a full pretrain, but it might be! If you used 10x more compute, that could mean doubling the amount used on pretraining and then using 8x compute on post-training, or some other distribution. |
| |
| ▲ | swyx a day ago | parent [-] | | Beyond 4 that's no longer true - marketing took over from the research | | |
| ▲ | wrsh07 a day ago | parent [-] | | Oh shoot, I thought that still applied to 4.5, just in a more "effective compute" way (not 100x more parameters, but 100x more compute in training). But alas, it's not like a "3nm" fab node means the literal thing either. Marketing always dominates (and not necessarily in a way that adds clarity). |
|
|
|
| ▲ | Bjorkbat a day ago | parent | prev | next [-] |
| I was about to comment on a past remark from Anthropic that the whole reason for the convoluted naming scheme was that they wanted to wait until they had a model worthy of the "Claude 4" title. But because of all the incremental improvements since then, the irony is that this merely feels like another incremental improvement. It obviously is a huge leap when you consider that the best Claude 3 ever got on SWE-bench Verified was just under 20% (combined with SWE-agent), but compared to Claude 3.7 it doesn't feel like that big of a deal, at least when it comes to SWE-bench results. Is it worthy? Sure, why not, compared to the original Claude 3 at any rate, but this habit of incremental improvement means that a major new release feels kind of ordinary. |
|
| ▲ | causal a day ago | parent | prev | next [-] |
| There's even a slight decrease from Sonnet 3.7 in a few areas. As always, benchmarks say one thing; I'll need some hands-on time with it to form a subjective opinion. |
|
| ▲ | jacob019 a day ago | parent | prev | next [-] |
| Hey, at least they incremented the version number. I'll take it. |
|
| ▲ | mike_hearn a day ago | parent | prev | next [-] |
| They say in the blog post that tool use has improved dramatically: parallel tool use, the ability to use tools during thinking, and more. |
|
| ▲ | oofbaroomf a day ago | parent | prev | next [-] |
| The improvement from Claude 3.7 wasn't particularly huge. The improvement from Claude 3, however, was. |
|
| ▲ | drusepth a day ago | parent | prev | next [-] |
| To be fair, a lot of people said 3.7 should have just been called 4. Maybe they're just bridging the gap. |
|
| ▲ | greenfish6 a day ago | parent | prev | next [-] |
| Benchmarks don't tell you as much as the actual coding vibe though |
|
| ▲ | zamadatix a day ago | parent | prev | next [-] |
| It feels like the days of Claude 2 -> 3 or GPT 2 -> 3 level changes for the leading models are over, so you're either going to end up with really awkward version numbers or just embrace it and increment the number. Nobody cares that a Chrome update gives a major version change of 136 -> 137 instead of 12.4.2.33 -> 12.4.3.0, for similar kinds of "the version number doesn't always have to represent the amount of work/improvement compared to the previous release" reasoning. |
| |
| ▲ | saubeidl a day ago | parent | next [-] | | It feels like LLM progress in general has kinda stalled and we're only getting small incremental improvements from here. I think we've reached peak LLM - if AGI is a thing, it won't be through this architecture. | | |
| ▲ | nico a day ago | parent | next [-] | | Diffusion LLMs seem like they could be a huge change. Check this out from yesterday (watch the short video here): https://simonwillison.net/2025/May/21/gemini-diffusion/ From: https://news.ycombinator.com/item?id=44057820 | | | |
| ▲ | bcrosby95 a day ago | parent | prev | next [-] | | Even if LLMs never reach AGI, they're good enough that a lot of very useful tooling can be built on top of and around them. I think of it more like the introduction of computing or the internet. That said, whether being a provider of these services is a profitable endeavor is still unknown. There's a lot of subsidizing going on, and some of the lower-value uses might fall by the wayside as companies eventually need to make money off this stuff. | |
| ▲ | michaelbrave a day ago | parent [-] | | This was my take as well, though after a while I've started thinking of it as closer to the introduction of electricity, which in a lot of ways is considered the second stage of the industrial revolution; the internet and AI might be considered the second stage of the computing revolution (or so I expect history books to label it). But just like electricity, it doesn't seem to be very profitable for the providers, while being highly profitable for everything that uses it. | |
| |
| ▲ | jenny91 a day ago | parent | prev | next [-] | | I think it's a bit early to say. At least in my domain, the models released this year (Gemini 2.5 Pro, etc.) are crushing models from last year. I would therefore not, by any means, be ready to call the situation a stall. | |
| ▲ | goatlover a day ago | parent | prev [-] | | Which brings up the question of why AGI is a thing at all. Shouldn't LLMs just be tools to make humans more productive? | | |
| ▲ | amarcheschi a day ago | parent [-] | | Think of the poor VCs who are selling AGI as the second coming of Christ |
|
| |
| ▲ | onlyrealcuzzo a day ago | parent | prev [-] | | Did you see Gemini 1.5 Pro vs 2.5 Pro? | | |
| ▲ | zamadatix a day ago | parent [-] | | Sure, but despite there being a 2.0 release in between (which they didn't even feel the need to release a Pro for), it still isn't the kind of GPT 2 -> 3 improvement we were hoping would continue for a bit longer. Companies will continue to release these incremental improvements, which are all neck-and-neck with each other. That's fine and good; just don't expect the versioning to represent the same relative difference rather than the relative release increment. | |
| ▲ | cma a day ago | parent [-] | | I'd say 1.5 Pro to 2.5 Pro was a 3 -> 4 level improvement, but the problem is 1.5 Pro wasn't state of the art when released, except for context length, and 2.5 wasn't that kind of improvement compared to the best OpenAI or Claude stuff available when it released. 1.5 Pro was worse than the original GPT-4 on several coding things I tried head to head. |
|
|
|
|