wongarsu 15 hours ago

I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the trend matches what you would expect from a logistic model: improvements have slowed down, but not stopped.
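To make the "logistic model" claim concrete: a minimal sketch of the standard logistic curve, with purely illustrative parameters (L, k, t0 are not fitted to any real benchmark data). The point is that after the midpoint, the per-release gains shrink monotonically but never reach zero, which is consistent with "slowed down, but not stopped."

```python
import math

def logistic(t: float, L: float = 1.0, k: float = 1.0, t0: float = 0.0) -> float:
    """Standard logistic curve: capability saturates toward ceiling L."""
    return L / (1.0 + math.exp(-k * (t - t0)))

# Gains per unit time after the midpoint t0=0: positive, but shrinking.
for t in range(4):
    gain = logistic(t + 1) - logistic(t)
    print(f"t={t} -> t={t + 1}: gain {gain:.3f}")
```

Each successive gain is smaller than the last, yet all are strictly positive; plotted against releases, that looks like a plateau only if you squint.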

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...

yorwba 14 hours ago | parent | next [-]

Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
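A toy illustration of that argument, with an assumed task of n sequential steps and an identical, independent per-step success rate p (both numbers are hypothetical, not taken from SWE-bench): end-to-end completion is p**n, which sits near zero across a wide range of p and then jumps sharply as p approaches 1.

```python
def completion_rate(p: float, n: int) -> float:
    """Probability of finishing all n steps with no mistake,
    assuming independent per-step success rate p."""
    return p ** n

n = 50  # hypothetical: a PR requiring ~50 correct steps in a row
for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-step {p:.3f} -> end-to-end {completion_rate(p, n):.3f}")
```

The per-step metric improves smoothly from 0.90 to 0.999, but the end-to-end pass rate stays in the single digits until the last jump, where it leaps past 90%. Tracking p directly shows steady progress that the all-or-nothing metric hides.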

thesz 3 hours ago | parent | next [-]

  > until all potential sources of error are close to being eliminated
This is what PSP/TSP prescribed: one has to (continually) review one's own work to identify the most frequent sources of (user-facing) defects.

  >  if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
This is also one of the tenets of PSP/TSP. If you have a task with an estimate longer than a day (8 hours), break it down.

This is fascinating. The LLM community is discovering PSP/TSP rules that were laid down more than twenty years ago.

What the LLM community misses is that in PSP/TSP it is the individual software developer who is responsible for figuring out what they need to look after.

What I see instead is LLM users trying to rein in what they perceive as errors. It's not that the LLMs are learning; it is that their users are trying to strong-arm them with prompts.

Bombthecat 6 hours ago | parent | prev [-]

That's how the public perceives it, though.

It's useless and never gets better, until it suddenly, unexpectedly gets good enough.

ForHackernews 5 hours ago | parent [-]

My robo-chauffeur kept crashing into different things until one day he didn't.

roxolotl 14 hours ago | parent | prev [-]

I don't know; to me that graph doesn't show Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? But I'd much rather add code in a different style to my codebase than code that breaks other code. And even ignoring that, the pass rate is almost identical between the two models.