namaria 7 hours ago

As usual with these bombastic blog posts, you have to follow the paper trail to find the caveats that deflate the whole thing.

From the papers about the tasks used in this estimation, we can easily find out that:

LLMs never exceed a 20% success rate on tasks taking more than 4h:

HCAST: Human-Calibrated Autonomy Software Tasks https://metr.org/hcast.pdf

LLMs hit a score ceiling around the 1-2h mark; performance drops off steeply as LOC count increases, never breaching a 50% success rate beyond 800 lines and falling below 40% on the longest task at ~1600 lines.

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts https://arxiv.org/pdf/2411.15114

And from the METR paper itself: only one model exceeds a 30-minute "human task length" at 50% success, reaching that rate at a 59-minute "human task length"; at 80% success, all models but one stay under 4 minutes, and the best reaches a 15-minute "human task length" at that rate.

The paper goes on to extrapolate from this SOTA 59-minute "time horizon" to ~168h, arguing that this is about one month of full-time work, and that models breaching 50% success rates at that time span could be considered transformative because they "would necessarily exceed human performance both at tasks including writing large software applications or founding startups (clearly economically valuable), and including novel scientific discoveries."

Now, these tasks are not even especially messy: "The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like 'write a good research paper' would score between 9/16 and 15/16, depending on the specifics of the task."

Yet according to the METR paper, on the 22 '50% most messy' >1h tasks, no model even breaches a 20% success rate. And that is for tasks above "messiness = 3.0", in a task set whose mean messiness is 3.2.

So there is no record at all of LLMs exceeding 50% success rates on 1h-long tasks above a 3.0 messiness rating, yet they are happy to claim a trend towards 50% success rates on 168h-long tasks approaching 9-15/16 messiness ratings? And that assumes their chart-topping estimate for the messiness of 'writing a good research paper' is even comparable to 'novel scientific discoveries' or 'writing large software applications or founding startups', which would seem to be many times messier than 'writing a good research paper', let alone doable in one month.

Measuring AI Ability to Complete Long Tasks https://arxiv.org/pdf/2503.14499

So let's talk about the blog post's claims that are not backed by the contents of these papers:

"Claude 3.7 could complete tasks at the end of February 2025 that would take a professional software engineer about one hour."

This is incredibly misleading. Claude 3.7 is the only model that achieved a 50% success rate on tasks estimated to take humans at least 1h to complete, and the METR paper also shows that on the 50% "messier" tasks no model even breaches a 20% success rate. Note that the HCAST set has 189 tasks, of which only 45 exceed a 1h baseline estimate. The METR paper uses a 'subset' of HCAST tasks, but it is not clear which ones, or what their baseline time-cost estimates look like.

"o3 gives us a chance to test these projections. We can see this curve is still on trend, if not going faster than expected"

This was a rushed evaluation conducted on a set of tasks that is different from that in the original paper, making the comparison between datasets spurious.

Also, this seems relevant:

"For the HCAST tasks, the main resource constraint is in the form of a token budget which is set to a high enough number that we do not expect it to be a limiting factor to the agent’s performance. For our basic agent scaffold, this budget is 2 million tokens across input and output (including reasoning) tokens. For the token-hungry scaffolds used with o1 this budget is 8 million tokens. For o3 and o4-mini, which used the same scaffold as o1 but did not support generating multiple completions in a single request, this budget is 16 million tokens."

https://metr.github.io/autonomy-evals-guide/openai-o3-report...

Back to the blog post:

"We can actually attempt to use the METR paper to try to derive AGI timelines using the following equation:

days until AGI = log2({AGI task length} / {Starting task length}) * {doubling time}"
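
Taken at face value, that formula is just a doubling count multiplied by a doubling time. A minimal Python sketch of it, using the paper's SOTA 59-minute 50%-success horizon, the blog's 168h target, and a ~7-month doubling time purely as illustrative inputs (these are the blog post's assumptions, not METR's results):

    import math

    def days_until_agi(agi_task_length_h, starting_task_length_h, doubling_time_days):
        # number of doublings needed, times the time per doubling
        doublings = math.log2(agi_task_length_h / starting_task_length_h)
        return doublings * doubling_time_days

    # e.g. from the ~59-minute 50%-success horizon to the blog's 168h target,
    # at a hypothetical 7-month (~210-day) doubling time:
    print(days_until_agi(168, 59 / 60, 210) / 365)  # ~4.3 years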

To find numbers to plug in, it makes assumptions like "1h45 upper bound seems low let's boost by 1.5x" and then "but real world is messy so let's just divide the time by 10", which flies in the face of the fact that on tasks above a messiness of 3/16 the models never breach 20% success rates. And considering the task set's mean messiness is only 3.2/16, anything meaningfully above that is a null data point. So this "Let's assume true AGI tasks are 10x harder" assumption alone should drive the "task length" to zero.

"Additionally, we likely need more than 50% reliability. Let’s assume we need 80% reliability, which adds a 4x penalty, plunging starting task length down further to 3min45sec."

At 80% reliability, only 2 models breach 4 minutes, with the best one approaching 15 minutes. And that is on a task set with a mean messiness score of 3.2. But the best model regresses relative to the two below it on the overall estimate, and none breaches even 20% reliability above that messiness.

So none of the modelling is valid, but the sleight of hand works like this: offer the AGI formula, the 168h task-length value for 'AGI' (spurious, of course, since we're then talking about an 'AGI' that cannot do any task messier than 3.2/16), and a doubling rate of 3 to 7 months, and we are stuck accepting that AGI is at most 10 years away, as the blog post claims. But at most we can expect models that can edit a few hundred lines of code at a 50% reliability rate on projects that would take around a month. And the METR paper has its own projection of a "1-month AI" (which it never claims is AGI), with a 50% chance of arriving between the end of 2027 and early 2028. So let me know if you have any one-month projects under 800 LOC: in about 2-3 years there's a 50% chance you'll be able to use an LLM that has a 50% shot at eventually getting it right!