| ▲ | simonw a day ago |
| I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task you can set an agent on that justifies it working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the 9,200-test html5lib-tests suite: https://simonwillison.net/2025/Dec/15/porting-justhtml/ It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a genuinely difficult problem through sheer brute force. |
|
| ▲ | dwohnitmok a day ago | parent | next [-] |
| To be clear, this doesn't mean that it takes the AI > 4 hours to do the task. METR measures the difficulty of tasks by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours (at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure). It is silent on how long it takes AI systems to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter). This is of course highly correlated with an AI system being able to churn through a task for a long time, but it's not necessarily the same thing. Of course, the big questions are going to arise if/when we start passing lines like 8 hours (a whole work day) or 40 hours (a whole work week). |
|
| ▲ | ehnto a day ago | parent | prev | next [-] |
| I think you might be misunderstanding the article, actually: this is about AI solving tasks as measured by how long it takes a human to solve them. The AI could potentially solve a task much quicker, but the use of "human time to solve" is an attempt to create a metric that reveals long-horizon complexity (as I understand it, anyway). It's interesting because, as the article notes, AI is really smashing benchmarks, but actual usefulness in automating thought work is proving much more elusive. I think that collective experience of AI just not being that useful, or not as useful as the benchmarks suggest it should be, is captured in this metric. |
| |
| ▲ | rishabhaiover 14 hours ago | parent [-] | | I've practiced a healthy skepticism of the recent boom, but I can't see a reason why the long-horizon time wouldn't stretch to 8 hours or a week's worth of effort by next year. After Opus 4.5, governments and organizations should really figure out a path out of this storm, because we're in it now. | | |
| ▲ | theptip 5 hours ago | parent [-] | | Doubling time has been 7 months for a while, so you should expect 8h not 1 week next year. | | |
| ▲ | rishabhaiover an hour ago | parent | next [-] | | Predictions from historical data in a landscape with fragile priors don't seem like a strong basis to me (a useful approximation at best). | |
| ▲ | dwohnitmok 2 hours ago | parent | prev [-] | | The doubling time has accelerated significantly, to about 4 months since the beginning of 2025, which puts 1 week within reach if things stay on trend. But yes, 7 months is the more reliable long-term figure. |
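As a rough back-of-the-envelope illustration of what those doubling rates imply, here is a sketch in JavaScript; the ~4.5 h starting horizon and the doubling periods are assumptions for illustration, not METR's published fit:
// Rough projection of the 50%-reliability time horizon under exponential growth.
// The ~4.5 h starting point and the doubling periods are illustrative assumptions.
function horizonAfter(months, startHours, doublingMonths) {
  return startHours * Math.pow(2, months / doublingMonths);
}

for (const doublingMonths of [7, 4]) {
  const hours = horizonAfter(12, 4.5, doublingMonths);
  console.log(`doubling every ${doublingMonths} months: ~${hours.toFixed(0)} h after 12 months`);
}
// Doubling every 7 months: ~15 h after a year (the 8 h mark is crossed after one
// doubling, i.e. ~7 months; a 40 h week would take roughly two more years on that trend).
// Doubling every 4 months: ~36 h after a year, so a full work week comes within reach.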
|
|
|
|
| ▲ | twotwotwo a day ago | parent | prev | next [-] |
| METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work. "Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend. In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you. |
| |
| ▲ | TobiasJBeers 21 hours ago | parent | next [-] | | The “50% time horizon” feels most actionable when you pair it with an expected-value model.
For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost).
A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are. | | |
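A minimal sketch of that expected-value model in JavaScript, with purely illustrative numbers (the hourly rate, failure probability, and costs below are assumptions, not measurements):
// Expected value of delegating one task to an agent, per the EV model above.
// All inputs are illustrative assumptions.
function taskEV({ humanHoursSaved, hourlyRate, pFail, costOfFailure, oversightCost }) {
  return humanHoursSaved * hourlyRate - pFail * costOfFailure - oversightCost;
}

// Low failure-cost work: cheap to roll back, so 50% reliability is still net-positive.
console.log(taskEV({ humanHoursSaved: 4, hourlyRate: 100, pFail: 0.5, costOfFailure: 100, oversightCost: 50 })); // 300

// Expensive-to-debug work: the same 50% failure rate makes it net-negative.
console.log(taskEV({ humanHoursSaved: 4, hourlyRate: 100, pFail: 0.5, costOfFailure: 2000, oversightCost: 50 })); // -650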
| ▲ | twotwotwo 21 hours ago | parent [-] | | Yeah--it's difficult to go from a benchmark involving the model attempting things alone to the effect of assisting people on real tasks because, well, ideally you'd measure that with real people doing real tasks. Last time METR tried that (in early '25) they found a net slowdown rather than any speedup at all. Go figure! |
| |
| ▲ | nightshift1 21 hours ago | parent | prev [-] | | >which human The second graph has this under it: The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years... | | |
| ▲ | twotwotwo 21 hours ago | parent [-] | | Yeah--I wanted a short way to gesture at the subsequent "tasks that are fast for someone but not for you are interesting," and did not mean it as a gotcha on METR, but I should've taken a second longer and pasted what they said rather than doing the "presumably a human competent at the task" handwave that I did. | | |
| ▲ | nightshift1 19 hours ago | parent [-] | | I agree. After all, benchmarks don't mean much, but I guess they are fine as long as they keep measuring the same thing every time.
Also, the context matters. In my case, I see a huge difference between the gains at work vs those at home on a personal project, where I don't have to worry about corporate policies, security, correctness, standards, etc. I can let the LLM fly and not worry about losing my job in record time. |
|
|
|
|
| ▲ | Jcampuzano2 13 hours ago | parent | prev | next [-] |
| How are you guys even doing long tasks with plain Codex or Claude Code? I use Claude Code and I get hit with a permissions prompt every 2 seconds for anything I try to do. Sure, I can turn off all the permission checks, but it'd honestly probably stop and claim it's finished well before it actually is in most cases, from my experience. To be fair I haven't tried Codex, so maybe it's better at this, but in my experience almost every model stops at some point and claims victory, or stops and tells me something like "next we'll continue on with XYZ", at which point I have to prompt it to continue. |
| |
| ▲ | stavros 10 hours ago | parent | next [-] | | Codex (at least 5 and 5.1) is bad at asking for permission. Whenever it wants to run pre-commit or platformio, it tries to do that, which fails because of the sandbox, and then Codex decides something is wrong with the cache directory and keeps asking for permission to sudo chown ~/.cache, every time. I have to specifically tell it to request permission for the command it wants to run, and then it works. Very annoying, and very annoying that it can't persist the permission the way Claude Code can, so that it doesn't have to ask again every single time. | |
| ▲ | simonw 12 hours ago | parent | prev [-] | | You have to use the --yolo or --dangerously-skip-permissions options. Thankfully the cloud versions (Claude Code for web, Codex Cloud) run like that already, and are relatively safe in that if anything goes wrong it happens on someone else's computer. |
|
|
| ▲ | lifis 20 hours ago | parent | prev | next [-] |
| Quickly looking at the source code, mostly the treeBuilder and tokenizer, I do see several possible improvements:
- Use TypeScript instead of JavaScript
- Use perfect hashes instead of ["a", "b", "c"].includes() idioms, string equalities, Sets, etc.
- Use a single perfect hash to match all tag/attribute names and then use enums in the rest of the codebase (a rough sketch of this idea follows at the end of this list)
- Use a single if (token.kind === Tag.START) instead of repeating that check for 10 consecutive conditionals
- Don't return the "reprocess" constant; use an enum, or perhaps nothing at all if "reprocess" is the only option
- Try tail recursion instead of a switch over the state in the tokenizer
- Use switches (best after a perfect hash lookup) instead of multiple ifs on characters in the tokenizer
- "treeBuilder.openElements = treeBuilder.open_elements;" can't possibly be good code. Perhaps the agent can find these itself if told to make the code perfect and not just pass the tests |
| |
| ▲ | simonw 14 hours ago | parent [-] | | Thanks for the feedback - I pasted it into a Claude Code session on my phone, here's the resulting PR: https://github.com/simonw/justjshtml/pull/7 I didn't include the TypeScript bit though - it didn't use TypeScript because I don't like adding a build step to my JavaScript projects if I can possibly avoid it. The agent would happily have used TypeScript if I had let it. I don't like that openElements = open_elements pattern either - it did that because I asked it for a port of a Python library and it decided to support the naming conventions for both Python and JavaScript at once. I told it to remove all of those. I had it run a micro-benchmark too against the before and after - here's the code it used for that: https://github.com/simonw/justjshtml/blob/a9dbe2d7c79522a76f... BEFORE:
Input: 87,707 bytes
Average: 7.846 ms
Ops/sec: 127.5
AFTER (with your suggestions applied):
Average: 7.769 ms
Ops/sec: 128.7 (1% improvement)
It pushed back against the tail recursion suggestion: > The current implementation uses a switch statement in step(). JavaScript doesn't have proper tail call optimization (only Safari implements it), so true tail recursion would cause stack overflow on large documents. |
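For reference, a micro-benchmark of roughly that shape can be sketched as below (the module path, fixture file, and iteration count are assumptions for illustration; the actual script at the linked commit may differ):
// Minimal micro-benchmark sketch: parse the same document repeatedly and
// report average time and ops/sec. The entry point and fixture path are
// hypothetical; adjust to the real module.
import { readFileSync } from "node:fs";
import { parse } from "./justjshtml.js"; // hypothetical entry point

const buf = readFileSync("bench/fixture.html"); // hypothetical fixture
const html = buf.toString("utf8");
const iterations = 200;

// Warm-up so the JIT settles before timing.
for (let i = 0; i < 20; i++) parse(html);

const start = performance.now();
for (let i = 0; i < iterations; i++) parse(html);
const avgMs = (performance.now() - start) / iterations;

console.log(`Input: ${buf.length.toLocaleString()} bytes`);
console.log(`Average: ${avgMs.toFixed(3)} ms`);
console.log(`Ops/sec: ${(1000 / avgMs).toFixed(1)}`);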
|
|
| ▲ | tacitusarc a day ago | parent | prev | next [-] |
| My problem with the OpenAI models (GPT-5.2 in particular) recently is an extreme aversion to doing more than the smallest step in a task before asking for user input. Even if I explicitly instruct it to continue without input until the task is complete, it ignores the instruction. I cannot imagine GPT-5.2 working on a task for more than 2 minutes, let alone 4 hours. I'm curious if you've run into this and figured out a way around it? |
| |
| ▲ | simonw 21 hours ago | parent | next [-] | | I've not had that problem at all with GPT-5.2 running in Codex CLI. I use prompts like this: Build a pure JavaScript library (no dependencies) for encoding and
decoding this binary format. Start by looking at how the lite3-python
library works - the JavaScript one should have the same API and probably the
same code design too. Build the JS one in lite3-javascript - it should be a
single JavaScript module which works in both Node.js and in the browser.
There should be a test script that runs with Node.js which runs against the
files in the lite3-python/format_suite folder. Write the test script first,
run it and watch it fail, then build the JavaScript library and keep running
the tests until they pass.
| | |
| ▲ | tacitusarc 9 hours ago | parent [-] | | I have not tried it in Codex CLI, I’ll give that a shot and see if it changes things. |
| |
| ▲ | macrolime 13 hours ago | parent | prev | next [-] | | I find that surprising. GPT-5.2 is the model I've had working the longest. It frequently works more than 4 hours nonstop, while earlier models would stop to ask if they should continue every 10 minutes. 5.1 and earlier ignore it if I ask them to continue until a task is done, but 5.2 will usually finish it. | |
| ▲ | BoiledCabbage a day ago | parent | prev [-] | | What agent framework are you using? It can differ from one to the next on the same model. | | |
|
|
| ▲ | visarga 21 hours ago | parent | prev | next [-] |
| You should take into consideration the time it took to create those 9,200 tests originally. If you have good test coverage, the agent can go much further on its own. |
| |
| ▲ | dangus 20 hours ago | parent [-] | | Heh, I mostly use AI in the opposite direction, to write tests, because: 1. That's the part of development work I hate the most and it never really clicked with me 2. AI up to this point seems to be better at writing tests than code Take this with the grain of salt that: 1. I suck 2. My work is mostly in the realm of infrastructure, where testing has always been weird and a little dumb | |
| ▲ | 9rx 15 hours ago | parent [-] | | AI has become very good at writing pointless and bad tests, at least. It remains difficult to compel it to write good tests consistently. But even if it wrote great tests every time, the trouble is that testing was designed around the idea of "double entry accounting". Even great tests can test the wrong thing. In the old world you would write a test case and then implement something to satisfy the same. If both sides of the ledger agree, so to speak, you can be pretty confident that both are correct. — In other words, going through the process of implementation gives an opportunity to make sure the test you wrote isn't ill-conceived or broken itself. If you only write the tests, or only write the implementation, or write none of it, there is no point at which you can validate your work. If you have already built up an application and are reusing its test suite to reimplement the software in another language, like above, that is one thing, but in greenfield work it remains an outstanding problem of how to validate the work when you start to involve AI agents. Another article posted here recently suggests that we can go back to manual testing to validate the work... But that seems like a non-solution. | | |
| ▲ | visarga 8 hours ago | parent [-] | | Every error is a signal you need better tests. You can let the LLM create tests for every error it stumbles into, besides all the regular tests it can write on its own. Add all the test scenarios you can think of, since you are not implementing them by hand. A bad test is invalidated by the code, and bad code is invalidated by the tests, so between the two the AI agent can become reliable. |
|
|
|
|
| ▲ | hatefulheart 21 hours ago | parent | prev | next [-] |
| Simon, have you got to the point where you just don't read the article? Others have pointed out that your interpretation of "long task" is not the same as the article's. Maybe these are the negative effects of excessive LLM usage that people talk about. |
| |
| ▲ | simonw 21 hours ago | parent [-] | | They were right. I hadn't read enough of the article to understand what was meant by multi-hour tasks. I upvoted them for pointing that out. | | |
| ▲ | lelanthran 12 hours ago | parent [-] | | >> Maybe these are the negative effects of excessive LLM usage that people talk about. > I upvoted them for pointing that out. I'm also curious what you think about the GP's question. TBH, responding after reading half an article was a common thing for most people pre-LLM anyway. | |
| ▲ | simonw 9 hours ago | parent [-] | | Yeah, show me a Hacker News user who's never posted a comment on a story without properly reading it (or even without clicking the link). LLMs have nothing to do with it. If I had piped the article through an LLM first, I wouldn't have made the embarrassing mistake in that comment! |
|
|
|
|
| ▲ | noosphr a day ago | parent | prev [-] |
| What's more amazing is how fast your account empties when they do that. |
| |
| ▲ | fragmede a day ago | parent [-] | | it's $200/month for the "unlimited" plan. | | |
| ▲ | noosphr a day ago | parent [-] | | It's amazing how fast your account hits usage limits. | | |
| ▲ | lelanthran 12 hours ago | parent [-] | | I think GP was being sarcastic: they did say that the plans were "unlimited". I read
  It's "unlimited"
and
  It's unlimited
quite differently. |
|
|
|