| ▲ | meander_water a day ago |
| > So we ran it head-to-head against Claude Opus 4.8: same one-shot prompt, build a 3D platformer in raw WebGL from scratch |
| Running a single one-shot prompt is not a benchmark, nor is it representative of any sort of real-world usage. Most agent usage is collaborative, so you need to test things like reliability (when I delegate a task, does it complete it without, for example, making up test results?) and steerability (does it obey my instructions, or does it just do what it thinks is best?). |
|
| ▲ | jameswhitford a day ago | parent | next [-] |
| Hi, I am the author, and I completely agree! I set out to run a vibe test on this one, not a benchmark; the real benchmarks are listed. My test shows what the models can do when both are tasked with a long-running, technically difficult, one-shot task. I think the test you describe (collaborative, task delegation, task completion, TTD, steerability) is a great format for a future test that I will definitely try out. |
| |
| ▲ | wongarsu a day ago | parent | next [-] | | Tbf, most of the "real benchmarks" have issues that are just as bad. Assessing LLM performance is just hard | | |
| ▲ | oceansky a day ago | parent [-] | | And personal too. Different engineers are using them for different use cases. |
| |
| ▲ | meander_water a day ago | parent | prev | next [-] | | Thanks, I didn't mean to be brusque, but I have seen a lot of these vibe tests lately that come to grand conclusions like "X model is better than Y" from the result of a single prompt. Appreciate you sharing the results of your tests though! | | | |
| ▲ | ramraj07 a day ago | parent | prev [-] | | The important point is that your benchmark is pretty much irrelevant to actual usage. Thus whatever conclusion you draw is not just irrelevant but misleading. |
|
| ▲ | esperent a day ago | parent | prev | next [-] |
| On the other hand, I did just leave my pi agent running GPT 5.5 overnight on a clearly defined, long running task. It's been running about 10 hours now and it's mostly done. So this kind of use case is also valid. Thinking about it, I would say that the majority of agentic work I do, by a long shot, is subagents which are launched from the main session, using a prompt of its choosing. Those could be considered short versions of these fully autonomous tasks. |
| |
| ▲ | jameswhitford a day ago | parent | next [-] | | Yes, part of the reason I chose the one-shot test was really to test long-running tasks. A lot of people seem to be experimenting with this format, for example in the now trending loop-writing workflows. And really I am interested in diving into the murky waters of these novel workflows. | |
| ▲ | thunspa a day ago | parent | prev [-] | | Care to share more about your pi setup? I've recently started using it (after long-time Claude Code work) and was wondering how you'd achieve these long-running tasks. Do you allow it to spawn sub-agents? Thank you! | | |
| ▲ | esperent a day ago | parent [-] | | My pi usage over the past ~5 months went roughly like this:
* Install pi and a bunch of extensions from their package repo.
* Realize that all the packages (with a few exceptions) are massively overcomplicated and vibe coded.
* Ask pi to rebuild a very simple version of the packages I used. So, e.g., subagents: all the default subagent extensions are massively complicated, with named agents, recursion, and communication. I made one that stripped all that out.
* Then, whenever I hit an annoyance, spin up a parallel session and fix it.
It's less work than it appears because I have ~5 extensions: hooks, subagents, background processes, a custom footer, a loop command... Maybe that's it. Within a couple of days you can have a setup pretty close to Claude Code but with a fraction of the base context use. After gradual improvements over a few weeks/months you'll have a system that is far better, tuned to your exact preferences.
Of course, just like with Linux or any other highly tunable system, it is equally important to have the restraint not to spend all your time tuning it. I've definitely had a couple of days where I was bored with my real work and did that, but whatever, it beats browsing reddit.
As for getting long-running tasks, I set a looping message every ~20m and tell the agent to strictly track progress in a session doc, then reread it and continue after each compaction. | | |
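The "track progress in a session doc, then reread and continue" loop described above could look roughly like the following. This is only a minimal sketch, not the commenter's extension: the `DONE:`/`NEXT:` line format and the names `record_progress`, `last_next_step`, and `loop_message` are all hypothetical, and actually sending the message to the agent every ~20 minutes is left out.

```python
from pathlib import Path
from typing import Optional


def record_progress(doc: Path, done: str, next_step: str) -> None:
    """Append one progress entry to the session doc (hypothetical format)."""
    with doc.open("a", encoding="utf-8") as f:
        f.write(f"DONE: {done}\nNEXT: {next_step}\n")


def last_next_step(doc: Path) -> Optional[str]:
    """Return the most recently recorded next step, or None if there is none."""
    if not doc.exists():
        return None
    lines = doc.read_text(encoding="utf-8").splitlines()
    steps = [ln[len("NEXT: "):] for ln in lines if ln.startswith("NEXT: ")]
    return steps[-1] if steps else None


def loop_message(doc: Path) -> str:
    """Build the text of the periodic reminder (assumed wording)."""
    step = last_next_step(doc)
    hint = f" The last recorded next step was: {step}." if step else ""
    return (
        f"Reread {doc}, continue from where it leaves off, "
        f"and record progress there before stopping.{hint}"
    )
```

Because the doc lives on disk rather than in the conversation, the reminder still points at the right place after a compaction has discarded earlier context.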
| ▲ | ethanpil 21 hours ago | parent | next [-] | | I'd like to study your setup. Would you be willing to share?
Perhaps a github repo of your 5 extensions or even a pastebin if you would be so inclined. I would be grateful to learn more about this by studying from your success... | | |
| ▲ | esperent 7 hours ago | parent [-] | | I might share it at some point but I think it's quite similar to a lot of others out there, except that it's very specific to my personal projects and goals. If I shared it I'd need to spend at least a while cleaning up and improving docs. It's one of the reasons I suggest you study the famous setups (oh my pi, or superhuman skills etc.) and convert them to your personal needs. |
| |
| ▲ | ijidak a day ago | parent | prev [-] | | What type of task are you running for ten hours? Is this a programming task? I've not come across a programming task that would take an LLM ten hours. | | |
| ▲ | esperent 7 hours ago | parent | next [-] | | There are quite a few tasks I've found that work like this, although of course most tasks don't and require a much higher degree of interaction. The prime examples are read-only audits of very large codebases, and that's what I was running overnight. One file per subagent; each subagent writes a report with recommendations. Since it's pi and the subagents have very narrow scope, they ranged between 7-40k context use per subagent. I've found codex maxes out at about 50 concurrent subagents before I start getting rate limited, so the coordinating session is instructed to run them in batches of 50. My subagent extension is set up to make this as efficient as possible: the subagents can share a prefix and suffix prompt, followed by a list of name + specific prompt in JSON format. Overnight it ran ~800 tiny auditors. I then run synthesis on the written audit files, extract bugs, then do another round to find which ones have a common source, group them by priority, etc.
I've cautiously started doing larger tasks that are not just read-only. For example, I was dealing with a large codebase full of lint and type errors, so I sent out waves of workers with clear instructions to fix only obvious/trivial issues and otherwise to append to a todo file for my review. That worked well and cleared a few thousand issues over several hours.
I don't really want to share any other tasks I've worked on this way because it'll draw out the agentic coding sceptics and I'm not interested in defending my workflow. | |
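The shared prefix/suffix prompt and the batches of 50 described above can be sketched as follows. The JSON field names (`prefix`, `suffix`, `tasks`, `name`, `prompt`) and the function names are assumptions for illustration, not the commenter's actual extension format, and launching the subagents themselves is omitted.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def build_spec(prefix: str, suffix: str, tasks: List[Tuple[str, str]]) -> str:
    """Serialize one shared prefix/suffix plus per-subagent (name, prompt) pairs."""
    return json.dumps({
        "prefix": prefix,
        "suffix": suffix,
        "tasks": [{"name": name, "prompt": prompt} for name, prompt in tasks],
    })


def expand_prompts(spec_json: str) -> Dict[str, str]:
    """Rebuild each subagent's full prompt: prefix, specific prompt, suffix."""
    spec = json.loads(spec_json)
    return {
        task["name"]: f'{spec["prefix"]}\n{task["prompt"]}\n{spec["suffix"]}'
        for task in spec["tasks"]
    }


def in_batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive batches of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


if __name__ == "__main__":
    # Hypothetical file names; ~800 one-file audits, 50 at a time.
    files = [f"src/file_{i}.py" for i in range(800)]
    spec = build_spec(
        "You are a read-only auditor.",
        "Write a report with recommendations.",
        [(path, f"Audit {path}.") for path in files],
    )
    prompts = expand_prompts(spec)
    for batch in in_batches(sorted(prompts), 50):
        pass  # dispatch this batch of subagents, then wait before the next
```

Sending the prefix and suffix once and only the short per-file prompt per task keeps the coordinating session's own context small, which is presumably the efficiency the comment refers to.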
| ▲ | nfriedly a day ago | parent | prev [-] | | I'm not the person you asked, but if they're running on their own local hardware, then it might just be a lot slower than what the big providers run their models on. System RAM is a lot cheaper than VRAM, especially if you bought it last year. | | |
|
| ▲ | segmondy a day ago | parent | prev | next [-] |
| A one-shot prompt means you give the model an input and you get an output, done. This was not a one-shot prompt but an agentic task, as shown by the tool calls. |
|
| ▲ | ritzaco a day ago | parent | prev | next [-] |
| Sure, that's why we look at a mix of formal benchmarks, one longer analysis of a side-by-side, and various other people whom we trust to form an opinion, all covered in the article. It's not intended to be a formal benchmark; there are enough of those. |
| |
| ▲ | patates a day ago | parent [-] | | Then maybe you should add that caveat to the article? You make a very strong claim at the end that the hype is mostly real, and making it clear to what extent your claim holds should help the reader. | | |
|
| ▲ | unliftedq a day ago | parent | prev [-] |
| Totally agree, a single one-shot prompt can't prove anything. |