827a 2 hours ago
1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but it seems to have the same problems.

We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.
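For readers who want to picture the harness described above, here is a minimal sketch in Python. It is not the poster's actual code: `run_harness`, the `run_agent` callable, the `*.md` layout, and the JSON output format are all assumptions made for illustration. The point is only that the loop, the result list, and the file write live in ordinary code, while the model is invoked once per requirements file.

```python
import json
from pathlib import Path
from typing import Callable


def run_harness(req_dir: Path, run_agent: Callable[[str], str], out_file: Path) -> list[dict]:
    """Test every requirements file exactly once, in a fixed order.

    `run_agent` is a hypothetical callable that sends one requirements
    document to the model and returns its verdict as text.
    """
    results = []
    # Sorting makes the order reproducible; the model never picks what comes next.
    for path in sorted(req_dir.glob("*.md")):
        try:
            verdict = run_agent(path.read_text(encoding="utf-8"))
            results.append({"file": path.name, "status": "ok", "verdict": verdict})
        except Exception as exc:
            # A failure in one file is recorded and does not trigger re-tests of others.
            results.append({"file": path.name, "status": "error", "verdict": str(exc)})
    out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results
```

Because the file list comes from the filesystem rather than from the model's todo list, a missed or triple-tested file would be a bug in the harness, not a sampling accident.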
DrewADesign an hour ago
I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect the developer job market and pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don’t), a bunch of investors would make them walk the plank.

I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
crsn 3 minutes ago
Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript
rdedev 26 minutes ago
I had to create a hypothesis testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and have the agent follow a flowchart to run a statistical test and return the answer. At the time I had access to only 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
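A loop like the one described above can be sketched as a small state machine, where the harness (not the model) decides which flowchart node comes next. Everything here is hypothetical: the node names, the instructions, the yes/no branching rule, and the `ask_agent` callable stand in for whatever the real flowchart and model client looked like.

```python
from typing import Callable, Optional

# Hypothetical flowchart: node name -> (instruction for the agent,
# function that picks the next node from the agent's reply, or None to stop).
FLOWCHART: dict[str, tuple[str, Callable[[str], Optional[str]]]] = {
    "check_normality": (
        "Check whether both samples look normally distributed. Answer 'yes' or 'no'.",
        lambda reply: "t_test" if reply.strip().lower().startswith("yes") else "mann_whitney",
    ),
    "t_test": ("Run a two-sample t-test and report the p-value.", lambda reply: None),
    "mann_whitney": ("Run a Mann-Whitney U test and report the p-value.", lambda reply: None),
}


def run_flowchart(
    query: str,
    ask_agent: Callable[[str], str],
    start: str = "check_normality",
    max_steps: int = 10,
) -> list[tuple[str, str]]:
    """Feed the agent one flowchart step at a time and record each reply."""
    history: list[tuple[str, str]] = []
    node: Optional[str] = start
    for _ in range(max_steps):  # hard cap so a malformed chart cannot loop forever
        if node is None:
            break
        instruction, choose_next = FLOWCHART[node]
        reply = ask_agent(f"Question: {query}\nCurrent step: {instruction}")
        history.append((node, reply))
        node = choose_next(reply)  # branching is decided by code, not by the model
    return history
```

The agent only ever sees the current step, so it cannot skip ahead or reorder the flowchart; the trade-off is that the branching rules have to be expressible in code.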
sharperguy 25 minutes ago
So I wonder if a more powerful agent harness could have the agent basically write and execute its own deterministic code which, when executed, spawns sub agents for each of the subtasks? So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
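As a rough illustration of that idea, the deterministic code an agent might write could be as small as a fan-out function. This is a sketch under assumptions, not an existing harness feature: `fan_out` and the `spawn_subagent` callable are hypothetical names, and a thread pool is just one way to run the subtasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(
    subtasks: list[str],
    spawn_subagent: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run one subagent per subtask and return the replies keyed by subtask.

    `spawn_subagent` is a hypothetical callable that starts a fresh
    subagent on a single subtask and returns its final answer.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the mapping below
        # is stable even though the subagents run concurrently.
        replies = list(pool.map(spawn_subagent, subtasks))
    return dict(zip(subtasks, replies))
```

Once a script like this exists, the orchestrator model's only job is to produce the subtask list; which subagents run, and how many times, is fixed by the code.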
woeirua an hour ago
I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.
mmis1000 an hour ago
> This started breaking down after ~30 files.

Codex's short context and todo list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo list items have not been done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup, and it finished without obvious issues.
Joeri an hour ago
You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”, it just needs to know how to launch the right script.
sroussey 2 hours ago
I’m working on a hybrid system of old-school task graphs and AI agents, letting them instantiate each other. I think others will do that eventually.
pishpash an hour ago
Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.