827a 2 hours ago

1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but they seem to have the same problems.

We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.

DrewADesign an hour ago | parent | next [-]

I used to assume they pushed people into the prompt-only workflows because you're paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they're really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don't buy it. I do think it's going to increase productivity enough to disastrously affect the developer job market and pay scale, but I just don't think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don't) a bunch of investors would make them walk the plank.

I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.

cogman10 34 minutes ago | parent | next [-]

> However, I think what they're really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don't buy it.

I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break the work up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.

Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?

pishpash an hour ago | parent | prev [-]

Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.

crsn 3 minutes ago | parent | prev | next [-]

Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript

rdedev 26 minutes ago | parent | prev | next [-]

I had to create a hypothesis-testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and the agent follows a flowchart to run a statistical test and return the answer.

At the time I had access to only 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in the prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.

sharperguy 25 minutes ago | parent | prev | next [-]

So I wonder: could a more powerful agent harness have the agent basically write and execute its own deterministic code, which, when executed, spawns sub-agents for each of the subtasks?

So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.

tonylucas 22 minutes ago | parent [-]

I've been working on an integrated deterministic/agent system for a few months now. It basically runs an AI step to build a plan, which biases toward deterministic steps as much as possible but escalates back to AI when it needs to (for AI-only capabilities or deterministic failures). So effectively (once I perfect it; I'm about 90% there) it can bounce back and forth as needed, with deterministic steps launching AI steps and AI steps launching deterministic steps.
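One way to sketch that bounce-back pattern; the step shape and the escalate-on-failure rule are my guesses at the described design, not the poster's actual system:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    deterministic: Optional[Callable] = None  # preferred, cheap path
    ai_fallback: Optional[Callable] = None    # escalation path


def execute_plan(steps: list) -> dict:
    results = {}
    for step in steps:
        # Bias toward the deterministic implementation...
        if step.deterministic is not None:
            try:
                results[step.name] = step.deterministic(results)
                continue
            except Exception:
                pass  # ...but escalate to the AI step on failure.
        if step.ai_fallback is None:
            raise RuntimeError(f"no way to run step {step.name!r}")
        results[step.name] = step.ai_fallback(results)
    return results
```

Because AI calls only happen for AI-only steps and deterministic failures, most steps cost zero tokens, which matches the token-usage point below.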

Probably not explaining it very well, but I think it's pretty effective at reducing token usage.

woeirua an hour ago | parent | prev | next [-]

I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.

mmis1000 an hour ago | parent | prev | next [-]

> This started breaking down after ~30 files.

Codex's short context combined with its todo-list system somehow helps here, though. Because of the frequent compaction, the model is forced to recheck which todo list items are not done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup and it finished without obvious issues.

swores an hour ago | parent [-]

Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?

dnh44 23 minutes ago | parent [-]

I regularly get Codex to do multi hour tasks with a single prompt; I don't think that's a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to sub agents. For example, a sub agent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and it just manages from a bird's eye view. I do have the $200 plan though.

Joeri an hour ago | parent | prev | next [-]

You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to "run everything"; it just needs to know how to launch the right script.
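As a sketch of the deterministic half of such a skill, assuming the requirements-directory layout from the grandparent comment; the skill's markdown would just tell the agent to run this script and work through the lines it prints:

```python
#!/usr/bin/env python3
"""Hypothetical orchestration script a skill can point the agent at:
enumerates the requirement files and prints one work item per line,
so enumeration is deterministic and the agent only handles the items
it is handed."""
import sys
from pathlib import Path


def work_items(requirements_dir: str) -> list:
    # A sorted glob gives a stable, complete listing with no model
    # judgment involved.
    return [str(p) for p in sorted(Path(requirements_dir).glob("*.md"))]


if __name__ == "__main__" and len(sys.argv) > 1:
    for item in work_items(sys.argv[1]):
        print(item)
```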

sroussey 2 hours ago | parent | prev | next [-]

I’m working on a hybrid system of an old-school task graph and AI agents, letting them instantiate each other. I think others will do that eventually.

tonylucas 20 minutes ago | parent | next [-]

I'm working on something similar (won't link to it, as I don't want people to think I'm spamming), but if you want to compare notes I'm happy to talk.

cluckindan an hour ago | parent | prev [-]

Jira for agents?

pishpash an hour ago | parent | prev [-]

Can you not have it write your harness for you, or have that be the first step? You can push in your own determinism where you need it, surely.

svachalek an hour ago | parent [-]

True. The prompt reads: Run the following Python: ```