I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often correctly) handled programmatically.

▲

chrismarlow9 3 hours ago | parent | next [-]

100% agreed. use the non-deterministic thing that is right 90% of the time to generate a deterministic thing that is right 100% of the time. one of the key things I add to my prompts is:

- Please consult me when you encounter any ambiguous edge cases

Attaching the AI to production to directly do things with API calls is bad. For me the only use case where the app should do any AI stuff is with reading/categorizing/etc. Basically replacing the "R" in old CRUD apps. If you want to use that same new AI based "R" endpoint to auto fill forms for the "C", "U", and "D" based on a prompt that's cool, but it should never mutate anything for a customer before a human reviews it. Basically CRUD apps are still CRUD apps (and this will always be true), they just have the benefit of having a very intelligent "R" endpoint that can auto complete forms for customers (or your internal tooling/Jenkins pipelines/etc), or suggest (but never invoke) an action.

▲

vishvananda 3 hours ago | parent | prev | next [-]

I think there is a flow in most organizations from:

llm -> prompt -> result

llm -> prompt + prompt encoded as skill -> result

llm -> prompt + deterministic code encoded as skill -> result

I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism in many cases that actually make long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched in between layers of determinism:

deterministic agentic flows -> non-deterministic decision making -> deterministic tools

This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.

▲

evilelectron 2 hours ago | parent | prev | next [-]

This is exactly how I did my last project of automating the generation of an interface library between a server that controls hardware and the mobile app.

The hardware control team delivers a spec as a document and spreadsheet. The mobile team was using that to code the interface library and validating their code against the server. I converted the document to TSV, sent some parts to Claude and have it write a parser for the TSV keeping all the nuances of human written spec. It took more than 150 iterations to get the parser to handle all edge cases and generate an intermediate output as JSON. Then Claude helped me write a code generator using some custom glue on top of Apollo to generate the code that is consumed by the mobile app.

This whole pipeline runs as part of Github actions and calls Claude only when our library validator fails. There is an md file which is sent to Claude on failure as part of the request to figure out what went wrong, propose a solution and create a PR. This is followed by a human review, rework and merge. Total credits consumed to get here < $350.

▲

groovetandon an hour ago | parent | prev | next [-]

This is so true have been working on a project for exactly this principle -

https://www.decisional.com/blog/workflow-automation-should-b...

I think there is a fundamental incentive problem - code + llm + harness is bound to be more efficient but the labs want you to burn tokens so they are not going to tell you to use the code, just burn more tokens. They are asking us to forget about the token cost and reliability for now - model will become better.

This means that most people just believe that their agent should just be able to do anything with the help of some Model fairy dust with prompts + skills.

People need to watch their agents fail in production to be able to come to the right conclusion unfortunately.

▲

VMG 4 hours ago | parent | prev | next [-]

The problem is that often the program runs into some edge case that requires interpretation, at which point one is tempted to let the LLM deal with the edge case, at which point one is tempted to let the LLM deal with the whole loop and let it do the tool calls

	▲	Fishkins 3 hours ago \| parent [-]
		Agreed. I think the approach described here is promising. Most of the workflow is deterministic and includes safeguards, but an LLM is invoked in the one case where it's really useful. https://lethain.com/agents-as-scaffolding/

▲

khasan222 an hour ago | parent | prev | next [-]

Completely agree! People tend to forget we are non deterministic too! Yet we are able to write code fine, and fairly reliably by using tools that can help keep us fairly honest.

I think most problems with ai tend to be around can you deterministically test the thing you are asking it to do?

How many of us would never ever show work, without going to check the thing we just built first?

▲

cluckindan an hour ago | parent [-]

> can you deterministically test the thing you are asking it to do?

Of course: have it write tests first; and run them to check its work.

Works well for refactoring, but greenfield implementations still rely on a spec that is guaranteed to be incomplete, overcomplete and wrong in many ways.

	▲	pishpash an hour ago \| parent [-]
		You can't ask something to check its own work without external reward/penalty. It'll cheat.

▲

nixpulvis 3 hours ago | parent | prev | next [-]

My agents often write themselves scripts. Isn't that effectively what you're asking for? Prompting for scripts can also be a useful time and accuracy tactic when you know it'll be a good fit for it.

▲

falkensmaize 42 minutes ago | parent | next [-]

The problem is that code it spits out on the fly is untested and untrustworthy. Identify the parts of your workflow that could be accomplished with regular code - write and unit test that code, with LLM help if you want, and use the llm as the orchestrator only.

▲

sisve 2 hours ago | parent | prev [-]

Yeah, the problem is that I do not think the agents is good at reusing scripts and stitching it together.At least for me it's recreating to much similar. I hope we will see platforms like windmill.dev find the optimal solution for this. I have not been able to test it enough. But have a platform that gives you some observability out of the box and protect secrets from llm is nice

	▲	reddit_clone 31 minutes ago \| parent [-]
		I noticed that too. Unless you _ask_ for a script, they throw away the scripts they write. They are particularly bad at complex multiline parsing. Writing all sorts of weird/crude python/awk scripts and getting confused in the process. I wish they would use Perl6/Grammer or Haskell/Parsec or similar and write better parsing scripts.

▲

foolserrandboy 4 hours ago | parent | prev | next [-]

yup, the standard way of thinking about agents seems backwards and probably costly. Use LLMs to write scripts, then stick all your scripts in your own looping harness and call out for LLMs for those parts that are too hard to automate with some deterministic validation at the end.

▲

user3939382 2 hours ago | parent | prev [-]

> write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible

Correct. The concept of having probabilistic output with deterministic acceptance “guardrails” is illogical. If the domain resists deterministic modeling such that you’re using an LLM, the guardrails don’t magically gain that capability.