godelski 2 days ago

Unironically this can actually be a good idea. Instead of "rerunning," run in parallel. Then pick the best solution.

  Pros:
   - Saved Time!
   - Scalable! 
   - Big Bill?

  Cons:
   - Big Bill
   - AI written code
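
To make it concrete, here's a rough sketch of the fan-out idea. run_agent and score_candidate are made-up stand-ins for whatever agent CLI and test harness you actually use:

  import random
  from concurrent.futures import ThreadPoolExecutor

  def run_agent(prompt: str, run_id: int) -> str:
      # Stand-in: in practice, shell out to your agent CLI and
      # return the branch/worktree it produced.
      return f"candidate-{run_id}"

  def score_candidate(candidate: str) -> float:
      # Stand-in: in practice, run the test suite/linter on the
      # candidate and turn the results into a score.
      return random.random()

  def best_of_n(prompt: str, n: int = 4) -> str:
      # Same prompt, n independent runs in parallel; keep the best-scoring one.
      with ThreadPoolExecutor(max_workers=n) as pool:
          candidates = list(pool.map(lambda i: run_agent(prompt, i), range(n)))
      return max(candidates, key=score_candidate)

  print(best_of_n("fix the flaky test in foo_test.py"))

The Big Bill is real, though: n runs cost roughly n times the tokens.
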
a_bonobo 2 days ago | parent | next [-]

This repo has a pattern where the parallel jobs are given different personalities: https://github.com/tokenbender/agent-guides/blob/main/claude...

stillsut 2 days ago | parent [-]

Interesting. This repo (which I'm building) is doing the same, but instead of just different personalities I'm giving each agent a different CLI agent (aider w/ Gemini, Claude Code, Gemini CLI, etc.). I've got some write-ups here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...

yodsanklai 2 days ago | parent | prev | next [-]

Usually, when you re-run, you change your prompt based on the initial results. You can't just run several tasks in parallel hoping for one of them to complete.

LeafItAlone 2 days ago | parent | next [-]

>You can't just run several tasks in parallel hoping for one of them to complete.

Not only can you, but some providers recommend it and their tools support it, like ChatGPT Codex (the web tool). I can't find where I read it, but I'm pretty sure Anthropic devs said early on that they kick off the same prompt to Claude Code in multiple simultaneous runs.

Personally, I’ve had decent success from this way of working.

yodsanklai 2 days ago | parent [-]

Ok, maybe it helps somewhat. My experience is that when the agent fails or produces crappy code, it's not a matter of the LLM's non-deterministic output but rather that the task is just not suitable or the system prompt didn't provide enough information.

lossolo 2 days ago | parent [-]

Not always; sometimes just a different internal "seed" can produce a different, working solution.

wahnfrieden 2 days ago | parent | prev [-]

Why not?

DANmode 2 days ago | parent | prev [-]

Have you seen human-written code?

withinboredom 2 days ago | parent | next [-]

At least when you tell a human the indentation is wrong, they can fix it on the first try. Last night I watched an AI agent spend 20 minutes trying to fix indentation with sed before I cringed and just fixed it myself.

lonelyasacloud 2 days ago | parent | next [-]

I've seen similar issues with understanding things that are somewhat orthogonal to the main concern being worked on.

My guess is that a context of "main thing + somewhat unrelated thing" is too big a space for the models to perform well in at this point in time.

The practical solution is to remove the need for the model to figure it out each time and instead explicitly tell it as much as possible beforehand in CLAUDE.md.
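
For example, the kind of standing notes I mean (contents purely illustrative):

  # CLAUDE.md
  - Formatting and imports are handled by pre-commit hooks; never hand-edit whitespace.
  - Run the test suite before declaring a task done.
  - The billing module is legacy; don't refactor it unless explicitly asked.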

LeafItAlone 2 days ago | parent | prev | next [-]

Consider yourself lucky if you've never had a co-worker do something along those lines. At this point, seeing a person do something like that wouldn't even faze me.

steve_adams_86 2 days ago | parent | prev | next [-]

With Claude Code you can configure hooks to ensure this is done before results are presented, or just run a linter yourself after accepting changes. If you're using something else, I'd just pull it out and lint it.

aeontech 2 days ago | parent | prev | next [-]

I mean... this is a deterministic task; you can just run it through an autoformatter? Why ask AI to do indentation, of all things?

jvanderbot 2 days ago | parent | next [-]

One time I explained that I was afraid of Tesla Full Self-Driving because, while I was using it, my Tesla accelerated to 45 mph in a parking lot that was parallel to the road and only separated by a curb. The pushback I got was "Why would you use FSD in a parking lot?" Well, "Full", right?

Same here. It's either capable of working unsupervised or not. And if not, you have to start wondering what you're even doing if you're at your keyboard, running tools, editing code that you don't like, etc.

We're still working out the edge cases with these "Full" self-driving editors. It vastly diminishes the usefulness if it's going to spend 20 minutes (and $) on stupid simple things.

godelski 2 days ago | parent | next [-]

  > We're still working out the edge cases
The difficult part is that, like with FSD, it's mostly edge cases.
const_cast 2 days ago | parent [-]

Driving is just mostly edge cases. I've thought about it a lot, and I think automating driving is much harder than automating even air travel.

Sure, the air is 3-dimensional, but driving is too dynamic and volatile. Every single road is different, and you have to rely on heuristics meant for humans.

It's stupid easy for humans to tell what a yellow line is and what a stop sign looks like, but it's not so easy for computers. These are human tools - physical things we look at with our eyes - and they're not easy to measure. The quantities that matter in the air, by contrast, are easy to measure.

On top of the visual heuristics, everything changes all the time, and fast. You look away from the road, look back, and you don't know what you're gonna see. It's why texting and driving is so dangerous.

godelski a day ago | parent [-]

  > I think automating driving is much harder than automating even air travel.
This is a pretty common belief, and well supported, since we've had a high level of automation in aviation for decades. It helps that everything is closely monitored. Three dimensions also provide a lot of benefit, since they make for much lower density. Not to mention that people don't tend to be walking around in the sky.
david38 2 days ago | parent | prev [-]

A parking lot is an excellent use of self-driving.

First, I want to summon my car. Then, when leaving, if I'm in a dense area with lots of shopping, the roads can be a pain: you have to exit right, immediately get into the left lane, then three lanes over into the second of the right-turn-only lanes, etc.

theshrike79 2 days ago | parent | prev [-]

This is why I have a standing order for all LLMs to run goimports on any file they've edited. It'll fix imports and minor nags without the LLM having to waste context on removing a line there or changing ' to " somewhere.

Even better, if you use an LLM tool with hook support, just have a hook run formatters on the file after each edit.
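
With Claude Code, for instance, that's roughly a PostToolUse hook in .claude/settings.json. I'm writing the field names from memory of the docs, so treat the exact shape as an assumption and double-check it before use:

  {
    "hooks": {
      "PostToolUse": [
        {
          "matcher": "Edit|Write",
          "hooks": [
            {
              "type": "command",
              "command": "jq -r '.tool_input.file_path' | xargs goimports -w"
            }
          ]
        }
      ]
    }
  }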

david38 2 days ago | parent | prev [-]

Why the hell would anyone do this instead of using any one of dozens of purpose-written tools that accept configuration files?

They take less than a second to run, can run on every save, and are free.

withinboredom a day ago | parent [-]

My point exactly...

Eggpants 2 days ago | parent | prev | next [-]

I hate to break it to you, but humans wrote the original code that was stolen and used for the training set.

beambot 2 days ago | parent [-]

garbage in, garbage out...

godelski 2 days ago | parent | prev [-]

I've taught undergraduates and graduates how to code. I've contributed to Open Source projects. I'm a researcher and write research code with other people who write research code.

You could say I've seen A LOT of poorly written, human-generated code.

Yet I still trust it more. Why? Well, one of the big reasons is exactly what we're joking about: I can trust a human to iterate. Lack of iteration would be fine if everything were containerized and code operated in an unchanging environment[0]. But in the real world, code needs to be iterated on, constantly. Good code doesn't exist. If it does exist, it doesn't stay good for long.

Another major problem is that AI generates code that optimizes for human preference, not correctness. Even the terrible students who were just doing enough to scrape by weren't trying to mask mistakes[1]; they were still optimizing for correctness, even if it was the bare minimum. I can still walk through that code with the human and we can figure out what went wrong. I can ask the human about the code and tell a lot by their explanation, even if they make mistakes[2]. I can't trust the AI to give an accurate account of even its own code, because it doesn't actually understand. Even the dumb human has a much larger context window. They can see all the code. They can actually talk to me and try to figure out the intent. They will challenge me if I'm wrong! And for the love of god, I'm going to throw them out if they're constantly showering me with praise and telling me how much of a genius I am. I don't want to work with someone who makes me feel like at any moment they're going to start trying to sell me a used car.

There are a lot of reasons, more than I can list here. Do I still prompt LLMs and use them while I write code? Of course. Do I trust them to write code? Fuck no. I know it isn't trivial to see that middle ground if all you do is vibe code, or if you hate writing code so much you just want to outsource it, but there's a lot of room between having an assistant and having AI write the code. Like the OP suggests, someone has got to write that 10-20%. That doesn't mean I've saved 80% of my time; I maybe saved 20%. Pareto is a bitch.

[0] Ever hear of "code rot?"

[1] Well... I'd rightfully dock points if they wrote obfuscated code...

[2] A critical skill of an expert in any subject is the ability to identify other experts. https://xkcd.com/451/

thunky 2 days ago | parent [-]

> Lack of iteration

What makes you think that agents can't iterate?

> I'm going to throw them out if they are just constantly showering me with praise and telling me how much of a genius I am

You can tell the agent to have the persona of an arrogant ass if you prefer it.

godelski 2 days ago | parent | next [-]

  > What makes you think that agents can't iterate?
Please RTFA or RTF top most comment in the thread.

Can they? Yes. Will they, reliably? If they did, why would it be better to restart...

But the real answer to your question: personal experience

thunky 2 days ago | parent [-]

> Please RTFA

TFA says:

Engineers use Claude Code for rapid prototyping by enabling "auto-accept mode" (shift+tab) and setting up autonomous loops in which Claude writes code, runs tests, and iterates continuously.

The tool rapidly prototypes features and iterates on ideas without getting bogged down in implementation details

godelski a day ago | parent [-]

Don't cherry-pick; act in good faith. I know you can also read the top comment I suggested.

I know it's a long article and the top comment is hard to find, so allow me to help:

  > Treat it like a slot machine
  >
  > Save your state before letting Claude work, let it run for 30 minutes, then either accept the result or start fresh rather than trying to wrestle with corrections. ***Starting over often has a higher success rate than trying to fix Claude's mistakes.***
*YOU* might be able to iterate well with Claude, but I really don't think a slot machine is consistent with the type of iteration we're discussing here. You can figure out what things mean in context, or you can keep intentionally misinterpreting. At least the LLM isn't intentionally misinterpreting.
nojito a day ago | parent [-]

That's actually an old workflow. Nowadays you spin up a thin container and let it go wild. If it messes up, you simply destroy the container, undo the git history, and try again.

Takes no time at all.

tayo42 2 days ago | parent | prev [-]

LLMs only work in one direction: they produce the next token. They can't go back and edit. They would need to be able to backtrack and edit in place somehow.

thunky 2 days ago | parent | next [-]

Loops.

Plus, the entire session/task history goes into every LLM prompt, not just the last message. So for every turn of the loop the LLM has the entire context with everything that previously happened in it, along with added "memories" and instructions.
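
Roughly, the loop looks like this from the API side (call_llm is a stand-in for whatever chat-completion endpoint the agent uses):

  messages = [{"role": "system", "content": "You are a coding agent."}]

  def call_llm(history):
      # Stand-in for the real chat API call; it always receives the full history.
      return {"role": "assistant", "content": "(model output)"}

  for turn in range(10):
      reply = call_llm(messages)   # the model sees everything that came before
      messages.append(reply)       # append-only: earlier turns are never rewritten
      tool_output = "test results, lint errors, etc."
      messages.append({"role": "user", "content": tool_output})

Nothing in the history gets mutated; corrections happen by appending new turns.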

DANmode 2 days ago | parent | prev [-]

"Somehow", like caching multiple layers of context, like all the free tools are now doing?

tayo42 2 days ago | parent [-]

That's different than checking whether its current output made a mistake or not. It's not editing in place. You're just rolling the dice again with a different prompt.

thunky 2 days ago | parent [-]

No, the session history is all in the prompt, including the LLM's previous responses.

tayo42 2 days ago | parent [-]

Appending more context to the existing prompt still means it's a different prompt... The text isn't the same.

thunky 2 days ago | parent [-]

I'm not sure what your point is?

Think of it like an append-only journal: to correct an entry, you add a new one with the correction. The LLM sees the mistake and the correction. That's no worse than mutating the history.

tayo42 a day ago | parent [-]

That's not how it works.

You put some more information into its context window, then roll the dice again, and it produces text again, token by token. It's still not planning ahead, and it's not looking back at what was just generated. There's no guarantee everything stays the same except the mistake. This is different than editing in place; you are rolling the dice again.

thunky a day ago | parent [-]

> its not looking back at what was just generated

It is, though. The LLM gets the full history in every prompt until you start a new session. That's why it gets slower as the conversation/context gets big.

The developer could choose to rewrite or edit the history before sending it back to the LLM, but the user typically can't.

> There's no guarantee everything stays the same except the mistake

Sure, but there's no guarantee about anything it will generate. But that's a separate issue.