minimaxir 3 days ago

A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.

This bullet point is funny:

> Treat it like a slot machine

> Save your state before letting Claude work, let it run for 30 minutes, then either accept the result or start fresh rather than trying to wrestle with corrections. Starting over often has a higher success rate than trying to fix Claude's mistakes.

That's easy to say when the employee is not personally paying for the massive amount of compute running Claude Code for half an hour.

throwmeaway222 3 days ago | parent | next [-]

Thanks for the tip - we employees should run and re-run the code generation hundreds of times even if the changes are pretty good. That way, the brass will see a huge bill without many actual commits.

Sorry boss, it looks like we need to hire more software engineers since the AI route still isn't mathing.

mdaniel 3 days ago | parent | next [-]

> we employees should run and re-run the code generation hundreds of times

Well, Anthropic sure thinks that you should. Number go up!

drewvlaz 2 days ago | parent | next [-]

One really has to wonder what their actual margins are though, considering the Claude Code plans vs API pricing

wahnfrieden 2 days ago | parent | prev [-]

It is accurate though. I even run multiple attempts in parallel, which is a strategy that can work with human teams too.

godelski 2 days ago | parent | prev | next [-]

Unironically this can actually be a good idea. Instead of "rerunning," run in parallel, then pick the best solution (a sketch follows the list).

  Pros:
   - Saved Time!
   - Scalable! 
   - Big Bill?

  Cons:
   - Big Bill
   - AI written code
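
A minimal sketch of the parallel version in Go, assuming Claude Code's headless `claude -p` print mode and one git worktree per attempt (the prompt and paths are made up):

    package main

    import (
        "fmt"
        "os/exec"
        "sync"
    )

    func main() {
        prompt := "fix the flaky retry logic in pkg/client" // hypothetical task
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                dir := fmt.Sprintf("../attempt-%d", n)
                // one isolated checkout per attempt so runs can't clobber each other
                exec.Command("git", "worktree", "add", dir).Run()
                cmd := exec.Command("claude", "-p", prompt) // headless "print" mode
                cmd.Dir = dir
                out, _ := cmd.CombinedOutput()
                fmt.Printf("attempt %d:\n%s\n", n, out)
            }(i)
        }
        wg.Wait() // then diff the worktrees and keep the best one
    }
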
a_bonobo 2 days ago | parent | next [-]

This repo has a pattern where the parallel jobs have different personalities: https://github.com/tokenbender/agent-guides/blob/main/claude...

stillsut 2 days ago | parent [-]

Interesting, this repo (which I'm building) is doing the same, but instead of just different personalities, I'm giving each agent a different CLI tool (aider w/ Gemini, Claude Code, Gemini CLI, etc.). I've got some writeups here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...

yodsanklai 2 days ago | parent | prev | next [-]

Usually, when you re-run, you change your prompt based on the initial results. You can't just run several tasks in parallel hoping for one of them to complete.

LeafItAlone 2 days ago | parent | next [-]

>You can't just run several tasks in parallel hoping for one of them to complete.

Not only can you, some providers recommend it and their tools provide it, like ChatGPT Codex (the web tool). Can’t find where I read it but I’m pretty sure Anthropic devs said early on that they kick off the same prompt to Claude Code in multiple simultaneous runs.

Personally, I’ve had decent success from this way of working.

yodsanklai 2 days ago | parent [-]

Ok, maybe it helps somewhat. My experience is that when the agent fails or produces crappy code, it's not a matter of non-deterministic LLM output but rather that the task is just not suitable or the system prompt didn't provide enough information.

lossolo 2 days ago | parent [-]

Not always, sometimes just a different internal "seed" can create a different working solution.

wahnfrieden 2 days ago | parent | prev [-]

Why not?

DANmode 2 days ago | parent | prev [-]

Have you seen human-written code?

withinboredom 2 days ago | parent | next [-]

At least when you tell a human the indentation is wrong, they can fix it on the first try. Watched an AI agent last night try to fix indentation by using sed for 20 minutes before I just fixed it myself after cringing.

lonelyasacloud 2 days ago | parent | next [-]

Have seen similar issues with understanding things that are somewhat orthogonal to the main thing being worked on.

My guess is that context = main thing + somewhat unrelated thing is too big a space for the models to perform well at this point in time.

The practical solution is to remove the need for the model to figure it out each time, and instead explicitly tell it as much as possible beforehand in CLAUDE.md.

LeafItAlone 2 days ago | parent | prev | next [-]

Consider yourself lucky if you’ve never had a co-worker do something along those lines. At this point, seeing a person do something like that wouldn’t even faze me.

steve_adams_86 2 days ago | parent | prev | next [-]

With Claude Code you can configure hooks to ensure this is done before results are presented, or just run a linter yourself after accepting changes. If you're using something else, I'd just pull it out and lint it.
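
For the hook route, a sketch: a PostToolUse hook receives the tool call as JSON on stdin, so a tiny Go helper can pull out the edited path and format it. The payload shape here is my assumption, so verify the field names against the hooks docs:

    package main

    import (
        "encoding/json"
        "os"
        "os/exec"
        "strings"
    )

    // assumed shape of the PostToolUse payload; check the hooks docs
    type hookEvent struct {
        ToolInput struct {
            FilePath string `json:"file_path"`
        } `json:"tool_input"`
    }

    func main() {
        var ev hookEvent
        if err := json.NewDecoder(os.Stdin).Decode(&ev); err != nil || ev.ToolInput.FilePath == "" {
            return // not an edit we care about
        }
        // only touch Go files; swap in prettier, black, etc. for other stacks
        if strings.HasSuffix(ev.ToolInput.FilePath, ".go") {
            exec.Command("gofmt", "-w", ev.ToolInput.FilePath).Run()
        }
    }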

aeontech 2 days ago | parent | prev | next [-]

I mean... this is a deterministic task, you can just run it through an autoformatter. Why ask AI to do indentation, of all things?

jvanderbot 2 days ago | parent | next [-]

One time I explained that I was afraid of Tesla Full Self-Driving, because while using it my Tesla accelerated to 45 mph in a parking lot that was parallel to the road and only separated by a curb. The pushback I got was "Why would you use FSD in a parking lot?" Well, "Full", right?

Same here. It's either capable of working unsupervised or not. And if not, you have to start wondering what you're even doing if you're at your keyboard, running tools, editing code that you don't like, etc.

We're still working out the edge cases with these "Full" self driving editors. It vastly diminishes the usefulness if it's going to spend 20 minutes (and $) on stupid simple things.

godelski 2 days ago | parent | next [-]

  > We're still working out the edge cases
The difficult part is that, like with FSD, it's mostly edge cases.

const_cast 2 days ago | parent [-]

Driving is just mostly edge cases. I've thought about it a lot, and I think automating driving is much harder than automating even air travel.

Sure the air is 3 dimensions, but driving is too dynamic and volatile. Every single road is different, and you have to rely on heuristics meant for humans.

It's stupid easy for humans to tell what is a yellow line and what a stop sign looks like, but it's not so easy for computers. These are human tools - physical things we look at with our eyes - and not easy for machines to measure, whereas measurements in the air are quite easy to take.

On top of the visual heuristics, everything changes all the time, and very fast. You look away from the road and look back, and you don't know what you're gonna see. It's why texting and driving is so dangerous.

godelski a day ago | parent [-]

  > I think automating driving is much harder than automating even air travel.
This is a pretty common belief, and well supported, since we've had a high level of automation in aviation for decades. It helps that everything is closely monitored. Three dimensions also provide a lot of benefits, since they make for much lower density. Not to mention people don't tend to be walking around in the sky.

david38 2 days ago | parent | prev [-]

A parking lot is an excellent use of self-driving.

First, I want to summon my car. Then, when leaving, if I’m in a dense area with lots of shopping, the roads can be a pain: you have to exit right, immediately get into the left lane, then go three lanes over into the second of the right-turn-only lanes, etc.

theshrike79 2 days ago | parent | prev [-]

This is why I have a standing order for all LLMs to run goimports on any file they've edited. It fixes imports and minor nags without the LLM having to waste context on removing a line here or changing ' to " somewhere.

Even better, if you use an LLM with hook support, just have the hook run formatters on the file after each edit.

david38 2 days ago | parent | prev [-]

Why the hell would anyone do this instead of using any one of dozens of purpose-written tools that accept configuration files?

They take less than a second to run, can run on every save, and are free

withinboredom a day ago | parent [-]

My point exactly...

Eggpants 2 days ago | parent | prev | next [-]

I hate to break it to you, but humans wrote the original code that was stolen and used for the training set.

beambot 2 days ago | parent [-]

garbage in, garbage out...

godelski 2 days ago | parent | prev [-]

I've taught undergraduates and graduates how to code. I've contributed to Open Source projects. I'm a researcher and write research code with other people who write research code.

You could say I've seen A LOT of poorly written, human-generated code.

Yet, I still trust it more. Why? Well one of the big reasons is exactly what we're joking about. I can trust a human to iterate. Lack of iteration would be fine if everything were containerized and code operated in an unchanging environment[0]. But in the real world, code needs to be iterated on, constantly. Good code doesn't exist. If it does exist, it doesn't stay good for long.

Another major problem is that AI generates code that optimizes for human preference, not correctness. Even the terrible students who were just doing enough to scrape by weren't trying to mask mistakes[1], but were still optimizing for correctness, even if it was the bare minimum. I can still walk through that code with the human and we can figure out what went wrong. I can ask the human about the code and I can tell a lot by their explanation, even if they make mistakes[2]. I can't trust the AI to tell an accurate account of even its own code because it doesn't actually understand. Even the dumb human has a much larger context window. They can see all the code. They can actually talk to me and try to figure out the intent. They will challenge me if I'm wrong! And for the love of god, I'm going to throw them out if they are just constantly showering me with praise and telling me how much of a genius I am. I don't want to work with someone where I feel like at any moment they're going to start trying to sell me a used car.

There's a lot of reasons, more than I list here. Do I still prompt LLMs and use them while I write code? Of course. Do I trust it to write code? Fuck no. I know it isn't trivial to see that middle ground if all you do is vibe code or hate writing code so much you just want to outsource it, but there's a lot of room here between having some assistant and having AI write code. Like the OP suggests, someone has got to write that 10-20%. That doesn't mean I've saved 80% of my time, I maybe saved 20%. Pareto is a bitch.

[0] Ever hear of "code rot?"

[1] Well... I'd rightfully dock points if they wrote obfuscated code...

[2] A critical skill of an expert in any subject is the ability to identify other experts. https://xkcd.com/451/

thunky 2 days ago | parent [-]

> Lack of iteration

What makes you think that agents can't iterate?

> I'm going to throw them out if they are just constantly showering me with praise and telling me how much of a genius I am

You can tell the agent to have the persona of an arrogant ass if you prefer it.

godelski 2 days ago | parent | next [-]

  > What makes you think that agents can't iterate?
Please RTFA or RTF top most comment in the thread.

Can they? Yes. Will they reliably? If so, why would it be better to restart...

But the real answer to your question: personal experience

thunky 2 days ago | parent [-]

> Please RTFA

TFA says:

> Engineers use Claude Code for rapid prototyping by enabling "auto-accept mode" (shift+tab) and setting up autonomous loops in which Claude writes code, runs tests, and iterates continuously.

> The tool rapidly prototypes features and iterates on ideas without getting bogged down in implementation details

godelski a day ago | parent [-]

Don't cherry-pick, act in good faith. I know you can also read the top comment I suggested.

I know it's a long article and the top comment is hard to find, so allow me to help

  > Treat it like a slot machine
  >
  > Save your state before letting Claude work, let it run for 30 minutes, then either accept the result or start fresh rather than trying to wrestle with corrections. ***Starting over often has a higher success rate than trying to fix Claude's mistakes.***
*YOU* might be able to iterate well with Claude, but I really don't think a slot machine is consistent with the type of iteration we're discussing here. You can figure out what things mean in context, or you can keep intentionally misinterpreting. At least the LLM isn't intentionally misinterpreting.

nojito a day ago | parent [-]

That’s actually an old workflow. Nowadays you spin up a thin container and let it go wild. If it messes up, you just destroy the container, roll back the git history, and try again.

Takes no time at all.

tayo42 2 days ago | parent | prev [-]

LLMs only work in one direction: they produce the next token. They can't go back and edit; they would need to be able to backtrack and edit in place somehow.

thunky 2 days ago | parent | next [-]

Loops.

Plus, the entire session/task history goes into every LLM prompt, not just the last message. So for every turn of the loop the LLM has the entire context with everything that previously happened in it, along with added "memories" and instructions.
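
A toy version of that loop in Go, with a hypothetical callModel standing in for the real API call; the point is that the history only ever grows and the model re-reads all of it every turn:

    package main

    import "fmt"

    type message struct{ role, content string }

    // callModel is a hypothetical stand-in for the real API call;
    // every turn receives the ENTIRE history, not just the last message.
    func callModel(history []message) message {
        return message{"assistant", fmt.Sprintf("(reply after seeing %d messages)", len(history))}
    }

    func main() {
        history := []message{{"system", "You are a coding agent."}}
        for _, user := range []string{"add a retry helper", "the indentation is wrong, fix it"} {
            history = append(history, message{"user", user})
            reply := callModel(history) // sees its own earlier mistakes too
            history = append(history, reply)
            fmt.Println(reply.content)
        }
    }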

DANmode 2 days ago | parent | prev [-]

"Somehow", like caching multiple layers of context, like all the free tools are now doing?

tayo42 2 days ago | parent [-]

That's different than checking whether its current output made a mistake or not. It's not editing in place; you're just rolling the dice again with a different prompt.

thunky 2 days ago | parent [-]

No, the session history is all in the prompt, including the LLM's previous responses.

tayo42 2 days ago | parent [-]

Appending more context to the existing prompt still means it's a different prompt... The text isn't the same.

thunky 2 days ago | parent [-]

I'm not sure what your point is?

Think of it like an append-only journal. To correct an entry you add a new one with the correction. The LLM sees the mistake and the correction. That's no worse than mutating the history.

tayo42 a day ago | parent [-]

That's not how it works.

You put some more information in its context window, then roll the dice again, and it produces text again, token by token. It's still not planning ahead, and it's not looking back at what was just generated. There's no guarantee everything stays the same except the mistake. This is different than editing in place: you are rolling the dice again.

thunky a day ago | parent [-]

> its not looking back at what was just generated

It is, though. The LLM gets the full history in every prompt until you start a new session. That's why it gets slower as the conversation/context gets big.

The developer could choose to rewrite or edit the history before sending it back to the LLM but the user typically can't.

> There's no guarantee everything stays the same except the mistake

Sure, but there's no guarantee about anything it will generate. But that's a separate issue.

gmueckl 2 days ago | parent | prev | next [-]

Data centers are CapEx, employees are OpEx. Building more data centers is cheap. Employees can always supervise more agents...

zer00eyz 2 days ago | parent [-]

Data centers are cap ex

Except the power and cooling demands of the current crop of GPUs mean you are not fitting full density in a rack. There is a real, material increase in fiber use because of your now more distributed equipment (800 Gbps interconnects are NOT cheap).

You can't capitalize power costs: this is now a non-trivial cost to account for. And the more power you use for compute, the more power you have to use for cooling... (Power density is now so high that cooling with something other than air is looking not just attractive but like a requirement.)

Meanwhile the cost of lending right now is high compared to recent decades...

The accounting side of things isn't as pretty as one would like it to be.

gmueckl 2 days ago | parent [-]

That is a much more grounded reply than my comment deserved. Thanks!

Graziano_M 2 days ago | parent | prev [-]

Don’t forget to smash the power looms as well.

aprilthird2021 2 days ago | parent [-]

Is it OK to be a Luddite?

https://archive.nytimes.com/www.nytimes.com/books/97/05/18/r...?

aspenmayer 20 hours ago | parent [-]

Must we choose between Leviathan or oblivion? Is a steady state society possible, or even desirable?

https://en.wikipedia.org/wiki/William_Ophuls#Leviathan_or_ob...

preommr 3 days ago | parent | prev | next [-]

> A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.

I have been pretty successful at using llms for code generation.

I have a simple rule that something is either >90% AI or none at all (excluding inline completions and very obvious text editing).

The model has an inherent understanding of some problems due to its training data (e.g. setting up a web server with little to no deps in golang), which it can do with almost 100% certainty. Those are really easy to blaze through in a few minutes, and then I can set up the architecture for some very flat code flows. This can genuinely improve my output by 30-50%.
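
For a concrete sense of what that "almost 100% certainty" territory looks like, here's the sort of no-deps server (Go stdlib only) that sits squarely in the training data:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // stdlib only, zero deps: exactly the well-trodden setup an
        // LLM reproduces reliably in one shot
        http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "ok")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }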

MPSimmons 2 days ago | parent | next [-]

Agree with your experiences. I've also found that if I build a lightweight skeleton of the structure of the program, it does a much better job. Also, ensuring that it does a full-fledged planning (non-executing) step before starting to change things leads to good results.

I have been using Cline in VSCode, and I've been enjoying it a lot.

randmeerkat 2 days ago | parent | prev [-]

> I have a simple rule that something is either 90%>ai or none at all…

10% is the time it works 100% of the time.

maerch 2 days ago | parent | prev | next [-]

> A repeated trend is that Claude Code only gets 70-80% of the way, which is fine and something I wish was emphasized more by people pushing agents.

Recently, I realized that this applies not only to the first 70-80% of a project but sometimes also to the final 70-80%.

I couldn’t make progress with Claude on a major refactoring from scratch, so I started implementing it myself. Once I had shaped the idea clearly enough but in a very early state, I handed it back to Claude to finish and it worked flawlessly, down to the last CHANGELOG entry, without any further input from me.

I saw this as a form of extensive guardrails or prompting-by-example.

golergka 2 days ago | parent | next [-]

That’s why I like using it and get more fulfilment from coding than before: I do the fun parts. AI does the mundane.

bavell 2 days ago | parent | prev [-]

I need to try this - started using Claude Code a few days ago and have been struggling to get good implementations with some high-complexity refactors. It keeps overengineering and creating more problems than it solves. It's getting close though, and I think your approach would work very well for this scenario!

LeafItAlone 2 days ago | parent | next [-]

The best way I’ve found to interact with it is to treat it like an overly eager junior developer who just read books like Gang of Four and feels the need to prove their worth as a senior. Explain that simplicity matters, that you have an existing pattern to follow, or something even more specific.

As I’ve worked with a number of people like the one described above, the way I’ve worked with them has helped me get better results from LLMs for coding. The difference is that you can help a junior grow over time; LLMs forget once the context is gone (CLAUDE.md helps, but it's not perfect).

theshrike79 2 days ago | parent | prev [-]

Claude has a tendency to reinvent the wheel heavily.

It'll create a massive bespoke class to do something that is already in the stdlib.

But if there's a pattern of already using stdlib functions, it can copy that easily.

lonelyasacloud 2 days ago | parent [-]

Basically, to get the best out of Claude (or any of the other agents), if it is possible to tell it about something ahead of time then it is generally wise to, be that in seed skeleton code, comments, CLAUDE.md, etc.

benreesman 2 days ago | parent | prev | next [-]

The slot machine thing has a pretty compelling corollary: crank the formal systems rigor up as high as you can.

Vibe coding in Python is seductive but ultimately you end up in a bad place with a big bill to show for it.

Vibe coding in Haskell is a "how much money am I willing to pour in per unit of clean, correct, maintainable code" exercise. With GHC cranked up to `-Wall -Werror` and some nasty property tests? Watching Claude Code try to weasel out with a mock goes from infuriating to amusing: bam, unused parameter! Now why would the test suite be demanding that a property holds on an unused parameter...

And Haskell is just an example; TypeScript's type system is in some ways even more powerful, so lots of projects have scope to dabble with what I'm calling "hyper-modern vibe coding": just start putting a bunch of really nasty fast-check properties and generic bounds on stuff and watch Claude Code try to cheat. Your move, Claude Code. I know you want to check off that line on the TODO list like I want to breathe, so what's it gonna be?

I find it usually gives up and does the work you paid for.
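
The same squeeze works in Go with testing/quick: quantify a property over all inputs and a hardcoded result has nowhere to hide. A sketch, where collapseSpaces is a made-up function under test:

    package kata

    import (
        "strings"
        "testing"
        "testing/quick"
    )

    // collapseSpaces is a made-up function under test: squeeze runs
    // of whitespace down to single spaces.
    func collapseSpaces(s string) string {
        return strings.Join(strings.Fields(s), " ")
    }

    // The property must hold for EVERY input, so a stub that ignores
    // its argument fails on the first random case.
    func TestCollapseIdempotent(t *testing.T) {
        prop := func(s string) bool {
            once := collapseSpaces(s)
            return collapseSpaces(once) == once
        }
        if err := quick.Check(prop, nil); err != nil {
            t.Error(err)
        }
    }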

kevinventullo 2 days ago | parent | next [-]

Interesting, I wonder if there is a way to quantify the value of this technique. Like give Claude the same task in Haskell vs. Python and see which one converges correctly first.

2 days ago | parent | prev [-]
[deleted]
AzzyHN 2 days ago | parent | prev | next [-]

Not to mention, if an employee usually wrote pretty good code but maybe 30% of the time wrote something so non-functional it had to be entirely scrapped, they'd be fired.

melagonster 2 days ago | parent [-]

But what if he only wants $20/month?

threatofrain 3 days ago | parent | prev | next [-]

This is an easy calculation for everyone. Think about whether Claude is giving you a sufficient boost in performance, and if not... then it's too expensive. No doubt some people are in some combination of domain, legacy, complexity of codebase, etc., where Claude just doesn't cut it.

TrainedMonkey 2 days ago | parent | prev | next [-]

$200 per month will get you roughly 4-5 hours of non-stop single-threaded usage per day.

A bigger issue here is that the random process is not a good engineering pattern. It's not repeatable, does not drive coherent architecture, and struggles with complex problems. In my experience, problem size correlates inversely with generated code quality. Engineering is a process of divide-and-conquer and there is a good reason people don't use bogo (random) sort in production.

More specifically, if you only look at the final code, you are either spending a lot of time reviewing the code or accepting it with less review scrutiny. Carefully reviewing semi-random diffs seems like a poor use of time... so I suspect the default is less review scrutiny and higher tech debt. Interestingly enough, higher tech debt might be an acceptable tradeoff if you believe that soon Code Assistants will be good enough to burn the tech debt down autonomously or with minimal oversight.

On the other hand, if the code you are writing is not allowed to fail, the stakes change and you can't pick the less review option. I never thought to codify it as a process, but here is what I do to guide the development process:

- Start by stating the problem and asking Claude Code to: analyze the existing code, restate the problem in a structured fashion, scan the codebase for existing patterns solving the problem, brainstorm alternative solutions. An enhancement here could be to have a map / list of the codebase to improve the search.

- Evaluate presented solutions and iterate on the list. Add problem details, provide insight, eliminate the solutions that would not work. A lot of times I have enough context to pick a winner here, but if not, I ask for more details about each solution and their relative pros and cons.

- Ask Claude to provide a detailed plan for the down-selected solution. Carefully review the plan (a significantly faster endeavor compared to reviewing the whole diff). Iterate on the plan as needed; after that, tell Claude to save the plan for comparison after the implementation and then to get cracking.

- Review Claude's report of what was implemented vs. what was initially planned. This step is crucial because Claude will try dumb things to get things working, and I've already done the legwork on making sure we're not doing anything dumb in the previous step. Make changes as needed.

- After implementation, I generally do a pass on the unit tests because Claude is extremely prolific with them. You generally need to let it write unit tests to make sure it is on the right track. Here, I ask it to scan all of the unit tests and identify similar or identical code. After that, I ask for refactor options that most importantly maximize clarity, secondly minimize lines of code, and thirdly minimize diffs. Pick the best ones.

Yes, I accept that the above process takes significantly longer for any single change; however, in my experience, it produces far superior results in a bounded amount of time.

P.S. if you got this far please leave some feedback on how I can improve the flow.

nightshift1 2 days ago | parent | next [-]

I agree with that list. I would also add that you should explicitly ask the LLM to read the whole file at least once before starting edits, because they often have tunnel vision. The project map is auto-generated with a script to avoid reading too many files, but the files to be edited should be fresh in the context imo.

bavell 2 days ago | parent | prev [-]

Very nice, going to try this out tomorrow on some tough refactors Claude has been struggling with!

bdangubic 3 days ago | parent | prev | next [-]

> That's easy to say when the employee is not personally paying for the massive amount of compute running Claude Code for half an hour.

you can do the same for $200/month

tough 3 days ago | parent | next [-]

it has limits too, it only lasted like 1-2 weeks (for me personally at least)

artvandelai 2 days ago | parent | next [-]

The limits are in 5-hour windows. You'd have to work heavily on 2+ projects in that window to hit the limit, using ~500k tokens/min for around 4.5 hours, and even then it'll reset at the next window.

bdangubic 2 days ago | parent | prev [-]

with all due respect, you really need to learn the tools you are using, which includes any usage limits (which are temporal). I run CC in 4 to 8 terminals my entire workday, every workday…

tomlockwood 2 days ago | parent | prev [-]

Yeah sweet, what's the burn rate?

FeepingCreature 2 days ago | parent | prev | next [-]

Yeah my most common aider command sequence is

    > /undo
    > /clear
    > ↑ ↑ ↑ ⏎
jonstewart 3 days ago | parent | prev | next [-]

And just like a slot machine, it seems pretty clear that some people get addicted to it even if it doesn’t make them better off.

2 days ago | parent | prev | next [-]
[deleted]
paulddraper 2 days ago | parent | prev | next [-]

Who is paying?

Should be the same party that's getting the rewards of the productivity gains.

oc1 a day ago | parent | prev | next [-]

Funny thing, their recommendation to save state, when Claude Code still has no ability to restore checkpoints (like Cline has) despite it being requested many times. Who are they kidding?

jordanb 2 days ago | parent | prev [-]

And this is the marketing pitch from the people selling this stuff. ¯\_(ツ)_/¯