swframe2 2 days ago

Preventing garbage just requires that you take into account the cognitive limits of the agent. For example ...

1) Don't ask for a large / complex change. Ask for a plan, but ask it to implement the plan in small steps and to test each step before starting the next.

2) For really complex steps, ask the model to write code to visualize the problem and solution.

3) If the model fails on a given step, ask it to add logging to the code, save the logs, run the tests, and then review the logs to determine what went wrong. Do this repeatedly until the step works well (a rough sketch of that kind of instrumentation follows this list).

4) Ask the model to look at your existing code and determine how it was designed before implementing a task. Sometimes the model will put all of the changes in one file even though your code has a cleaner design it isn't taking into account.
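
As a rough sketch of the instrumentation step 3 tends to produce (the function, module, and log file names here are made up purely for illustration):

    # debug_instrumentation.py - illustrative only; the function under
    # suspicion and the log file name are hypothetical.
    import logging

    logging.basicConfig(
        filename="step_debug.log",
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("failing_step")

    def apply_discount(order_total, discount_rate):
        # Log inputs and outputs around the suspect step so the model can
        # read step_debug.log after the test run and explain the failure.
        log.debug("apply_discount in: total=%r rate=%r", order_total, discount_rate)
        result = round(order_total * (1 - discount_rate), 2)
        log.debug("apply_discount out: %r", result)
        return result

After the tests run, you point the model at step_debug.log and ask it to explain the discrepancy before it touches any code.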

I've seen other people blog about their tricks and tips. I still see garbage results, but nowhere near 95% of the time.

rco8786 2 days ago | parent | next [-]

I feel like I do all of this stuff and still end up with unusable code in most cases, and in the cases where I don't, I still usually have to hand-massage it into something usable. Sometimes it gets it right, and it's really cool when it does, but anecdotally, for me, it doesn't seem to be making me any more efficient.

enobrev 2 days ago | parent | next [-]

> it doesn't seem to be making me any more efficient

That's been my experience.

I've been working on a 100% vibe-coded app for a few weeks. API, React-Native frontend, marketing website, CMS, CI/CD - all of it without changing a single line of code myself. Overall, the resulting codebase has been better than I expected before I started. But I would have accomplished everything it has (except for the detailed specs, detailed commit log, and thousands of tests) in about 1/3 of the time.

fourthark 2 days ago | parent | next [-]

How long would it have taken if you had written “the detailed specs, detailed commit log, and thousands of tests”?

veber-alex 2 days ago | parent | next [-]

-1 time, because it would never have happened without AI

enobrev 2 days ago | parent | prev [-]

The specs would not likely have happened at all, since this is a solo project; although this experience has led me to want to write these things out more thoroughly, even for myself. It's impressive how little work I need to put in going this route to have fairly thorough actionable specs for pretty much every major decision I've made through the process.

The commits - some would be detailed, plenty would have been "typo" or "same as last commit, but works this time"

The tests - Probably would have been decent for the API, but not as thorough. Likely non-existent for the UI.

As for time - I agree with the other response - I wouldn't have taken the time.

theshrike79 a day ago | parent | prev [-]

I'm not calling bullshit here, but something smells.

If you really can write a full-ass system like that faster than an LLM, you're either REALLY fucking good at what you do (and an amazing typist), or you're holding the LLM wrong, as they say.

enobrev 10 hours ago | parent [-]

I'm ok on speed. Not 10x or anything, but I've been writing full-stack web apps and websites from scratch for a quarter century.

The issue is getting the LLM to write _reasonably decent_ code without having to read every line and make sure it's not doing anything insane. I've tried a few different methods of prompting, but setting up a claude sub-agent that's doing TDD very explicitly and ensuring that all tests pass after every iteration has been most effective.

My first attempt was so fast, it was mind-bending. I had a "working" App and API running in about a day and a half. And then when I tried to adjust features, it would start changing things all over the place, LOTS of tests were failing, and after a couple prompts, I got to a point where the app was terribly broken. I spent about a day trying to prompt my way out of a seemingly infinite hole. I did a more thorough code review and it was a disaster: Random code styles, tons of half-written and abandoned code, tests that did nothing, //TODOs everywhere, and so, so many tweaks for backwards compatibility - which I did NOT need for a greenfield project

At that point I scrapped the project and adjusted my approach. I broke down the PRD into more thorough documentation for reference. I streamlined the CLAUDE.md files. I compiled a standard method of planning / documenting work to be done. I created sub-agents for planning and implementation. I set up the primary implementation sub-agents to split up the spec into bite-sized portions of work ("30-45 minute tasks").

Now I'm at the opposite side of the spectrum - implementation is dog slow, but I rarely have to read what was actually written. I still review the code at large after the primary tasks are finished (comparing the feature branch against main in my IDE), but for the most part I've been able to ignore the output and rely on my manual app tests and then occasionally switch models (or LLMs) and prompt for a thorough code-review.

jaggederest 2 days ago | parent | prev [-]

The key is prompting. Prompt to within an inch of your life. Treat prompts as source code - edit them in files, use @ notation to bring them into the console. Use Claude to generate its own prompts - https://github.com/wshobson/commands/ and https://github.com/wshobson/agents/ are very handy, they include a prompt-engineer persona.

I'm at the point now where I have to yell at the AI once in a while, but I touch essentially zero code manually, and it's acceptable quality. At one point I stopped and tried to fully refactor a commit that CC had created, but I was only able to make marginal improvements in return for an enormous time commitment. If I had spent that time improving my prompts and running refactoring/cleanup passes in CC, I suspect I would have come out ahead. So I'm deliberately trying not to do that.

I expect at some point on a Friday (last Friday was close) I will get frustrated and go build things manually. But for now it's a cognitive and effort reduction for similar quality. It helps to use the most standard libraries and languages possible, and great tests are a must.

Edit: Also, use the "thinking" commands. think / think hard / think harder / ultrathink are your best friend when attempting complicated changes (of course, if you're attempting complicated changes, don't.)

thayne 2 days ago | parent | next [-]

This works fairly well for well-defined, repetitive tasks. But at least for me, if I have to put that much effort into the prompt, it is likely easier just to write the code myself.

masto 2 days ago | parent | next [-]

Sometimes I spend half an hour writing a prompt and realize that I’ve basically rubber-ducked the problem to the point where I know exactly what I want, so I just write the code myself.

I have been doing my best to give these tools a fair shake, because I want to have an informed opinion (and certainly some fear of being left behind). I find that their utility in a given area is inversely proportional to my skill level. I have rewritten or fixed most of the backend business logic that AI spits out. Even if it’s mostly ok on a first pass, I’ve been doing this gig for decades now and I am pretty good at spotting future technical debt.

On the other hand, I’m consistently impressed by its ability to save me time with UI code. Or maybe it’s not that it saves me time, but it gets me to do more ambitious things. I’d typically just throw stuff on the page with the excuse that I’m not a designer, and hope that eventually I can bring in someone else to make it look better. Now I can tell the robot I want to have drag and drop here and autocomplete there, and a share to flooberflop button, and it’ll do enough of the implementation that even if I have to fix it up, I’m not as intimidated to start.

theshrike79 a day ago | parent [-]

I've had the Corporate Approved CoPilot + Sonnet 4 write a full working React page for me based on a screenshot of a Figma model. (Not even through an MCP)

It even discovered that we have some internal components and used them for it.

Got me from 0 to MVP in less than an hour. Would've easily taken me a full day.

NitpickLawyer 2 days ago | parent | prev | next [-]

I've found it works really well for exploration as well. I'll give it a new library, and ask it to explore the library with "x goal" in mind. It then goes and agents away for a few minutes, and I get a mini-poc that more often than not does what I wanted and can also give me options.

xenobeb 2 days ago | parent | prev [-]

I am certain it has a lot to do with whether or not the problem is in the training data.

I have loved GPT5, but the other day I was trying to implement a rather novel idea for what would be a rather small function, and GPT5 went from a genius to an idiot.

I think HN has devolved into random conversations based on a random % of problems being in the training data or not. People really are having such different experiences with the models based on the novelty of the problems that are being solved.

At this point it is getting boring to read.

rco8786 2 days ago | parent | prev | next [-]

Have you made any attempt to quantify your efficiency/output vs writing the code yourself? I've done all of these things you've mentioned, with varying degrees of success. But also everything you're talking about doing is time consuming and eats away at whatever efficiency gain CC claims to offer.

jaggederest a day ago | parent [-]

Days instead of weeks, basically. Hard to truly quantify, but I'm bloody-minded enough to reimplement things three times to check, and even with foresight the AI is faster.

shaunxcode 2 days ago | parent | prev | next [-]

I am convinced that this comment once read aloud in the cadence of Ginsberg is a work of art!

jaggederest 2 days ago | parent [-]

Now I'm trying to find a text-to-Ginsberg translator. Maybe he's who I sound like in my head.

fragmede 2 days ago | parent | prev [-]

How much voice control have you implemented?

jaggederest a day ago | parent [-]

None, but it's on the list! I'm actually using it to prototype a complete audio-visual tracking and annotation tool, so feeding it back into itself is a logical next step.

nostrademons 2 days ago | parent | prev | next [-]

I've found that an effective tactic for larger, more complex tasks is to tell it "Don't write any code now. I'm going to describe each of the steps of the problem in more detail. The rough outline is going to be 1) Read this input 2) Generate these candidates 3) apply heuristics to score candidates 4) prioritize and rank candidates 5) come up with this data structure reflecting the output 6) write the output back to the DB in this schema". Claude will then go and write a TODO list in the code (and possibly claude.md if you've run /init), and prompt you for the details of each stage. I've even done this for an hour, told Claude "I have to stop now. Generate code for the finished stages and write out comments so you can pick up where you left off next time" and then been able to pick up next time with minimal fuss.

hex4def6 2 days ago | parent | next [-]

FYI: You can force "Plan mode" by pressing shift-tab. That will prevent it from eagerly implementing stuff.

jaggederest 2 days ago | parent [-]

> That will prevent it from eagerly implementing stuff.

In theory. In practice, it's not a very secure sandbox and Claude will happily go around updating files if you insist / the prompt is bad / it goes off on a tangent.

I really should just set up a completely sandboxed VM for it so that I don't care if it goes rm-rf happy.

adastra22 2 days ago | parent [-]

Plan mode disables the tools, so I don't see how it would do that.

A sandboxed devcontainer is worth setting up though. Lets me run it with --dangerously-skip-permissions

faangguyindia 2 days ago | parent | next [-]

How can it plan if it does not have access to file read, search, and bash tools to investigate things? If it has access to bash tools, then it's going to write code via echo or sed.

adastra22 2 days ago | parent [-]

It has file read, search, but not bash AFAIK.

theshrike79 a day ago | parent | prev | next [-]

I've had it do it in plan mode.

Nothing dangerous, but the limits are more like suggestions, as the Pirate code says.

jaggederest 2 days ago | parent | prev [-]

I don't know either but I've seen it write to files in plan mode. Very confusing.

faangguyindia 2 days ago | parent | next [-]

It does not write anything in plan mode; it's documented here that it has only read-only tools available in plan mode: https://docs.anthropic.com/en/docs/claude-code/common-workfl...

But here's the fine print: it has an "exit plan mode" tool, documented here: https://minusx.ai/blog/decoding-claude-code/#appendix

So it can exit plan mode on its own and you wouldn't know!

jaggederest a day ago | parent [-]

Ok, it's done it to me 3 times today, so I don't know what to tell you. I remind it that it's in plan mode and it goes "oh no I shouldn't have modified that file then!"

oxidant 2 days ago | parent | prev | next [-]

I've never seen it write a file in plan mode either.

EnPissant 2 days ago | parent | prev [-]

That's not possible. You are misremembering.

nomoreofthat 2 days ago | parent | next [-]

It’s entirely possible. Claude’s security model for subagents/tasks is incoherent and buggy, far below the standard they set elsewhere in their product, and planning mode can use subagent/tasks for research.

Permission limitations on the root agent have, in many cases, not been propagated to child agents, and they've been able to execute different commands. The documentation is incomplete and unclear, and even to the extent that it is clear, it uses a different syntax with different limitations than the one used to configure permissions for the root agent. When you ask Claude itself to generate agent configurations, as is recommended, it will generate permissions that do not exist anywhere in the documentation and may or may not be valid, but no error is emitted if an invalid permission is set. If you ask it to explain, it gets confused by its own documentation and tells you it doesn't know why it did that. I'm not sure if it's hallucinating or if the agent-generating agent has access to internal details that are not documented anywhere and which the normal agent can't see.

Anthropic is pretty consistently the best in this space in terms of security and product quality. They seem to actually care about doing software engineering properly. (I've personally discovered security bugs in several competing products that are more severe and exploitable than what I'm talking about here.) I have a ton of respect for Anthropic. Unfortunately, when it comes to subagents in Claude Code, they are not living up to the standard they have set.

sshine 2 days ago | parent | prev | next [-]

I've seen it run commands that are naively assumed to be reading files or searching directories.

I.e. not its own tools, but command-line executables.

Its assumptions about these commands, and specifically the way it ran them, were correct.

But I have seen it run commands in plan mode.

laborcontract 2 days ago | parent | prev | next [-]

No, it is possible. I just got it to write files using both Bash and its Write tool while in plan mode right now.

jaggederest a day ago | parent | prev [-]

3 times today. I don't know what to say besides that it often tries to edit files in plan mode for me.

yahoozoo 2 days ago | parent | prev [-]

How does a token predictor “apply heuristics to score candidates”? Is it running a tool, such as a Python script it writes for scoring candidates? If not, isn’t it just pulling some statistically-likely “score” out of its weights rather than actually calculating one?

astrange 2 days ago | parent | next [-]

Token prediction is the interface. The implementation is a universal function approximator communicating through the token weights.

imtringued 2 days ago | parent | prev [-]

You can think of the K(=key) matrix in attention as a neural network where each token is turned into a tiny classifier network with multiple inputs and a single output.

The softmax activation function picks the most promising activations for a given output token.

The V(=value) matrix forms another neural network where each token is turned into a tiny regressor network that accepts the activation as an input and produces multiple outputs, which are summed up to produce an intermediate token that is then fed into the MLP layer.

From this perspective the transformer architecture is building neural networks at runtime.

But there are some pretty obvious limitations here: The LLM operates on tokens, which means it can only operate on what is in the KV-cache/context window. If the candidates are not in the context window, it can't score them.
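
A toy numpy sketch of this view (single attention head, untrained random weights, scaling subtleties glossed over; purely illustrative, not how any production model is implemented) makes that last point concrete:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    d = 8                             # toy embedding width
    tokens = np.random.randn(5, d)    # 5 tokens currently in the context window

    W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

    scores = Q @ K.T / np.sqrt(d)     # each token "classifies" every other token (the K side)
    weights = softmax(scores)         # softmax picks the most promising activations
    out = weights @ V                 # weighted sum of value vectors, fed onward to the MLP layer

    # Anything that was never embedded into `tokens` contributes nothing to
    # `scores` or `out` - i.e. candidates outside the context window can't be scored.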

yahoozoo a day ago | parent [-]

I’m not sure if I’m just misunderstanding or we are talking about two different things. I know at a high level how transformers/LLMs decide its next token in the response it is generating.

My question to the post I replied to was basically: given a coding problem, and a list of possible solutions (candidates), how can a LLM generate a meaningful numerical score for each candidate to then say this one is a better solution than that one?

plaguuuuuu 2 days ago | parent | prev | next [-]

I've been using a few LLMs/agents for a while and I still struggle to get useful output from them.

In order for it not to do useless stuff I need to expend more energy on prompting than writing stuff myself. I find myself getting paranoid about minutia in the prompt, turns of phrase, unintended associations in case it gives shit-tier code because my prompt looked too much like something off experts-exchange or whatever.

What I really want is something like a front-end framework, but for LLM prompting: something that takes away a lot of the fucking about with generalised stuff like prompt structure and defaults to best practices for finding something in code, or designing a new feature, or writing tests...

Mars008 2 days ago | parent [-]

> What I really want is something like a front-end framework but for LLM prompting

It's not simple to even imagine an ideal solution. The more you think about it, the more complicated your solution becomes. A simple solution will be restricted to your use cases. A generic one is either visual or a programming language. I'd like to have a visual constructor, a graph of actions, but it's complicated. A language is more powerful.

dontlaugh 2 days ago | parent | prev | next [-]

At that point, why not just write the code yourself?

lucasyvas 2 days ago | parent | next [-]

I reached this conclusion pretty quickly. With all the hand-holding I can write it faster myself - and that's not bragging; almost anyone experienced here could do the same.

Writing the code is the fast and easy part once you know what you want to do. I use AI as a rubber duck to shorten that cycle, then write it myself.

jprokay13 2 days ago | parent | next [-]

I am coming back to this. I’ve been using Claude pretty hard at work and for personal projects, but the longer I do it, the more disappointed I become with the quality of output for anything bigger than a script. I do love planning things out and clarifying my thoughts. It’s a turbocharged rubber duck - but it’s not a great engineer

searene 2 days ago | parent | next [-]

Me too. I’ve been playing with various coding agents such as Cursor, Claude Code, and GitHub Copilot for some time, and I would say that their most useful feature is educating me. For example, they can teach me a library I haven’t used before, or help me debug a production issue. Then I would choose to write the code by myself after I’ve figured everything out with their help. Writing code by myself is definitely faster in most cases.

bootsmann 2 days ago | parent [-]

> For example, they can teach me a library I haven’t used before.

How do you verify it is teaching you the correct thing if you don't have any baseline to compare it to?

searene a day ago | parent [-]

You are right, I don't have any baseline. I just try it and see if it works. One good thing about the software field is that I can compile and run the code for verification. It may not be optimal, but at least it's testable.

bcrosby95 2 days ago | parent | prev | next [-]

My thoughts on scripts are: the output is pretty bad too, but it doesn't matter as much in a script, because it's just a short script, and all that really matters is that it kinda works.

utyop22 2 days ago | parent | prev [-]

What you're describing is a glorified mirror.

Doesn't that sound ridiculous to you?

interstice 2 days ago | parent | next [-]

That's what rubber ducking is

utyop22 2 days ago | parent [-]

It sounds better when you get more specific about what it is. Many people have fallen prey to this and gone a tad loopy.

jprokay13 2 days ago | parent | prev [-]

I am still working on tweaking how I work and design with Claude to hopefully unlock a level of output that I’m happy with.

Admittedly, part of it is my own desire for code that looks a certain way, not just that which solves the problem.

2muchcoffeeman 2 days ago | parent | prev | next [-]

I’ve been trapped in a hole of “can I get the agent to do this?” And the change would have taken me 1/10th the time.

Choosing which battles to pick is part of the skill at the moment.

I use AI for a lot of boilerplate, tedious tasks I can't quite do a vim recording for, and small targeted scripts.

skydhash 2 days ago | parent | next [-]

How much of this boilerplate do you actually have to write? Any script or complicated command that I had to write was worth recording in some bash alias or preserving somewhere. But they mostly live in my bash history or right next to the project.

The boilerplate argument is becoming quite old.

indiosmo 2 days ago | parent | next [-]

One recent example of boilerplate for me: I've been writing dbt models, and I get it to write the schema.yml file for me based on the SQL.

It’s basically just a translation, but with dozens of tables, each with dozens of columns it gets tedious pretty fast.

If given other files from the project as context it’s also pretty good at generating the table and column descriptions for documentation, which I would probably just not write at all if doing it by hand.

2muchcoffeeman 2 days ago | parent | prev [-]

I’m doing a lot of upgrades to neglected projects at the moment and I often need to do the same config over and over to multiple projects. I guess I could write a script, or get AI to write a script, but there’s no standard between projects. So I need the same thing over and over but from slightly different starting points.

I think you need to imagine all the things you could be doing with LLMs.

For me the biggest thing is so many tedious things are now unlocked. Refactors that are just slightly beyond the IDE, checking your config (the number of typos it’s picked up that could take me hours because eyes can be stupid), data processing that’s similar to what you have done before but different enough to be annoying.

shortstuffsushi 2 days ago | parent | prev [-]

A similar, non-LLM battle is a global find and replace, but _not quite identical_ everywhere. Do I just go through the 20 files and do it myself, or try to get clever with regex? Which is ultimately faster...

baq 2 days ago | parent | next [-]

I've just had to do exactly this; a one-line prompt and one example was the difference between mind-numbing work and a comfortable cup of coffee away from the monitor.

2muchcoffeeman 2 days ago | parent | prev [-]

In this case an LLM is probably the answer. I've done this exact thing. No messing with regex or manual work. Type a sentence and examine the result in a diff.

catdog 2 days ago | parent | prev [-]

Writing the code in the grand scheme of things isn't the hard part in software development. The hard parts are architecture and actually building the right thing, something an LLM can't really help you with.

It's not AI; there is no intelligence. A language model, as the name says, deals with language. Current ones are surprisingly good at it, but it's still not more than that.

cpursley 2 days ago | parent [-]

What? Leading edge LLMs are great at architecture, schema design and that sort of thing if you give them enough context and are not working on anything too esoteric. I’d argue they are better at this than the actual coding part.

kyleee 2 days ago | parent | prev | next [-]

Partly it seems to be less taxing for the human to deliver the same amount of work. I find I can chat with Claude, etc., and work more. Which is obviously a double-edged sword when it comes to work/life balance. But I am also less mentally exhausted from the day job and able to enjoy programming and side projects again.

nicoburns 2 days ago | parent [-]

I guess each to their own? I can easily end up coding for 16 hours straight (having a great time) if I'm not careful. I can't imagine I'd have as much patience with an AI.

KerrAvon 2 days ago | parent [-]

I wonder if this is an introvert vs extrovert thing. Chatting with the AI seems like at least as much work as coding to me (introvert). The folks who don't may be extroverts?

halfcat 2 days ago | parent | next [-]

There is some line here. I don’t know if it’s introvert/extrovert but here are my observations.

I’ve noticed colleagues who enjoy Claude code are more interested in “just ship it!” (and anecdotally are more extroverted than myself).

I find Claude Code to be oddly unsatisfying. Still trying to put my finger on it, but I think it's that I quickly lose context. Even if I understand the changes CC makes, it's not the same as wrestling with a problem, hitting roadblocks, and overcoming them. With CC I have no bearing on whether I'm in an area of code with lots of room for error, or if I'm standing on the edge of a cliff and can't cross some line in the design.

I’m way more concerned with understanding the design and avoiding future pain than my “ship it” colleagues (and anecdotally am way more introverted). I see what they build and, yes, it’s working, for now, but the table relationships aren’t right and this is going to have to be rebuilt later, except now it’s feeding a downstream report that’s being consumed by the business, so the beta version is now production. But the 20 other things this app touches indirectly weren’t part of the vibe coding context, so the design obviously doesn’t account for that. It could, but of course the “ship it” folks aren’t the ones that are going to build out lengthy requirements and scopes of work and document how a dozen systems relate to and interact with each other.

I guess I’m seeing that the speed limit of quality is still the speed of my understanding, and (maybe more importantly) that my weaponizing of my own obsession only works when I’m wrestling and overcoming, not just generating code as fast as possible.

I do wonder about the weaponized obsession. People will draw or play music obsessively, something about the intrinsic motivation of mastery, and having AI create the same drawing, or music, isn’t the same in terms of interest or engagement.

dpkirchner 2 days ago | parent | prev [-]

I don't feel like I need to say too much to the agent to get my work done. I'm pretty dang introverted.

I just don't enjoy the work as much as I did when I was younger. Now I want to get things done and then spend the day on other, more enjoyable (to me) stuff.

harrall 2 days ago | parent | prev | next [-]

I don’t do much of the deep prompting stuff but I find AI can write some code faster than I can and accurately most of the time. You just need to learn what those things are.

But I can’t tell you any useful tips or tricks to be honest. It’s like trying to teach a new driver the intuition of knowing when to brake or go when a traffic light turns yellow. There’s like nothing you can really say that will be that helpful.

fragmede a day ago | parent [-]

That there's really "nothing you can say to help someone develop yellow light stop/go intuition" is just wrong. There are guiding principles that give structure, even if you can’t compress the whole skill into a single rule.

Sure, some skills are more about practice, not rules, but hopefully you're not a driving instructor.

utyop22 2 days ago | parent | prev [-]

I'm finding what's happening right now kinda bizarre.

The funny thing is - we need less. Less of everything. But an up-tick in quality.

This seems to happen with humans with everything - the gates get opened, enabling a flood of producers to come in. But this causes a mountain of slop to form, and over time the tastes of folks get eroded away.

Engineers don't need to write more lines of code / faster - they need to get better at interfacing with other folks in the business organisation and get better at project selection and making better choices over how to allocate their time. Writing lines of code is a tiny part of what it takes to get great products to market and to grow/sustain market share etc.

But hey, good luck with that - one's thinking power is diminished over time by interfacing with LLMs etc.

mumbisChungo 2 days ago | parent [-]

>one's thinking power is diminished over time by interfacing with LLMs etc.

Sometimes I reflect on how much more efficiently I can learn (and thus create) new things because of these technologies, then get anxiety when I project that to everyone else being similarly more capable.

Then I read comments like this and remember that most people don't even want to try.

utyop22 2 days ago | parent [-]

And? Go create more stuff.

Come back and post here when you have built something that has commercial success.

Show us all how it's done.

Until then go away - more noise doesn't help.

mumbisChungo 2 days ago | parent [-]

I don't think there's anything I could tell you about the companies I've built that would dissuade you from your perspective that everyone is as intellectually lazy as your projection suggests.

skydhash 2 days ago | parent | next [-]

Not GP, but I really want to know how your process is better than anyone else's. People have produced quite good software (as in, software that solves problems) on CPUs less powerful than what's in my smart plug - software whose principles are still defining today's world.

mumbisChungo 2 days ago | parent [-]

I just find that I learn faster by interrogating (or being interrogated by) a lossy encyclopedia than I do by reading textbooks or stackoverflow.

I'm still the one doing the doing after the learning is complete.

utyop22 2 days ago | parent | prev [-]

[flagged]

MangoCoffee 2 days ago | parent | prev | next [-]

I've been vibe coding a couple of personal projects. I've found that test-driven development fits very well with vibe coding, and it's just as you said: break up the problem into small, testable chunks, get the AI to write unit tests first, and then implement the actual code.
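
As a rough sketch of the shape that takes (pytest, with a hypothetical slugify helper standing in for one small testable chunk): the tests get written and agreed on first, and the AI's only job is to make them pass.

    # test_slugify.py - written before the implementation exists; the
    # `slugify` module below is hypothetical, just one "small testable chunk".
    import pytest
    from slugify import slugify

    def test_lowercases_and_hyphenates():
        assert slugify("Hello World") == "hello-world"

    def test_strips_punctuation():
        assert slugify("Rock & Roll!") == "rock-roll"

    def test_rejects_empty_input():
        with pytest.raises(ValueError):
            slugify("")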

yodsanklai 2 days ago | parent | next [-]

Actually, all good engineering principles which reduce cognitive load for humans work for AI as well.

BoiledCabbage 2 days ago | parent | next [-]

This is what's so funny about this. In some alternative universe I hope that LLMs never get any better, because they force so many good practices.

They are the single closest thing we've ever had to an objective evaluation of whether an engineering practice is better or worse, simply because just about every engineering practice I see that makes coding agents work well also makes humans work well.

And so many of these circular debates and other best practices (TDD, static typing, keeping todo lists, working in smaller pieces, testing independently before testing together, clearly defined codebase practices, ...) have all been settled in my mind.

The most controversial take, and the one I dislike but may reluctantly have to agree with, is "Is it better for a business to use a popular language less suited for the task than a less popular language more suited for it?" While obviously it's a sliding scale, coding agents clearly weigh in on one side of this debate... as little as I like seeing it.

shortstuffsushi 2 days ago | parent | next [-]

While a lot of these ideas are touted as "good for the org," in the case of LLMs it's more like guard rails against something that can't reason things out. That doesn't mean that the practices are bad, but I would much prefer that these LLMs (or some better mechanism) everyone is being pushed to use could actually reason, remember, and improve, so that this sort of guarding wouldn't be a requirement for correct code.

kaffekaka 2 days ago | parent [-]

The things GP listed are fundamentally good practices. If LLMs get so good they don't need even these guardrails, OK, great - but that is a long way off, and until then I am really happy if the outcome of AI-assisted coding is that we humans get better at using these ideas ourselves.

kaffekaka 2 days ago | parent | prev [-]

Well put, I like this perspective.

colordrops 2 days ago | parent | prev [-]

This is the big secret. Keep code modular, small, single-purpose, and encapsulated, and it works great with vibe coding. I want to write a protocol/meta-language, similar to the markdown docs that Claude et al. create but per module, that defines behavior, so you actually program and compose modules with well-defined interfaces in natural language. I'm surprised someone hasn't done it already.

adastra22 2 days ago | parent | next [-]

My set of Claude agent files has an explicit set of interface definitions. Is that what you're talking about?

colordrops 2 days ago | parent [-]

Are Claude agent files per module? If so, then I guess so.

adastra22 2 days ago | parent [-]

Per source code module?

drzaiusx11 2 days ago | parent | prev [-]

Isn't what you're describing exactly what Kiro aims to solve?

colordrops 2 days ago | parent [-]

Possibly, I've never heard of Kiro, will look into it.

alexsmirnov 2 days ago | parent | prev | next [-]

TDD is exactly what I've been unable to get from AI tools, probably because training sets always have both code and tests. I tried multiple models from all major providers, and all failed to create tests without seeing the code. One workflow that helps is to create a dirty implementation and generate tests for it, then throw away the first code and use a different model for the final implementation.

The best way is to create tests yourself, and block any attempts to modify them

MarkMarine 2 days ago | parent | prev [-]

Works great until it’s stuck and it starts just refactoring the tests to say true == true and calling it a day. I want the inverse of black box testing, like the inside of the box has the model in it with the code and it’s not allowed to reach outside the box and change the grades. Then I can just do the Ralph Wiggum as a software engineer loop to get over the reward hacking tendencies

8n4vidtmkvmk 2 days ago | parent [-]

Don't let it touch the test file then? I usually give the LLM context about what it's allowed to touch. I don't do big sweeping changes, though - I don't trust the LLM for that. For small, focused changes it's great.

jason_zig 2 days ago | parent | prev | next [-]

I've seen people post this same advice and I agree with you that it works but you would think they would absorb this common strategy and integrate it as part of the underlying product at this point...

noosphr 2 days ago | parent | next [-]

The people who build the models don't understand how to use the models. It's like asking people who design CPUs to build data-centers.

I've interviewed with three tier one AI labs and _no-one_ I talked to had any idea where the business value of their models came in.

Meanwhile Chinese labs are releasing open source models that do what you need. At this point I've built local agentic tools that are better than anything Claude and OAI have as paid offerings, including the $2,000 tier.

Of course they cost between a few dollars to a few hundred dollars per query so until hardware gets better they will stay happily behind corporate moats and be used by the people blessed to burn money like paper.

criemen 2 days ago | parent | next [-]

> The people who build the models don't understand how to use the models. It's like asking people who design CPUs to build data-centers.

This doesn't match the sentiment on hackernews and elsewhere that claude code is the superior agentic coding tool, as it's developed by one of the AI labs, instead of a developer tool company.

noosphr 2 days ago | parent [-]

Claude Code is baby's first agentic tool.

You don't see better ones from code tooling companies because the economics don't work out. No one is going to pay $1,000 for a two-line change on a 500,000-line code base after waiting four hours.

LLMs today are the equivalent of a 4-bit ALU without memory being sold as a fully functional personal computer. And like ALUs today, you will need _thousands_ of LLMs to get anything useful done; also, like ALUs in 1950, we're a long way off from a personal computer being possible.

fragmede a day ago | parent [-]

That's $500k/yr, and I guarantee there's a non-zero number of humans out there doing exactly that and getting paid that much. Of course we know that lines of code is a dumbass metric, and the problem with large, mature codebases is that precisely because they're so large and mature, making changes is very difficult - especially when trying to fix hairy customer bugs in code that has a lot of interactions.

Barbing 2 days ago | parent | prev [-]

Very interesting. And plausible.

Doesn't specifically seem to jibe with the claim Anthropic made that they were worried about Claude Code being their secret sauce, leaving them unsure whether to publicly release it. (I know some are skeptical about that claim.)

nostrademons 2 days ago | parent | prev | next [-]

A lot of it is integrated into the product at this point. If you have a particularly tricky bug, you can just tell Claude "I have this bug. I expected output 'foo' and got output 'bar'. What went wrong?" It will inspect the code and sometimes suggest a fix. If you run it and it still doesn't work, you can say "Nope, still not working", and Claude will add debug output to the whole program, tell you to run it again, and paste the debug output back into the console. Then it will use your example to write tests, and run against them.

tombot 2 days ago | parent | prev [-]

Claude Code at least now lets you use its best model for planning mode and its cheapest model for coding mode.

candiddevmike 2 days ago | parent [-]

The consulting world parallels here are funny

baq 2 days ago | parent [-]

Humans are agents after all

com2kid 2 days ago | parent | prev | next [-]

> 1) Don't ask for large / complex change. Ask for a plan but ask it to implement the plan in small steps and ask the model to test each step before starting the next.

I asked Claude Code to read a variable from a .env file.

It proceeded to write a .env parser from scratch.

I then asked it to just use Node's built in .env file parsing....

This was the 2nd time in the same session that it wrote a .env file parser from scratch. :/

Claude Code is amazing, but it'll go off and do stupid things even for simple requests.

NitpickLawyer 2 days ago | parent | next [-]

Check your settings; it might be unable to read .env files as a guardrail.

com2kid a day ago | parent [-]

I just reminded it to use the built in .env support and it did the right thing.

If you ignore that I had to pay for its initial failure...

theshrike79 a day ago | parent | prev [-]

It doesn't say no.

For me it built a full-ass YAML parser when it couldn't use Viper to parse the configuration correctly :)

It was a fully vibe-coded project (I like playing stupid and seeing what the LLM does), but it got caught when the config got a bit more complex and its shitty regex-yaml-parser didn't work anymore. :)

MikeTheGreat 2 days ago | parent | prev | next [-]

Genuine question: What do you mean by "ask it to implement the plan in small steps"?

One option is to write "Please implement this change in small steps?" more-or-less exactly

Another option is to figure out the steps and then ask it "Please figure this out in small steps. The first step is to add code to the parser so that it handles the first new XML element I'm interested in, please do this by making the change X, we'll get to Y and Z later"

I'm sure there's other options, too.

Benjammer 2 days ago | parent | next [-]

My method is that I work together with the LLM to figure out the step-by-step plan.

I give an outline of what I want to do, and give some breadcrumbs for any relevant existing files that are related in some way, ask it to figure out context for my change and to write up a summary of the full scope of the change we're making, including an index of file paths to all relevant files with a very concise blurb about what each file does/contains, and then also to produce a step-by-step plan at the end. I generally always have to tell it to NOT think about this like a traditional engineering team plan, this is a senior engineer and LLM code agent working together, think only about technical architecture, otherwise you get "phase 1 (1-2 weeks), phase 2 (2-4 weeks), step a (4-8 hours)" sort of nonsense timelines in your plan. Then I review the steps myself to make sure they are coherent and make sense, and I poke and prod the LLM to fix anything that seems weird, either fixing context or directions or whatever. Then I feed the entire document to another clean context window (or two or three) and ask it to "evaluate this plan for cohesiveness and coherency, tell me if it's ready for engineering or if there's anything underspecified or unclear" and iterate on that like 1-3 times until I run a fresh context window and it says "This plan looks great, it's well crafted, organized, etc...." and doesn't give feedback. Then I go to a fresh context window and tell it "Review the document @MY_PLAN.md thoroughly and begin implementation of step 1, stop after step 1 before doing step 2" and I start working through the steps with it.

lkjdsklf 2 days ago | parent [-]

The problem is, by the time you’ve gone through the process of making a granular plan and all that, you’ve lost all productivity gains of using the agent.

As an engineer, especially as you get more experience, you can kind of visualize the plan for a change very quickly and flesh out the next step while implementing the current step

All you have really accomplished with the kind of process described is to make the world's least precise, most verbose programming language.

Benjammer 2 days ago | parent | next [-]

I'm not sure how much experience you have, and I'm not trying to make assumptions, but I've been working in software over 15 years. The exact skill you mentioned - being able to visualize the plan for a change quickly - is what makes my LLM usage so powerful, imo.

I can say the right precise wording in my prompt to guide it to a good plan very quickly. As the other commenter mentioned, the entire above process only takes something like 30-120 minutes depending on scope, and then I can generate code in a few minutes that would take 2-6 weeks to write myself, working 8 hr days. Then it takes something like 0.5-1.5 days to work out all the bugs, clean up the weird AI quirks, and maybe have the LLM write some playwright tests (or whatever testing framework you use) as integration tests to verify its own work.

So yes, it takes significant time to plan things well for good results, and yes, the results are often sloppy in some parts and have weird quirks that no human engineer would make on purpose. But if you stick to working on prompt/context engineering and getting better and faster at the above process, the key unlock is not that it just does the same coding for you, with it generating the code instead. It's that you can work as a solo developer at the abstraction level of a small startup company.

I can design and implement an enterprise-grade SSO auth system over a weekend that integrates with Okta and passes security testing. I can take a library written in one language and fully re-implement it in another language in a matter of hours. I recently took the native Android and iOS libraries for a fairly large, non-trivial SDK and had Claude build me a React Native wrapper library with native modules that integrates both native libraries and presents a clean, unified interface and TypeScript types to the React Native layer. This took me about two days, plus one more for validation testing. I have never done this before. I have no idea how "Nitro Modules" work, or how to configure a React Native library from scratch. But given the immense scaffolding abilities of LLMs, plus my debugging/hacking skills, I can get to a really confident place really quickly, and I regularly ship production code at work with this process.

adastra22 2 days ago | parent | prev [-]

It takes maybe 30min and then it can go off and generate code that would take literal weeks for me to write. There are still huge productivity gains being had.

lkjdsklf 2 days ago | parent [-]

That has not been my experience at all.

It takes 30-40 minutes to generate a plan and it generates code that would have taken 20-30 minutes to write.

When it’s generating “weeks” worth of code, it inevitably goes off the rails and the crap you get goes in the garbage.

This isn’t to say agents don’t have their uses, but i have not seen this specific problem actually work. They’re great for refactoring (usually) and crapping out proof of concepts and debugging specific problems. It’s also great for exploring a new code base where you have little prior knowledge.

It makes sense that it sucks at generating large amounts of code that fit cohesively into the project. The context is too small. My code base is millions of lines of code. My brain has a shitload more of that in context than any of the models. So they have to guess and check, and they end up incorrect and poor, and I don't. I know which abstractions exist that I can use. It doesn't. Sometimes it guesses right. Oftentimes it doesn't. And once it's wrong, it's fucked for the entire rest of the session, so you just have to start over.

adastra22 2 days ago | parent [-]

Works for me. Not vanilla Claude Code, though - you need to put some work into generating slash commands and workflows that keep it on task and catch the bad stuff.

Take this for example: https://www.reddit.com/r/ClaudeAI/comments/1m7zlot/how_planm...

This trick is just the basic stuff, but it works really well. You can add on and customize from there. I have a “/task” slash command that will run a full development cycle with agents generating code, many more (12-20) agent critics analyzing the unstaged work, all orchestrated by a planning agent that breaks the complex task into small atomic steps.

The first stage of this process (generating the plan) is interactive. It can then go off and produce 10k LOC spread over a dozen commits, and the quality is good enough to ship, most of the time. If it goes off the rails, keep the plan document but nuke the commits and restart. On the Claude MAX plan this costs nothing.

This is how I do all my development now. I spend my time diagnosing agent failures and fixing my workflows, not guiding the agent anymore (other than the initial plan document).

I still review every line of code before pushing changes.

conception 2 days ago | parent | prev | next [-]

I tell it to generate a todo.md file with hyper-atomic todos, each requiring 20 LOC or less, then have it go through that. If the change is too big, I have it generate phases (5-25) and then do the todos for each phase. That plus some sort of reference docs / high-level plan keeps it going along all right.

ants_everywhere 2 days ago | parent | prev [-]

What I do is treat a step as roughly a reviewable commit.

So I'll say something like "evaluate the URL fetcher library for best practices, security, performance, and test coverage. Write this up in a markdown file. Add a design for single-flighting and a retry policy. Break this down into steps so simple even the dumbest LLM won't get confused."

Then I clear the context window and spawn workers to do the implementation.

ants_everywhere 2 days ago | parent | prev | next [-]

IMO by far the best improvement would be to make it easier to force the agent to use a success criterion.

Right now it's not easy to prompt Claude Code (for example) to keep fixing until a test suite passes. It always does some fixed amount of work, until it feels it's most of the way there, and stops. So I have to babysit it, repeatedly telling it that yes, I really do mean for it to make the tests pass.
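
A crude workaround is to hold the success criterion in an outer loop yourself. A sketch, assuming the claude CLI's non-interactive -p/--print mode and pytest as the gate (the round count and prompt wording are arbitrary):

    # fix_until_green.py - babysitting loop sketch; assumes the `claude` CLI
    # supports a non-interactive -p/--print mode and that pytest is the gate.
    import subprocess

    MAX_ROUNDS = 5

    for attempt in range(1, MAX_ROUNDS + 1):
        tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if tests.returncode == 0:
            print(f"tests green after {attempt - 1} fix round(s)")
            break
        prompt = (
            "The test suite is failing. Output:\n"
            + tests.stdout + tests.stderr
            + "\nFix the code (not the tests) until pytest passes."
        )
        subprocess.run(["claude", "-p", prompt])
    else:
        print(f"still failing after {MAX_ROUNDS} rounds")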

adastra22 2 days ago | parent | prev | next [-]

This is why the job market for new grads and early-career folks has dried up. A seasoned developer knows that this is how you manage work in general, and just treats the AI like they would a junior developer - and gets good results.

CuriouslyC 2 days ago | parent [-]

Why bother handing stuff to a junior when an agent will do it faster while asking fewer questions? Even if the first-draft code isn't amazing, you can just quality-gate with an LLM reviewer that has been instructed to be brutal, and do a manual pass when the code gets by the LLM reviewer.

LtWorf 2 days ago | parent [-]

Because juniors learn while LLMs don't and you must explain the same thing over and over forever.

adastra22 2 days ago | parent [-]

If you are explaining things more than once, you are doing it wrong. Which is not on you as the tools currently suck big time. But it is quite possible to have LLM agents “learn” by intelligently matching context (including historical lessons learned) to conversation.

paulcole 2 days ago | parent | prev | next [-]

> Ask for a plan but ask it to implement the plan in small steps and ask the model to test each step before starting the next.

Tried this on a developer I worked with once and he just scoffed at me and pushed to prod on a Friday.

NitpickLawyer 2 days ago | parent [-]

> scoffed at me and pushed to prod on a Friday.

that's the --yolo flag in cc :D

rvnx 2 days ago | parent | prev | next [-]

Your tips are perfect.

Most users will just give vague tasks like "write a clone of Steam" or "create a rocket" and then blame Claude Code.

If you want AI to code for you, you have to decompose your problem like a product owner would. You can get help from AI for that as well, but you should have a plan and specifications.

Once your plan is ready, you have to decompose the problem into different modules, then make sure each module is tested.

The issue is often with the user, not the tool, as they have to learn how to use the tool first.

wordofx 2 days ago | parent [-]

> Most users will just give a vague tasks like: "write a clone of Steam" or "create a rocket" and then they blame Claude Code.

This seems like half of HN, given how much HN hates AI. Those who hate it or say it's not useful to them seem to be fighting against it rather than wanting to learn how to use it. I still haven't seen good examples of it not working, even with obscure languages or proprietary stuff.

drzaiusx11 2 days ago | parent | next [-]

Anyone who has mentored as part of a junior engineer internship program AND has attempted to use current-gen AI tooling will notice the parallels immediately. There are key differences, though, that are worth highlighting.

The main difference is that with the current batch of genai tools, the AI's context resets after use, whereas a (good) intern truly learns from prior behavior.

Additionally, as you point out, the language and frameworks need to be part of the training set, since the AI isn't really "learning"; it's just prepopulating a context window for its pre-existing knowledge (token prediction), so YMMV depending on hidden variables from the (secret to you, the consumer) training data and weights. I use Ruby primarily these days, which is solidly in the "boring tech" camp, and most AIs fail to produce useful output that isn't Rails boilerplate.

If I did all my IC contributions via directed intern commits I'd leave the industry out of frustration. Using only AI outputs for producing code changes would be akin to torture (personally.)

Edit: To clarify, I'm not against AI use; I'm just stating that with the current generation of tools it is a pretty lackluster experience when it comes to net-new code generation. It excels at one-off throwaway scripts and at making large, tedious refactors less of a drudge. I wouldn't pivot to it being my primary method of code generation until some of the more blatant productivity losses are addressed.

hn_acc1 2 days ago | parent | prev | next [-]

When its best suggestion (for inline typing) is to bring back a one-off experiment in a different git worktree from 3 months ago that I only needed that one time... it does make me wonder.

Now, it's not always useless. It's GREAT at adding debugging output and knowing which variables I just added and thus want to add to the debugging output. And that does save me time.

And it does surprise me sometimes with how well it picks up on my thinking and makes a good suggestion.

But I can honestly only accept maybe 15-20% of the suggestions it makes - the rest are often totally different from what I'm working on / trying to do.

And it's C++. But we have a very custom library to do user-space context switching, and everything is built on that.

halfcat 2 days ago | parent | prev | next [-]

> not wanting to learn how to use it

I kind of feel this. I’ll code for days and forget to eat or shower. I love it. Using Claude code is oddly unsatisfying to me. Probably a different skillset, one that doesn’t hit my obsessive tendencies for whatever reason.

I could see being obsessed with some future flavor of it, and I think it would be some change with the interface, something more visual (gamified?). Not low-code per se, but some kind of mashup of current functionality with graph database visualization (not just node force graphs, something more functional but more ergonomic). I haven’t seen anything that does this well, yet.

LtWorf 2 days ago | parent | prev [-]

If you have to iterate 10 times, that is "not working", since it already wasted way more time than doing it manually to begin with.

ccorcos 2 days ago | parent | prev | next [-]

Seems like this logic could all be represented in Claude.md and some agents. Has anyone done this? I’d love to just import that into my project because I’m using some of these tactics but it’s fairly manual and tedious.

2 days ago | parent | prev | next [-]
[deleted]
2 days ago | parent | prev | next [-]
[deleted]
biggc 2 days ago | parent | prev | next [-]

This sounds a lot like making a change yourself.

therein 2 days ago | parent [-]

It appeals to some people because they'd rather manage a bot and get it to do what they tell it than do it themselves.

rmonvfer 2 days ago | parent | prev | next [-]

I’d like to add: keep some kind of development documentation where you describe in detail the patterns and architecture of your application and it’s components.

I’ve seen incredible improvements just by doing this and using precise prompting to get Claude to implement full services by itself, tests included. Of course it requires manual correction later but just telling Claude to check the development documentation before starting work on a feature prevents most hallucinations (that and telling it to use the Context7 MCP for external documentation), at least in my experience.

The downside to this is that 30% of your context window will be filled with documentation but hey, at least it won’t hallucinate API methods or completely forget that it shouldn’t reimplement something.

Just my 2 cents.

salty_frog 2 days ago | parent | prev | next [-]

This is my algorithm for wetware LLMs.

whateveracct 2 days ago | parent | prev | next [-]

that sounds like just coding it yourself with extra steps

baq 2 days ago | parent [-]

Exactly, then you launch ten copies of yourself and write code to manage that yourself, maybe.

renegat0x0 2 days ago | parent | prev [-]

Huh, I thought that AI was made to be magic. Click and it generates code. Turns out it is like magic, but you are an apprentice, and still have to learn how to wield it.

dotancohen 2 days ago | parent [-]

All sufficiently advanced technology...