| ▲ | Claude built a system in 3 rounds, latent bugs from round 1 exploded in round 3 (github.com) |
| 29 points by yogthos 2 days ago | 33 comments |
| |
|
| ▲ | flowerthoughts an hour ago | parent | next [-] |
I've used the analogy of a circular saw before ("it's not really sawing, because you can't feel the wood..."). It's easy to just slap on a Skil saw, push through the beam, and it'll be somewhat straight. But when every manual stroke counts, there's enough time on a human time scale to correct every little mistake. It's definitely possible to become skilled at using the circular saw, but it takes effort that it feels like you don't need at first. This is similar. LLMs are so powerful for writing code that it's easy to become complacent and forget your role as the engineer using the tool: guaranteeing correctness, security, safety and performance of the end result. When you're not invested in every if-statement, forgetting to check edge cases is really easy to do. And as much as I like Claude writing test cases for me, I also have to ensure the coverage is decent, that the implicit assumptions made about external library code are correct, etc. It takes a lot of effort to do it right. I don't know why Mycelium thinks they invented interfaces for module boundaries, but I'm pretty sure they are still as susceptible to that "0" not behaving as you'd expect, or the empty string being interpreted as "missing." Or the CSG algorithm working, except if your hole edges are incident with some boundary edges. |
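The "0" / empty-string trap the parent mentions is easy to demonstrate. A minimal Python sketch (the config example and names are hypothetical, chosen only to illustrate the pitfall):

```python
# The falsy-value pitfall: 0, "", and None all count as "missing"
# if you test with plain truthiness.

def get_timeout(config):
    # BUG: a legitimate timeout of 0 is silently replaced by the default,
    # because `0 or 30` evaluates to 30.
    return config.get("timeout") or 30

def get_timeout_fixed(config):
    # Only substitute the default when the key is genuinely absent.
    t = config.get("timeout")
    return 30 if t is None else t

print(get_timeout({"timeout": 0}))        # 30 -- surprise!
print(get_timeout_fixed({"timeout": 0}))  # 0
print(get_timeout_fixed({}))              # 30
```

Typed interfaces at module boundaries don't catch this class of bug: `0` and `""` are perfectly valid values of their declared types, so only an explicit "present vs. absent" distinction (or a test) does.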
| |
| ▲ | zihotki 25 minutes ago | parent [-] | | Your analogy with a Skil saw is genius! You can cut much faster but it's also much more dangerous. Just like the AI indeed. |
|
|
| ▲ | reedf1 2 hours ago | parent | prev | next [-] |
| It's starting to become obvious that if you can't effectively use AI to build systems it is a skill issue. At the moment it is a mysterious, occasionally fickle, tool - but if you provide the correct feedback mechanisms and provide small tweaks and context at idiosyncrasies, it's possible to get agents to reliably build very complex systems. |
| |
| ▲ | ptak_dev 4 minutes ago | parent | next [-] | | Partially agree, but I think "skill issue" undersells the genuine reliability problem the original post is describing. The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it. But the latent bug problem isn't really a skill issue. It's a property of how these systems work: the agent optimises for making the current test pass, not for building something that stays correct as requirements change. Round 1 decisions get baked in as assumptions that round 3 never questions — and no amount of better prompting fixes that. The fix isn't better prompting. It's treating agent-generated code with the same scepticism you'd apply to code from a contractor who won't be around to maintain it — more tests, explicit invariants, and not letting the agent touch the architecture without a human reviewing the design first. | |
| ▲ | bartread 7 minutes ago | parent | prev | next [-] | | > It's starting to become obvious that if you can't effectively use AI to build systems it is a skill issue. I think it's fair to say that you can get a long way with Claude very quickly if you're an individual or part of a very small team working on a greenfield project. Certainly at project sizes up to around 100k lines of code, it's pretty great. But I've been working at startups off and on since 2024. My last "big" job was with a company that had a codebase well into the millions of lines of code. And whilst I keep in contact with a bunch of the team there, and I know they do use Claude and other similar tools, I don't get the vibe it's having quite the same impact. And these are very talented engineers, so I don't think it's a skill issue either. I think it's entirely possible that Claude is a great tool for bootstrapping and/or for solo devs or very small teams, but becomes considerably less effective when scaled across very large codebases, multiple teams, etc. For me, on that last point, the jury is out. Hopefully the company I'm working with now grows to a point where that becomes a problem I need to worry about but, in the meantime, Claude is doing great for us. | |
| ▲ | Zafira 2 hours ago | parent | prev | next [-] | | > At the moment it is a mysterious, occasionally fickle, tool - but if you provide the correct feedback mechanisms and provide small tweaks and context at idiosyncrasies, it's possible to get agents to reliably build very complex systems. This sounds like arguing you can use these models to beat a game of whack-a-mole if you just know all the unknown unknowns and prompt it correctly about them. This is an assertion that is impossible to prove or disprove. | | |
| ▲ | rafaelmn an hour ago | parent | next [-] | | No, it's more like: if you knew how to build it before, LLM agents help you build it faster. There's really no useful analogy I can think of, but it fits my current role perfectly because my work is constantly interrupted by prod support, coordination, planning, context switching between issues etc. I rarely have blocks of "flow time" to do focused work. With LLMs I can keep progressing in parallel, and then when I get to the block of time where I can actually dive deep it's review and guidance again - focus on high impact stuff instead of the noise. I don't think I'm any faster with this than my theoretical speed (LLMs spend a lot of time rebuilding context between steps, I have a feeling the current level of agents is terrible at maintaining context for larger tasks, and also I'm guessing the model context length is quite a lie - they might support working with 100k tokens but agents keep reloading stuff to context because old stuff is ignored). In practice I can get more done because I can get into the flow and back onto the task a lot faster. Will see how this pans out long term, but in my current role I don't think there are alternatives; my performance would be shit otherwise. | | |
| ▲ | dkdbejwi383 an hour ago | parent [-] | | You could probably replace LLM with "junior engineer" here as it sounds like you're basically a manager now. The big negative that LLMs have in comparison with junior engineers is that they can't learn and internalise new information based on feedback. | | |
| ▲ | lukan an hour ago | parent [-] | | "The big negative that LLMs have in comparison with junior engineers is that they can't learn and internalise new information based on feedback." No, but they can take "notes" and can load those notes into context. That does work, but is of course not so easy as it is with humans. It is all about cleaning up and maintaining a tidy context. |
|
| |
| ▲ | threethirtytwo 20 minutes ago | parent | prev | next [-] | | >This is an assertion that is impossible to prove or disprove. This is a joke right? There are complex systems that exist today that are built exclusively via AI. Is that not obvious? The existence of such complex systems IS proof. I don't understand how people walk around claiming there's no proof? Really? | |
| ▲ | reedf1 43 minutes ago | parent | prev [-] | | The same is true with human engineers - isn't this just what engineering is? |
| |
| ▲ | zihotki 27 minutes ago | parent | prev | next [-] | | According to the https://blog.katanaquant.com/p/your-llm-doesnt-write-correct... previously discussed on HN, it may be at least partially true: > The vibes are not enough. Define what correct means. Then measure. | |
| ▲ | Grimblewald 2 hours ago | parent | prev | next [-] | | Say I buy into your mysticism-based take: is it a useful tool if it blows up in damn well near every professional's face? Let's say I accept that you and you alone have the deep majiks required to use this tool correctly, when major platform devs could not so far; what makes this tool useful? Billions of dollars and environment-ruining levels of worth it? I'd say the only real use for these tools to date has been mass surveillance, and sometimes semi-useful boilerplate. | | |
| ▲ | mikkupikku 3 minutes ago | parent | next [-] | | > is it a useful tool if it blows up in damn well near every professionals face It doesn't, that's ego-preserving cope. Saying that this stuff doesn't work for "damn well near every professional" because it doesn't work for you is like a thief saying "Everybody else steals, why are you picking on me"? It's not true, it's something you believe to protect your own self-image. | |
| ▲ | reedf1 37 minutes ago | parent | prev [-] | | Honestly people are in such a weird place with this shit. I'm not saying don't read the fucking code - but I managed to get my setup to write 100k lines of indistinguishable SWE code in a week or so. The main limitation was my reading speed. This is something like a 10x speedup for me. | | |
| ▲ | Grimblewald 19 minutes ago | parent [-] | | How does one verify 100k lines in a week? Let alone evaluate it as SWE-equivalent? That's superhuman. I like to think I am pretty good at what I do, but really critically engaging with 100k lines in a week is beyond even 10 of me. Forgive my skepticism, but I'm going to hazard the guess that you don't know what the fuck you're doing. You've lost your goddamn mind if you think you're doing anything other than a skim read at a rate of 42 lines a minute for your entire work day without a break. |
|
| |
| ▲ | mikkupikku 8 minutes ago | parent | prev | next [-] | | Truth Nuke | |
| ▲ | croes an hour ago | parent | prev | next [-] | | > if you provide the correct feedback And how do you define correct feedback? If the output is correct? | | |
| ▲ | reedf1 44 minutes ago | parent [-] | | I don't know if you deliberately cut-off the full point, but for the benefit of those with tired eyes I said 'feedback mechanisms', i.e. feedback in the control system sense. |
| |
| ▲ | stavros 2 hours ago | parent | prev [-] | | I'd agree, I've been building a personal assistant (https://github.com/skorokithakis/stavrobot) and I'm amazed that, for the first time ever, LLMs manage to build reliably, with much fewer bugs than I'd expect from a human, and without the repo devolving to unmaintainability after a few cycles. It's really amazing, we've crossed a threshold, and I don't know what that means for our jobs. | | |
| ▲ | Grimblewald 5 minutes ago | parent [-] | | No bugs means nothing if bugs get hidden, and LLMs are great at hiding bugs and will absolutely fail to find some fairly critical ones. Your own repo, which is slop at best, fails to meet its core premise: > Another AI agent. This one is awesome, though, and very secure. It isn't secure. It took me less than three minutes to find a vulnerability. Start engaging with your own code; it isn't as good as you think it is. |
|
|
|
| ▲ | karel-3d 2 hours ago | parent | prev | next [-] |
I have no idea what I am reading > Mycelium structures applications as directed graphs of pure data transformations. Each node (cell) has explicit input/output schemas. Cells are developed and tested in complete isolation, then composed into workflows that are validated at compile time. Routing between cells is determined by dispatch predicates defined at the workflow level — handlers compute data, the graph decides where it goes. No, still don't understand > Mycelium uses Maestro state machines and Malli contracts to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement. Nope, still don't |
| |
| ▲ | harperlee an hour ago | parent | next [-] | | I don't understand why the poster (who is the author) links us to a slop report of a test for their library. It would be much more effective to move part of this info into the README, where we get the context of what they want to achieve (where there is a very clear "Why?" section), and then link to it instead. I have flagged it as AI slop. | |
| ▲ | karel-3d an hour ago | parent [-] | | I don't understand LISP or Clojure, but it seems to be some kind of library for making web services out of LISP, which has some separate components that are somehow well defined. And somehow it's all related to AI. Again I don't know much about Clojure and I am too slow for functional programming in general. |
| |
| ▲ | IshKebab an hour ago | parent | prev | next [-] | | Yeah it reads very Time Cube... | |
| ▲ | _flux an hour ago | parent | prev [-] | | The top-level README gives a bit better idea. Armed with that, the explanation might sound a bit more understandable. I'm not familiar with the project (or Clojure), but let me try to explain! > Mycelium structures applications as directed graphs of pure data transformations. There is a graph that describes how the data flows in the system. `fn(x) -> x + 1` in a hypothetical language would be a node that takes in a value and outputs a value. The graph would then arrange for that function to be called as a result of a previous node computing the parameter x for it. > Each node (cell) has explicit input/output schemas. The input and output of a node must comply with a defined schema, which I presume is checked at runtime, as Clojure is a dynamically typed language. So functions (aka nodes) have input and output types, and presumably they should try to be pure. My guess is there should be nodes dedicated to side effects. > Cells are developed and tested in complete isolation, then composed into workflows that are validated at compile time. Sounds like they are pure functions. Workflows are validated at compile time, even if the nodes themselves are in Clojure. > Routing between cells is determined by dispatch predicates defined at the workflow level — handlers compute data, the graph decides where it goes. When the graph is built, you don't just travel all outgoing edges from a node to the next; you can place predicates on those edges. The aforementioned nodes do not have these predicates, so I suppose the predicates would be their own small pure-ish functions with the same input data as a node would get, but whose output value is only a boolean. 
> Mycelium uses Maestro state machines and Maestro is a Clojure library for Finite State Machines: https://github.com/yogthos/maestro > Malli contracts Malli looks like a parsing/data-structure specification EDSL for Clojure: https://github.com/metosin/malli > to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement. Well, beats me. I don't know what "The Law of the Graph" is, and the Internet doesn't seem to know either. I suppose it tries to say that, from the processing graph, you can see that, given input data at the ingress of the graph, you have high confidence you will get the expected data at the final egress. I do think these kinds of guardrails can be beneficial for AI agents developing code. I feel that for that application some additional level of redundancy can improve code quality, even if the guards are generated by the AI to begin with. |
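The shape described above (pure cells with input/output schemas, composed into a graph whose edges carry dispatch predicates) can be sketched in a few lines. This is a hypothetical illustration of the concept only, not Mycelium's actual API (Mycelium is Clojure, and every name below is made up); the schema check is a crude stand-in for what Malli would do:

```python
# Hypothetical sketch: a directed graph of pure, schema-checked transformations.

def check(schema, data):
    """Validate that data has the keys/types the schema demands (stand-in for Malli)."""
    for key, typ in schema.items():
        if key not in data or not isinstance(data[key], typ):
            raise TypeError(f"schema violation: {key!r} should be {typ.__name__}")
    return data

class Cell:
    """A pure transformation with explicit input and output schemas."""
    def __init__(self, fn, in_schema, out_schema):
        self.fn, self.in_schema, self.out_schema = fn, in_schema, out_schema

    def __call__(self, data):
        # Validate on the way in and on the way out; the fn itself stays pure.
        return check(self.out_schema, self.fn(check(self.in_schema, data)))

def run(graph, start, data):
    """Walk the graph: run a cell, then follow the first edge whose predicate matches.

    The cell only computes data; the edge predicates decide where it goes next.
    """
    node = start
    while node is not None:
        data = graph[node]["cell"](data)
        node = next((dst for pred, dst in graph[node]["edges"] if pred(data)), None)
    return data

# Two tiny cells composed into a workflow with predicate routing.
parse  = Cell(lambda d: {**d, "n": int(d["raw"])}, {"raw": str}, {"n": int})
double = Cell(lambda d: {"n": d["n"] * 2},          {"n": int},  {"n": int})

graph = {
    "parse":  {"cell": parse,  "edges": [(lambda d: d["n"] >= 0, "double")]},
    "double": {"cell": double, "edges": []},
}

print(run(graph, "parse", {"raw": "21"}))  # {'n': 42}
```

In a real system the compile-time validation mentioned in the README would presumably check that each edge's destination cell accepts the source cell's output schema before anything runs; here that check is elided.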
|
|
| ▲ | NotGMan 2 hours ago | parent | prev [-] |
| How is that different compared to a human dev that would miss bugs? |
| |
| ▲ | duskdozer 44 minutes ago | parent | next [-] | | The people using these programs are generating massive amounts of code, and you won't convince me they're actually carefully understanding most of it, if they even read it all. And it bypasses the first verification step where you are actively typing in the code that will be run. | |
| ▲ | kleiba 31 minutes ago | parent | prev | next [-] | | I didn't downvote you, but I've noticed recently a trend on HN that just asking a normal question (possibly to start a discussion) gets downvoted for no apparent reason and without any explanation. Not good etiquette, in my opinion. | | |
| ▲ | threethirtytwo 26 minutes ago | parent [-] | | I think it's all the angry people who were wrong about AI. Every week they come on HN and say AI is utter crap and useless, and every week AI becomes more and more part of the developer work flow. I would be pissed off too if I was a hypocrite who was so sure AI was total garbage and was now at the same time needing to use claude on a daily basis. A lot of developers are going through an identity crisis where their skills are becoming more and more useless and they need to attack comments like the above in a desperate but futile attempt to make themselves matter. |
| |
| ▲ | Grimblewald an hour ago | parent | prev | next [-] | | The ability to plan long term and anticipate flow-on errors. | |
| ▲ | onion2k 10 minutes ago | parent [-] | | Your chosen AI agent can and will plan as far ahead as you tell it to plan. If you tell it to do things in iteration 1 that might block it in iteration 3, then you should have told it more in iteration 1 about where you want to go. |
| |
| ▲ | nomercy400 an hour ago | parent | prev [-] | | Yes, this is simply 'technical debt'. They should try to fix technical debt before going to the next round. Of course Claude can probably also do this. |
|