the hiding stuff is weird because the whole reason you'd want to see what Claude is doing isn't just curiosity - it's about catching when it goes off the rails before it makes a mess. like when it starts reading through your entire codebase because it misunderstood what you asked for, or when it's about to modify files you didn't want touched. the verbose mode fix is good but honestly this should've been obvious from the start - if you're letting an AI touch your files, you want to know exactly which files. not because you don't trust the tool in theory but because you need to verify it's doing what you actually meant, not what it thinks you meant. abstractions are great until they hide the thing that's about to break your build

▲ rco8786 5 hours ago | parent | next [-]

> it's about catching when it goes off the rails before it makes a mess

The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.

Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.

That said, I agree with you in the sense that the "going off the rails" problem is very much not solved even on the latest models. It's not clear to me how we can trust a team of AI agents working autonomously to actually build the right thing.

▲ g947o 4 hours ago | parent | next [-]

None of those wild experiments are running on a "real", existing codebase that is more than 6 months old. The thing they don't talk about is that nobody outside these AI companies wants to vibe code with a 10 year old codebase with 2000 enterprise customers.

As you as you start to work with a codebase that you care about and need to seriously maintain, you'll see what a mess these agents make.

▲ GoatInGrey 33 minutes ago | parent | next [-]

Even on codebases within the half-year age group, these LLMs often do perform nasty (read: ungodly verbose) implementations that become a maintainability nightmare. Even for the LLMs that wrote it all in the first place. I know this because we've had a steady trickle of clients and prospects expressing "challenges around maintainability and scalability" as they move toward "production readiness". Of course, asking if we can implement "better performing coding agents". As if improved harnessing or similar guardrails can solve what is in my view, a deeper problem.

The practical and opportunistic response is too tell them "Tough cookies" and watch the problems steadily compound into more lucrative revenue opportunities for us. I really have no remorse for these people. Because half of them were explicitly warned against this approach upfront but were psychologically incapable of adjusting expectations or delaying LLM deployment until the technology proved itself. If you've ever had your professional opinion dismissed by the same people regarding you as the SME, you understand my pain.

I suppose I'm just venting now. While we are now extracting money from the dumbassery, the client entitlement and management of their emotions that often comes with putting out these fires never makes for a good time.

▲ krastanov 4 hours ago | parent | prev | next [-]

I maintain serious code bases and I use LLM agents (and agent teams) plenty -- I just happen to review the code they write, I demand they write the code in a reviewable way, and use them mostly for menial tasks that are otherwise unpleasant timesinks I have to do myself. There are many people like me, that just quietly use these tools to automate the boring chores of dealing with mature production code bases. We are quiet because this is boring day-to-day work.

E.g. I use these tools to clean up or reorganize old tests (with coverage and diff viewers checking of things I might miss), update documentation with cross links (with documentation linters checking for errors I miss), convert tests into benchmarks running as part of CI, make log file visualizers, and many more.

These tools are amazing for dealing with the long tail of boring issues that you never get to, and when used in this fashion they actually abruptly increase the quality of the codebase.

▲

g947o 2 hours ago | parent | next [-]

It's not called vibe coding then.

▲

jmalicki 2 hours ago | parent [-]

Oh you made vibe coding work? Well then it's not vibe coding.

But any time someone mentions using AI without proof of success? Vibe coding sucks.

▲

GoatInGrey 28 minutes ago | parent | next [-]

No, what the other commenter described is narrowly scoped delegation to LLMs paired with manual review (which sounds dreadfully soul-sucking to me), not wholesale "write feature X, write the unit tests, and review the implementation for me". The latter is vibe-coding.

	▲	unshavedyak 3 minutes ago \| parent [-]
		Sidenote, i do that frequently. I also do varying levels of review, ie more/less vibe[1]. It is soul sucking to me. Despite being soul sucking, I do it because A: It lets me achieve goals despite lacking energy/time for projects that don't require the level of commitment or care that i provide professionally. B: it reduces how much RSI i experience. Typing is a serious concern for me these days. To mitigate the soul sucking i've been side projecting better review tools. Which frankly i could use for work anyway, as reviewing PRs from humans could be better too. Also inline with review tools, i think a lot of soul sucking is having to provide specificity, so i hope to be able to integrate LLMs into the review tool and speak more naturally to it. Eg i belive some IDEs (vscode? no idea) can let Claude/etc see the cursor, so you can say "this code looks incorrect" without needing to be extremely specific. A suite of tooling that improves this code sharing to Claude/etc would also reduce the inane specificity that seems to be required to make LLMs even remotely reliable for me. [1]: though we don't seem to have a term for varying amounts of vibe. Some people consider vibe to be 100% complete ignorance of the architecture/code being built. In which case imo nothing i do is vibe, which is absurd to me but i digress.

▲

lukeschlather 36 minutes ago | parent | prev | next [-]

It's not vibe coding if you personally review all the diffs for correctness.

▲

EnPissant an hour ago | parent | prev | next [-]

> According to Karpathy, vibe coding typically involves accepting AI-generated code without closely reviewing its internal structure, instead relying on results and follow-up prompts to guide changes.

What you are doing is by definition not vibe coding.

▲

dingnuts an hour ago | parent | prev [-]

[dead]

▲

peyton 4 hours ago | parent | prev [-]

Yeah esp. the latest iterations are great for stuff like “find and fix all the battery drainers.” Tests pass, everyone’s happy.

	▲	hp197 2 hours ago \| parent [-]
		(rhetorical question) You work at Apple? :p

▲ JPKab 4 hours ago | parent | prev | next [-]

I work at a company with approximately $1 million in revenue per engineer and multiple 10+ year old codebases.

We use agents very aggressively, combined with beads, tons of tests, etc.

You treat them like any developer, and review the code in PRs, provide feedback, have the agents act, and merge when it's good.

We have gained tremendous velocity and have been able to tackle far more out of the backlog that we'd been forced to keep in the icebox before.

This idea of setting the bar at "agents work without code reviews" is nuts.

▲ groundzeros2015 3 hours ago | parent [-]

Why are you using experience and authoritative framing about a technology we’ve been using for less than 6 months?

▲ kasey_junk 3 hours ago | parent | next [-]

The person they are responding with dictated an authoritative framing that isn’t true.

I know people have emotional responses to this, but if you think people aren’t effectively using agents to ship code in lots of domains, including existing legacy code bases, you are incorrect.

Do we know exactly how to do that well, of course not, we still fruitlessly argue about how humans should write software. But there is a growing body of techniques on how to do agent first development, and a lot of those techniques are naturally converging because they work.

▲ groundzeros2015 3 hours ago | parent [-]

I think programming effectiveness is inherently tied to the useful life of software, and we will need to see that play out.

This is not to suggest that AI tools do not have value but that “I just have agents writing code and it works great!” Has yet to hit its test.

▲ garciasn 2 hours ago | parent [-]

The views I see often shared here are typical of those in the trenches of the tech industry: conservative.

I get it; I do. It's rapidly challenging the paradigm that we've setup over the years in a way that it's incredibly jarring, but this is going to be our new reality or you're going to be left behind in MOST industries; highly regulated industries are a different beast.

So; instead of just out-of-hand dismissing this, figure out the best ways to integrate agents into your and your teams'/companies' workstreams. It will accelerate the work and change your role from what it is today to something different; something that takes time and experience to work with.

	▲	benterix 2 hours ago \| parent \| next [-]
		> I get it; I do. It's rapidly challenging the paradigm that we've setup over the years in a way that it's incredibly jarring, But it's not the argument. The argument is that these tools provide lower-quality output and checking this output often takes more time than doing this work oneself. It's not that "we're conservative and afraid of changes", heck, you're talking to a crowd that used to celebrate a new JS framework every week! There is a push to accept lower quality and to treat it as a new normal, and people who appreciate high-quality architecture and code express their concern.
	▲	thesz 2 hours ago \| parent \| prev \| next [-]
		`> It will accelerate the work and change your role from what it is today to something different;` We yet to see if different is good. My short experience with LLM reviewing my code is that LLM's output is overly explanatory and it slows me down. `> something that takes time and experience to work with.` So you invite us to participate in sunken cost fallacy.
	▲	groundzeros2015 2 hours ago \| parent \| prev [-]
		I don’t doubt that companies are willing to try low quality things. They play with these processes all the time. Maybe the whole industry will try it. I’m available for consulting when you need something done correctly.

▲ JPKab 2 hours ago | parent | prev | next [-]

6 months?

I've been using LLMs to augment development since early December 2023. I've expanded the scope and complexity of the changes made since then as the models grew. Before beads existed, I used a folder of markdown files for externalized memory.

Just because you were late to the party doesn't mean all of us were.

	▲	2 hours ago \| parent \| next [-]
		[deleted]
	▲	2 hours ago \| parent \| prev [-]
		[deleted]

▲ dboreham 2 hours ago | parent | prev [-]

If you hired a person six months ago and in that time they'd produced a ton of useful code for your product, wouldn't you say with authoritative framing that their hiring was a good decision?

	▲	groundzeros2015 2 hours ago \| parent [-]
		It would, but I haven’t seen that. What I’ve seen is a lot of people setting up cool agent workflows which feel very productive, but aren’t producing coherent work. This may be a result of me using tools poorly, or more likely evaluating merits which matter less than I think. But I don’t think we can see that yet as people just invented these agent workflows and we haven’t seen it yet. Note that the situation was not that different before LLMs. I’ve seen PMs with all the tickets setup, engineers making PRs with reviews, etc and not making progress on the product. The process can be emulated without substantive work.

▲ rco8786 4 hours ago | parent | prev | next [-]

That is also my experience. Doesn't even have to be a 10 year old codebase. Even a 1 year old codebase. Any one that is a serious product that is deployed in production with customers who rely on it.

Not to say that there's no value in AI written code in these codebases, because there is plenty. But this whole thing where 6 agents run overnight and "tada" in the morning with production ready code is...not real.

	▲	zerkten 4 hours ago \| parent [-]
		I don't believe that devs are the audience. They are pushing this to decision makers where they want them to think that the state of the art is further ahead than it is. These folks then think about how helpful it'd be to have 20% of that capability. When there is so much noise in the market, and everyone seems to be overtaking everyone else it, this kind of approach is the only one that gets attention. Similarly, a lot of the AGI-hype comments exist to expand the scope of the space. It's not real, but it helps to position products and win arguments based on hypotheticals.

▲ pjc50 3 hours ago | parent | prev | next [-]

Also anything that doesn't look like a SaaS app does very badly. We had an internal trial at embedded firmware and concluded the results were unsalvageably bad. It doesn't help that the embedded environment is very unfriendly to standard testing techniques, as well.

▲ JeremyNT 2 hours ago | parent | prev | next [-]

I feel like you could have correctly stated this a few months ago, but the way this is "solved" is by multiple agents that babysit each other and review their output - it's unreasonably effective.

You can get extremely good results assuming your spec is actually correct (and you're willing to chew through massive quantities of tokens / wait long enough).

▲

ldng 2 hours ago | parent [-]

And unreasonably expensive unless you are Big Corp. Die startups, die. Welcome to our Cyberpunk overlords.

	▲	whateveracct 35 minutes ago \| parent [-]
		Companies will just shift money from salaries to their Anthropic bill - what's the problem?

▲ 3 hours ago | parent | prev [-]

[deleted]

▲ pzs 3 hours ago | parent | prev | next [-]

Related question: how do we resolve the problem that we sign a blank cheque for the autonomous agents to use however many tokens they deem necessary to respond to your request? The analogy from team management: you don't just ask someone in your team to look into something only to realize three weeks later (in the absence of any updates) that they got nowhere with a problem that you expected to take less than a day to solve.

EDIT: fixed typo

▲

pjc50 3 hours ago | parent | next [-]

> blank cheque

The Bing AI summary tells me that AI companies invested $202.3 billion in AI last year. Users are going to have to pay that back at some point. This is going to be even worse as a cost control situation than AWS.

▲

thephyber 3 hours ago | parent [-]

> Users are going to have to pay that back at some point.

That’s not how VC investments work. Just because something costs a lot to build doesn’t mean that anyone will pay for it. I’m pretty sure I haven’t worked for any startup that ever returned a profit to its investors.

I suspect you are right in that inference costs currently seem underpriced so users will get nickel-and-dinked of a while until the providers leverage a better margin per user.

Some of the players are aiming for AGI. If they hit that goal, the cost is easily worth it. The remaining players are trying to capture market share and build a moat where none currently exists.

	▲	tsunamifury 2 hours ago \| parent [-]
		What planet are you living on and how do I get there. Yes currency is very rarely at times exchanged at a loss for power but rarely not for more currency down the road.

▲

rco8786 3 hours ago | parent | prev | next [-]

We'll have to solve for that sometime soon-ish I think. Claude Code has at least some sort of token estimation built-in to it now. I asked it to kick off a large agent team (~100 agents) to rewrite a bunch of SQL queries, one per agent. It did the first 10 or so, then reported back that it would cost too much to do it this way...so it "took the reins" without my permission and tried to convert each query using only the main agent and abandoned the teams. The results were bad.

But in any case, we're definitely coming up on the need for that.

▲

Kye 3 hours ago | parent | prev | next [-]

An AI product manager agent trained on all the experience of product managers setting budgets for features and holding teams to it. Am I joking? I do not know.

▲

peab 3 hours ago | parent | prev | next [-]

This seems pretty in line with how you’d manage a human - you give it a time constraint. a human isn't guaranteed to fix a problem either, and humans are paid by time

▲

3 hours ago | parent | prev [-]

[deleted]

▲ the_harpia_io 3 hours ago | parent | prev | next [-]

yeah I think that's exactly the disconnect - they're optimizing for a future where agents can actually be trusted to run autonomously, but we're not there yet. like the reliability just isn't good enough to justify hiding what it's doing. and honestly I'm not sure we'll get there by making the UX worse for humans who are actively supervising, because that's how you catch the edge cases that training data misses. idk, feels like they're solving tomorrow's problem while making today's harder

	▲	2 hours ago \| parent [-]
		[deleted]

▲ simianwords 3 hours ago | parent | prev | next [-]

>The latest "meta" in AI programming appears to be agent teams (or swarms or clusters or whatever) that are designed to run for long periods of time autonomously.

more reason to catch them otherwise we have to wait a longer time. in fact hiding is more correct if the AI was less autonomous right?

▲ KurSix 2 hours ago | parent | prev | next [-]

If they're aiming for autonomy, why have a CLI at all? Just give us a headless mode. If I'm sitting in the terminal, it means I want to control the process. Hiding logs from an operator who’s explicitly chosen to run it manually just feels weird

▲ faeyanpiraat 4 hours ago | parent | prev | next [-]

Looking at it from far is simply making something large from a smaller input, so its kind of like nondeterministic decompression.

What fills the holes are best practices, what can ruin the result is wrong assumptions.

I dont see how full autonomy can work either without checkpoints along the way.

▲ rco8786 4 hours ago | parent [-]

Totally agreed. Those assumptions often compound as well. So the AI makes one wrong decision early in the process and it affects N downstream assumptions. When they finally finish their process they've built the wrong thing. This happens with one process running. Even on latest Opus models I have to babysit and correct and redirect claude code constantly. There's zero chance that 5 claude codes running for hours without my input are going to build the thing I actually need.

And at the end of the day it's not the agents who are accountable for the code running in the production. It's the human engineers.

▲ adastra22 4 hours ago | parent | next [-]

Actually it works the other way. With multiple agents they can often correct each others mistaken assumptions. Part of the value of this approach is precisely that you do get better results with fewer hallucinated assumptions.

Still makes this change from Anthropic stupid.

▲ rco8786 3 hours ago | parent | next [-]

The corrective agent has the exact same percentage chance at making the mistake. "Correcting" an assumption that was previously correct into an incorrect one.

If a singular agent has a 1% chance of making an incorrect assumption, then 10 agents have that same 1% chance in aggregate.

▲ adastra22 2 hours ago | parent [-]

You are assuming statistical independence, which is explicitly not correct here. There is also an error in your analysis - what matters is whether they make the same wrong assumption. That is far less likely, and becomes exponentially unlikely with increasing trials.

I can attest that it works well in practice, and my organization is already deploying this technique internally.

▲ thesz an hour ago | parent [-]

How several wrong assumptions make it right with increasing trials?

▲ adastra22 an hour ago | parent [-]

You can ask Opus 4.6 to do a task and leave it running for 30min or more to attempt one-shooting it. Imagine doing this with three agents in parallel in three separate work trees. Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one. If no consensus after N runs, reframe to provide directions for a 4th attempt. Continue until a clear winning approach is found.

This is one example of an orchestration workflow. There are others.

	▲	thesz 21 minutes ago \| parent [-]
		`> Then spin up a new agent to decide which approach of the three is best on the merits. Repeat this analysis in fresh contexts and sample until there is clear consensus on one.` If there are several agents doing analysis of solutions, how do you define a consensus? Should it be unanimous or above some threshold? Are agents scores soft or hard? How threshold is defined if scores are soft? There is a whole lot of science in voting approaches, which voting approach is best here? Is it possible for analyzing agents to choose the best of wrong solutions? E.g., longest remembered table of FizzBuzz answers amongst remembered tables of FizzBuzz answers.

▲ groundzeros2015 3 hours ago | parent | prev [-]

Nonsense. If you have 16 binary decisions that’s 64k possible paths.

▲

adastra22 2 hours ago | parent [-]

These are not independent samplings.

▲

groundzeros2015 2 hours ago | parent [-]

Indeed. Doesn’t that make it worse? Prior decisions will bring up path dependent options ensuring they aren’t even close to the same path.

	▲	adastra22 an hour ago \| parent [-]
		Run a code review agent, and ask it to identify issues. For each issue, run multiple independent agents to perform independent verification of this issue. There will always be some that concur and some that disagree. But the probability distributions are vastly different for real issues vs hallucinations. If it is a real issue they are more likely to happen upon it. If it is a hallucination, they are more likely to discover the inconsistency on fresh examination. This is NOT the same as asking “are you sure?” The sycophantic nature of LLMs would make them biased on that. But fresh agents with unbiased, detached framing in the prompt will show behavior that is probabilistically consistent with the underlying truth. Consistent enough for teasing out signal from noise with agent orchestration.

▲ peyton 4 hours ago | parent | prev [-]

Take a look at the latest Codex on very-high. Claude’s astroturfed IMHO.

	▲	rco8786 3 hours ago \| parent [-]
		Can you explain more? I'm talking about LLM/agent behavior in a generalized sense, even though I used claude code as the example here. What is Codex doing differently to solve for this problem?

▲ 4 hours ago | parent | prev | next [-]

[deleted]

▲ logicchains 2 hours ago | parent | prev [-]

>Through that lens, these changes make more sense. They're not designing UX for a human sitting there watching the agent work. They're designing for horizontally scaling agents that work in uninterrupted stretches where the only thing that matters is the final output, not the steps it took to get there.

Even in that case they should still be logging what they're doing for later investigation/auditing if something goes wrong. Regardless of whether a human or an AI ends up doing the auditing.

▲ hippo22 21 minutes ago | parent | prev | next [-]

You look at what Claude’s doing to make sure it doesn’t go off the rails? Personally, I either move on to another ask in parallel or just read my phone. Trying to catch things by manually looking at its output doesn’t seem like a recipe for success.

▲ xnorswap 4 hours ago | parent | prev | next [-]

Yes, this is why I generally still use "ask for permission" prompts.

As tedious as it is a lot of the time ( And I wish there was an in-between "allow this session" not just allow once or "allow all" ), it's invaluable to catch when the model has tried to fix the problem in entirely the wrong project.

Working on a monolithic code-base with several hundred library projects, it's essential that it doesn't start digging in the wrong place.

It's better than it used to be, but the failure mode for going wrong can be extreme, I've come back to 20+ minutes of it going around in circles frustrating itself because of a wrong meaning ascribed to an instruction.

	▲	lachlan_gray 18 minutes ago \| parent \| next [-]
		fwiw there are more granular controls, where you can for example allow/deny specific bash commands, read or write access to specific files, using a glob syntax: https://code.claude.com/docs/en/settings#permission-settings You can configure it at the project level
	▲	the_harpia_io 3 hours ago \| parent \| prev [-]
		oh man the going-in-circles thing - that's the worst because you don't even know how long to let it run before you realize it's stuck. I've had similar issues where it misunderstands scope and starts making changes that cascade in ways it can't track. the 'allow this session' idea is actually really good - would be useful to have more granular control like that. honestly this is why I end up breaking work into smaller chunks and doing more prompt-response cycles rather than letting it run autonomously, but that obviously defeats the purpose of having an agent do the work

▲ aceelric 4 hours ago | parent | prev | next [-]

Exactly, and this is the best way to do code review while it's working so that you can steer it better. It's really weird that Anthropic doesn't get this.

	▲	the_harpia_io 3 hours ago \| parent [-]
		yeah the steering thing is huge - like when you can see it's about to go down the wrong path you can interrupt before it wastes time. or when you realize your prompt wasn't clear enough, you catch it early. hiding all that just means you find out later when it's already done the wrong thing, and then you're stuck trying to figure out what went wrong and how to fix it. it's the difference between collaborative coding and just hoping the black box does what you want

▲ charliea0 4 hours ago | parent | prev | next [-]

I assume it's to make it harder for competitors to train on Claude's Chain-of-Thought.

	▲	the_harpia_io 3 hours ago \| parent \| next [-]
		hm, maybe? but seems like a weird tradeoff to make UX actively worse just for that - especially since the thinking tokens are still there in the API responses anyway, just hidden in the UI. I think it's more likely they're just betting most people don't care about the intermediate steps and want faster responses, which tbh tracks with how most product teams think about 'simplification'. though it does feel short-sighted when the whole point of using these tools is trust and verification
	▲	simianwords an hour ago \| parent \| prev [-]
		Not true because it is still exposed in API

▲ faeyanpiraat 4 hours ago | parent | prev [-]

The other side of catcing going off the rails is when it wants to make edits without it reading the context I know would’ve been neccessary for a high quality change.

	▲	the_harpia_io 3 hours ago \| parent [-]
		yeah exactly - it's that confidence without understanding. like it'll make a change that looks reasonable in isolation but breaks an assumption that's only documented three files over, or relies on state that's set up elsewhere. and you can't always tell just from looking at the diff whether it actually understood the full picture or just got lucky. this is why seeing what files it's reading before it makes changes would be super helpful - at least you'd know if it missed something obvious