Remix.run Logo
ketzo 5 hours ago

I think the core idea here is a good one.

But in many agent-skeptical pieces, I keep seeing this specific sentiment that “agent-written code is not production-ready,” and that just feels… wrong!

It’s just completely insane to me to look at the output of Claude code or Codex with frontier models and say “no, nothing that comes out of this can go straight to prod — I need to review every line.”

Yes, there are still issues, and yes, keeping mental context of your codebase’s architecture is critical, but I’m sorry, it just feels borderline archaic to pretend we’re gonna live in a world where these agents have to have a human poring over every single line they commit.

bikelang 4 hours ago | parent | next [-]

Were you not reviewing every line when a human wrote it before it went to prod? I think the output of these tools is about as good as a human would write - which means it needs thorough review if I’m going to be on the hook to resolve its issues at 2AM.

alecbz 4 hours ago | parent | next [-]

Yeah in many places we had two humans with context on every line, and now we're advocating going to zero?

AnimalMuppet 3 hours ago | parent | prev [-]

Maybe that's the distinction. If I write it, you can call me at 2AM. If an AI wrote it, call the AI at 2AM.

Oh, it can't take the phone call and fix the issue? Then I'm reviewing its output before it goes into prod.

bluGill 4 hours ago | parent | prev | next [-]

Maybe in the future humans won't need to pour over every line. However I quickly learn which interns I can trust and which I need to pour over their code - I don't trust AI because it has been wrong too often. I'm not saying AI is useless - I do most of my coding with an agent, but I don't trust it until I verify every line.

bensyverson 4 hours ago | parent [-]

I did this for a while… and until Opus 4.5, I couldn't fully trust the model. But at this point, while it does make the occasional mistake, I don't need to scrutinize every line. Unit and integration tests catch the bugs we can imagine, and the bugs we can't imagine take us by surprise, which is how it has always been.

bluGill 3 hours ago | parent [-]

Even with 4.6 I find there are a lot of mistakes it makes that I won't allow. Though it is also really good at finding complex thread issues that would take me forever...

pixl97 4 hours ago | parent | prev | next [-]

We live in a world where every line of code written by a human should be reviewed by another human. We can't even do that! Nothing should go straight to prod ever, ever ever, ever.

latchkey 4 hours ago | parent | next [-]

> Nothing should go straight to prod ever, ever ever, ever.

I'm one-shotting AI code for my website without even looking at it. Straight to prod (well, github->cf worker). It is glorious.

Vegenoid 3 hours ago | parent | next [-]

Prod in this context doesn't refer to one person's website for their personal project. It refers to an environment where downtime has consequences, generally one that multiple people work on and that many people rely on.

latchkey 3 hours ago | parent | next [-]

It is not a personal project.

rkomorn 3 hours ago | parent | prev [-]

This is a bit of a no true Scotsman take but I agree with it anyway.

jon-wood 4 hours ago | parent | prev | next [-]

There's a middle ground here. Code for your website? Sure, whatever, I assume you're not Dell and the cost of your website being unavailable to some subset of users for a minute doesn't have 5 zeroes on the end of it. If you're writing code being used by something that matters though you better be getting that stuff reviewed because LLMs can and will make absolutely ridiculous mistakes.

latchkey 4 hours ago | parent [-]

> There's a middle ground here.

I'm responding to this statement: "Nothing should go straight to prod ever, ever ever, ever."

dirkc 4 hours ago | parent | prev | next [-]

It's tough to not interpret this as "I don't care about my website". Do you not check the copy? Or what if AI one-shots something that will harm your reputation in the metadata?

latchkey 3 hours ago | parent [-]

Then I'll read the diffs after the fact and have fix AI it. ¯\_(ツ)_/¯

dirkc 3 hours ago | parent [-]

That sounds better. I assume the stakes are low enough that you are happy reviewing after the fact, but setting up a workflow to check the diffs before pushing to production shouldn't be too difficult

latchkey 3 hours ago | parent [-]

Of course. I could do a PR review process, but what's the point. It is just a static website.

ehsanu1 4 hours ago | parent | prev | next [-]

That a personal website? Prod means different things in different contexts. Even then, I'd be a bit worried about prompt injection unless you control your context closely (no web access etc).

latchkey 4 hours ago | parent [-]

Prompt injection?! Give me an example.

bikelang 4 hours ago | parent | prev [-]

Were people reviewing your hobby projects previously? Were you on-call for your hobby website? If not - then it sounds like nothing changed?

latchkey 4 hours ago | parent [-]

This is my business website.

pixl97 3 hours ago | parent [-]

[Note: It may be very risky to submit anything to this users site]

I'm not sure doing silly things, then advertizing it is a great way to do business, but to each their own.

latchkey 3 hours ago | parent [-]

So many assumptions.

It is a static website hosted on CF workers.

bdangubic 4 hours ago | parent | prev [-]

> Nothing should go straight to prod ever, ever ever, ever

Air Traffic Controller software - sure. 99% of other softwares around that are not mission-critical (like Facebook) just punch it to production - "move fast and break shit" has been cool way before "AI"

alecbz 3 hours ago | parent [-]

There's a lot of software in between Air Traffic Controller and Facebook. And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes? I'd think at this point that'd be considered a fairly severe incident.

Even if we ignore criticality, things just get really messy and confusing if you push a bunch of broken stuff and only try to start understanding what's actually going on after it's already causing issues.

bdangubic 3 hours ago | parent [-]

> And honestly would Meta be okay with Instagram or Facebook going down even for just a few minutes?

sure, they coined the term “move fast and break things”

and not every “bug” brings the system down, there is bugs after bugs after bugs in both facebook and insta being pushed to production daily, it is fine… it is (almost) always fine. if you are at a place where “deploying to production” is a “thing” you better be at some super mission-critical-lives-at-stake project or you should find another project to work on.

alecbz an hour ago | parent | next [-]

>sure, they coined the term “move fast and break things”

Yeah I'm aware, but as any company gets larger and has more and more traffic (and money) dependent on their existing systems working, keeping those systems working becomes more and more important.

There's lots of things worth protecting to ensure that people keep using your product that fall short of "lives are at stake". Of course it's a spectrum but lots of large enterprises that aren't saving lives but still care a lot about making sure their software keeps running.

pixl97 2 hours ago | parent | prev [-]

> there is bugs after bugs after bugs

These are the bugs after bugs after bugs after bugs after bugs.

Simply put they are going through dev, QA, and UAT first before they are the bugs that we see. When you're running an organization using software of any size writing bugs that takes the software down is extremely easy, data corruption even easier.

bdangubic 2 hours ago | parent [-]

I wholeheartedly agree. I just don't agree with:

> We live in a world where every line of code written by a human should be reviewed by another human. We can't even do that! Nothing should go straight to prod ever, ever ever, ever

Things should 100% go to prod whenever they need to go to prod. While this in theory makes sense, there is insane amount of ceremony in large number of places I have seen personally where it takes an act of congress to deploy to production all the while it is just ceremony, people are hunting other people with links to PR sent to various slack channels "hey anyone available to take a look at this" and then someone is like "I know nothing about that service/system but I'll look at approve." I would wager a high wager that this "we must review every line of code" - where actually implemented - is largely a ceremony. Today I deployed three services to production without anyone looking at what I did. Deploying to production should absolutely be a non-event in places that are ran well and where right people are doing their jobs.

alecbz an hour ago | parent | next [-]

I'm sure some companies do this poorly but there's lots of places where code review happens on every PR and there's processes and systems in place to make sure it's an easy process (or at least, as easy as it should be). Many large tech companies have things pushed to prod automatically many, many times per day and still have code review for all changes going out.

fragmede an hour ago | parent | prev [-]

Even with code review, a well configured CI/CD system is going to include a wealth of automated unit and integration tests, and then also a complex deploy system involving canaries and ramp-up and blue/green deployment and flags and monitoring and alerts that's backed by a pager and on-call rotation with runbooks. Code review simply will never be perfect and catch 100% of issues, so systems are designed with that in mind.

So then then question is what's actually reasonable given today's code generating tools? 0% review seems foolish but 100% seems similarly unreal. Automated code review systems like CodeRabbit are, dare I even say, reasonable as a first line of defense these days. It all comes down too developer velocity balanced with system stability. Error budgets like Google's SRE org is able to enforce against (some) services they support are one way of accomplishing that, but those are hard to put into practice.

So then, as you say, it takes an act of Congress to get anything deployed.

So in the abstract, imo it all comes down to the quality of the automated CI/CD system, and developers being on call for their service so they feel the pain of service unreliability and don't just throw code over the wall. But it's all talk at this level of abstraction. The reality of a given company's office politics and the amount of leverage the platform teams and whatever passes for SRE there have vs the rest of the company make all the difference.

alecbz 4 hours ago | parent | prev | next [-]

How do you know which lines you need to review and which you don't?

Does it feel archaic because LLMs are clearly producing output of a quality that doesn't require any review, or because having to review all the code LLMs produce clips the productivity gains we can squeeze out of them?

layer8 3 hours ago | parent | prev | next [-]

It’s not archaic, it’s due diligence, until we can expect AI to reliably apply the same level of diligence — which we’re still pretty far off from.

manmal an hour ago | parent | prev | next [-]

The article didn't say to read every line though. Just the interesting ones. If you don't know where the interesting ones are, you have already lost.

postexitus 4 hours ago | parent | prev | next [-]

You sound like you are working on unimportant stuff. Sure, go ahead, push.

MrScruff 43 minutes ago | parent [-]

Honestly a lot of useful software is ‘unimportant’ in the sense that the consequences of introducing a bug or bad code smell aren’t that significant, and can be addressed if needed. It might well be for many projects the time saved not reviewing is worth dealing with bugs that escape testing. Also, it’s entirely possible for software to be both well engineered and useless.

bigstrat2003 3 hours ago | parent | prev | next [-]

> It’s just completely insane to me to look at the output of Claude code or Codex with frontier models and say “no, nothing that comes out of this can go straight to prod — I need to review every line.”

It's insane to me that someone can arrive at any other conclusion. LLMs very obviously put out bad code, and you have no idea where it is in their output. So you have to review it all.

SpicyLemonZest 4 hours ago | parent | prev | next [-]

It's a conversation I've had many times in my career and I'm sure I'll have many more. We've got code that seems plausible on a surface level, at a glance it solves the problem it's meant to solve - why can't we just send it to prod and address whatever problems we find with it later?

The answer is that it's very easy for bad code to cause more problems than it solves. This:

> Then one day you turn around and want to add a new feature. But the architecture, which is largely booboos at this point, doesn't allow your army of agents to make the change in a functioning way.

is not a hypothetical, but a common failure mode which routinely happens today to teams who don't think carefully enough about what they're merging. I know a team of a half-dozen people who's been working for years to dig themselves out of that hole; because of bad code they shipped in the past, changes that should have taken a couple hours without agentic support take days or weeks even with agentic support.

movedx01 an hour ago | parent | prev | next [-]

Not having a code review process is archaic engineering practice at this point(at any point in history, really), be it for human written or AI written code.

mememememememo 2 hours ago | parent | prev | next [-]

Depends on your prod.

For an early startup validating their idea, that prod can take it.

For a platform as a service used by millions, nope.

miltonlost 4 hours ago | parent | prev | next [-]

You say it's borderline archaic. I say trusting agents enough to not look at every single line is an abdication of ethics, safety, and engineering. You're just absolving yourself of any problems. I hope you aren't working in medical devices or else we're going to get another Therac-25. Please have some sort of ethics. You are going to kill people with your attitude.

tru1ock 3 hours ago | parent [-]

Almost nobody works on medical devices... And some of you lucky folks might be working with mega minds everyday, but the rest of us are but shadows and dust. I trust 5.4 or 4.6 more than most developers. Through applying specific pressure using tests and prompts I force it to built better code for my silly hobby game than I ever saw in real production software. Before those models I was still on the other side of the line but the writing is on the wall.

slopinthebag 3 hours ago | parent | prev [-]

If you keep the scope small enough it can be production ready ootb, and with some stuff (eg. a throwaway React component) who really cares. But I think it's insane to look at the output of Claude Code or Codex with frontier models and say "yep, that looks good to me".

Fwiw OP isn't an agent skeptic, he wrote one of the most popular agent frameworks.