When AI writes the software, who verifies it?(leodemoura.github.io)
58 points by todsacerdoti 3 hours ago | 39 comments
oakpond 2 hours ago | parent | next [-]

You do. Even the latest models still frequently write really weird code. The problem is some developers now just submit code for review that they didn't bother to read. You can tell. Code review is more important than ever imho.

sausagefeet an hour ago | parent | next [-]

I agree with you. But I have to say, it is an uphill battle and all the incentives are against you.

1. AI is meant to make us go faster, reviews are slow, the AI is smart, let it go.

2. There are plenty of AI maximizers who think we should only be writing design docs and letting the AI go to town on them.

This might be a great time to start a company: maximize the benefits of AI while you can, without someone who has never written a line of code telling you that your job is going to disappear in 12 months.

All the incentives are against someone who wants to use AI in a reasonable way, right now.

redhed an hour ago | parent [-]

I actually agree that it's a good time to start a company. There are lots of available software engineers who can actually understand code, AI at a level that can actually speed up development, and so many startups focused on AI-wrapper slop that you can make a useful product and separate yourself from the herd.

Or you can be a grifter and make some AI wrapper yourself and cash out with some VC investment. So good time for a new company either way.

bradleykingz an hour ago | parent | prev | next [-]

But it's so BORING. AI gets to do the fun part (writing code) and I'm stuck with the lame bits.

It's like watching someone else solve a puzzle, or watching someone else play a game versus playing it yourself (at best, watching is half as interesting as playing it through yourself).

HoldOnAMinute 6 minutes ago | parent | next [-]

I am really enjoying making requirements docs in an iterative process. I have a continuous improvement loop where I use the implementation to test out the docs. If I find a problem with the docs, I throw away the implementation, improve the docs, then re-implement. The kind of docs I'm getting are of amazing quality.

nz 4 minutes ago | parent | prev | next [-]

Your workplace has chosen to deprive you of the enjoyment that you got from the work. You have a few options: (1) ask for a raise proportional to the percentage of enjoyment that you lost, (2) find a workplace that does not do this, or (3) phone it in (they see you and your craft as something to be milked for cash, so maybe stop letting yourself get milked, and milk them right back, by doing _exactly_ what is asked of you and _not_ more -- let these strategic geniuses strategize using their own brains).

stretchwithme 3 minutes ago | parent | prev | next [-]

I can solve a problem in 10% of the time, dealing with an issue TODAY instead of having to put it in the backlog.

lukan 38 minutes ago | parent | prev [-]

For me the most fun part is getting something that works: designing the goal without micromanaging and getting lost in the details. I love AI for that, but it is hard to really own code this way. (At least I manually approve every change, or most of them, but still, verifying is hard.)

bitwize 18 minutes ago | parent [-]

AI has really sharpened the line between the Master Builders of the world and the Lord Businesses along this question: What, exactly, is the "fun part" of programming? Is it simply having something that works? Or is it the process of going from not having it to having it through your own efforts and the sum total of decisions you made along the way?

MrDarcy 2 hours ago | parent | prev | next [-]

It is remarkably effective to have Claude Code do the code review and assign the contribution a quality score (call it a grade) derived from your own expectations of quality.

Then don’t even bother looking at C work or below.

NitpickLawyer an hour ago | parent [-]

IME it works even better if you use another model for review. We've seen code by cc and review by gpt5.2/3 work very well.

Also works with planning before any coding sessions. Gemini + Opus + GPT-xhigh works to get a lot of questions answered before coding starts.

xienze an hour ago | parent | prev [-]

> The problem is some developers now just submit code for review that they didn't bother to read.

Can you blame them? All the AI companies are saying “this does a better job than you ever could”, every discussion topic on AI includes at least one (totally organic, I’m sure) comment along the lines of “I’ve been developing software for over twenty years and these tools are going to replace me in six months. I’m learning how to be a plumber before I’m permanently unemployed.” So when Claude spits out something that seems to work with a short smoke test, how can you blame developers for thinking “damn the hype is real. LGTM”?

jf22 an hour ago | parent | next [-]

I'm a 99% organic person (I suppose I have tooth fillings) and the new models write code better than I do.

I've been using LLMs for 14+ months now and they've exceeded my expectations.

HoldOnAMinute 4 minutes ago | parent | next [-]

Not only do they exceed expectations, but any time they fall down, you can improve your instructions to them. It's easy to get into a virtuous cycle.

xienze 18 minutes ago | parent | prev [-]

So are you learning a trade? Or do you somehow think you’ll be one of the developers “good enough” to remain employed?

jf22 11 minutes ago | parent [-]

I have a physical goods side hustle already and I'm brainstorming ideas about a trade I can do that will benefit from my programming experience.

I'm thinking HVAC or painting lines in parking lots. HVAC because I can program smart systems and parking lot lines because I can use google maps and algos to propose more efficient parking lot designs to existing business owners.

There is that paradox where, if something becomes cheaper, there is more demand, so we'll see what happens.

Finally, I'm a mediocre dev that can only handle 2-3 agents at a time so I probably won't be good enough.

bluefirebrand 30 minutes ago | parent | prev [-]

> Can you blame them?

Yes I absolutely can and do blame them

muraiki 40 minutes ago | parent | prev | next [-]

The article says that AWS's Cedar authorization policy engine is written in Lean, but it's actually written in Dafny. Writing Dafny is a lot closer to writing "normal" code rather than the proofs you see in Lean. As a non-mathematician I gave up pretty early in the Lean tutorial, while in a recent prototype I learned enough Dafny to be semi-confident in reviewing Claude's Dafny code in about half a day.

The Dafny code formed a security kernel at the core of a service, enforcing invariants like that an audit log must always be written to prior to a mutating operation being performed. Of course I still had bugs, usually from specification problems (poor spec / design) or Claude not taking the proof far enough (proving only for one of a number of related types, which could also have been a specification problem on my part).

In the end I realized I'm writing a bunch of I/O-bound glue code, and plain ol' test-driven development was fine enough for my threat model. I can review Python code more quickly and accurately than Dafny (or the Go code it eventually had to link to), so I'm back to optimizing for humans again...
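The invariant described above (an audit log must be written before any mutating operation) can be sketched in plain Python as a runtime guard rather than a Dafny proof. This is a hypothetical illustration; the names `AuditLog`, `Kernel`, and `mutate` are invented for the example and are not from the actual service.

```python
# Hypothetical sketch: every mutating operation must be preceded by a
# successful audit-log write, or the mutation is refused.

class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, action: str) -> bool:
        self.entries.append(action)
        return True  # in a real system this write could fail (disk, network)

class Kernel:
    def __init__(self, log: AuditLog):
        self.log = log
        self.state = {}

    def mutate(self, key: str, value) -> None:
        # Invariant: the audit write happens first, and the mutation is
        # only applied if it succeeded.
        if not self.log.record(f"set {key}={value!r}"):
            raise RuntimeError("audit log write failed; mutation refused")
        self.state[key] = value

log = AuditLog()
k = Kernel(log)
k.mutate("color", "blue")
assert log.entries == ["set color='blue'"]
assert k.state == {"color": "blue"}
```

The difference from the Dafny version is that this guard is checked at runtime and by tests, not proven for all executions; that trade-off is exactly the "TDD was fine enough for my threat model" point above.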

_pdp_ an hour ago | parent | prev | next [-]

I think the issue goes even deeper than verification. Verification is technically possible. You could, in theory, build a C compiler or a browser and use existing tests to confirm it works.

The harder problem is discovery: how do you build something entirely new, something that has no existing test suite to validate against?

Verification works because someone has already defined what "correct" looks like. There is perhaps a spec, or a reference implementation, or a set of expected behaviours. The system just has to match them.

But truly novel creation does not have ground truth to compare against and no predefined finish line. You are not just solving a problem. You are figuring out what the problem even is.

Avshalom an hour ago | parent [-]

Well that's a problem the software industry has been building for itself for decades.

Software has, since at least the adoption of "agile", created an industry culture of not just refusing to build to specs but of insisting that specs are impossible to get from a customer.

daveguy 36 minutes ago | parent [-]

Agile hasn't been insisting that specs are impossible to get from a customer. They have been insisting that getting specs from a customer is best performed as a dynamic process. In my opinion, that's one of agile's most significant contributions. It lines up with a learning process that doesn't assume the programmer or the customer knows the best course ahead of time.

yoaviram 27 minutes ago | parent | prev | next [-]

I just finished writing a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.

Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.

The marginal cost of code is collapsing. That single fact changes everything.

https://nonstructured.com/zen-of-ai-coding/

MattDaEskimo 25 minutes ago | parent [-]

Accountability then

yoaviram 23 minutes ago | parent [-]

Anticipating modes of failure

simonw an hour ago | parent | prev | next [-]

The "Nearly half of AI-generated code fails basic security tests" link provided in this piece is not credible in my opinion. It's a very thinly backed vendor report from a company selling security scanning software.

holtkam2 an hour ago | parent | prev | next [-]

At the end of the day you need humans who understand the business critical (or safety critical) systems that underpin the enterprise.

Someone needs to be held accountable when things go wrong. Someone needs to be able to explain to the CEO why this or that is impossible.

If you want to have AI generate all the code for your business critical software, fine, but you better make sure you understand it well. Sometimes the fastest path to deep understanding is just coding things out yourself - so be it.

This is why the truly critical software doesn’t get developed much faster when AI tools are introduced. The bottleneck isn’t how fast the code can be created, it’s how fast humans can construct their understanding before they put their careers on the line by deploying it.

Ofc… this doesn’t apply to prototypes, hackathons, POCs, etc. for those “low stakes” projects, vibe code away, if you wish.

indymike an hour ago | parent | prev | next [-]

Because of the scale of generated code, often it is the AI verifying the AI's work.

tartoran an hour ago | parent [-]

So who's verifying the AI doing the verifying or is it yet another AI layer doing that? If something goes wrong who's liable, the AI?

visarga 9 minutes ago | parent [-]

You have two paths: code tests, and AI review (which is just a vibe test of the LGTM kind). You should use both in tandem; code testing is cheap to run, and you can build more complex systems if you apply it well. But ultimately it is the user, or usage, that needs to direct testing, or you pay the price for formal verification. Most of the time it is usage: time passing reveals failure modes, and hindsight is 20/20.

bitwize 22 minutes ago | parent | prev | next [-]

Also AI.

acedTrex 2 hours ago | parent | prev | next [-]

No one does currently, and it's going to take a few very painful, high-profile failures of vital systems for this industry to RELEARN its lesson about the price of speed.

In fact it will probably need to happen a few times PER org for the dust to settle. It will take several years.

arscan an hour ago | parent | next [-]

Sure but industry cares about value (= benefit - price), not just price. Price could be astronomical, but that doesn’t matter if benefit is larger.

jcgrillo an hour ago | parent | prev [-]

I feel like people used to talk about nines of uptime more. As in more than one. These days we've lost that: https://bsky.app/profile/jkachmar.com/post/3mg4u3e6nak2p

I recall a time, maybe around 2013-2017, when people were talking about 4 or 5 nines. But sometime around then the goalposts shifted, and instead of trying to make things as reliable as possible, it became more about seeing how unreliable they can get before anyone notices or cares. It turns out people will suffer through a lot if there's some marginal benefit; remember what personal computers were like in the 1990s before memory protection? Vibe coding is just another chapter in that user-hostile epic. Convenient reliability like this author describes (if it can be achieved) might actually make things better? But my money isn't on that.
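For readers who haven't done the arithmetic behind "nines": N nines of availability allows a fraction 10^-N of the year as downtime. A quick back-of-envelope calculation:

```python
# Downtime budget per year for N nines of availability.
# N nines means availability of 1 - 10**-N, so the allowed downtime
# fraction is 10**-N of the year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(nines: int) -> float:
    return MINUTES_PER_YEAR * 10 ** (-nines)

# Three nines is over eight hours of downtime a year; five nines is
# barely five minutes.
for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} min/year")
```

That gap, from roughly 88 hours at two nines down to about 5 minutes at five, is why the shift away from talking about multiple nines is a real loss and not just a vocabulary change.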

lgl an hour ago | parent | prev | next [-]

I'm in the process of building v2.0 of my app using opus 4.6 and largely agree with this.

It's pretty awesome but still does a lot of basic idiotic stuff. I was implementing a feature that required a global keyboard shortcut and asked Opus to define it, taking into account that it should not clash with common shortcuts. It built a field where only one modifier key was required. After I mentioned that this was not safe, since users could define CTRL+C as the shortcut, and that we needed more safeguards and at least two modifier keys, I got the usual "you're absolutely right" and it proceeded to require two modifier keys. But then it also created a huge blacklist of common single-modifier shortcuts: copy, cut, paste, print, select all, and so on. Once I pointed out that this was useless since we were already forcing two modifier keys, it said I was right again and fixed it.
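The safeguard described above boils down to one small validation rule. A hypothetical sketch (in Python for brevity, not the actual .NET code): requiring two modifiers plus one regular key, which makes a blacklist of single-modifier shortcuts redundant.

```python
# Hypothetical validation rule: a global shortcut must use at least two
# modifier keys and exactly one regular key. Single-modifier combos like
# Ctrl+C then fail automatically, so no blacklist is needed.

MODIFIERS = {"ctrl", "alt", "shift", "win"}

def is_valid_shortcut(keys: set[str]) -> bool:
    mods = keys & MODIFIERS
    non_mods = keys - MODIFIERS
    return len(mods) >= 2 and len(non_mods) == 1

assert not is_valid_shortcut({"ctrl", "c"})        # clashes with copy
assert not is_valid_shortcut({"ctrl", "shift"})    # no regular key
assert is_valid_shortcut({"ctrl", "shift", "k"})   # ok
```

The point of the anecdote stands: the model produced both the weaker rule and the redundant blacklist before converging on this one-line predicate.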

The counterpoint to this idiocy is that overall it's very good at a lot of what is (in my mind) much more complicated stuff. It's a .NET app, and for things like creating models, viewmodels, usercontrols, and setting up the entire hosting DI with pretty much all the .NET best practices, it does an awesome job.

tl;dr is that training wheels are still mandatory imho

righthand 2 hours ago | parent | prev | next [-]

No one, really. Code is for humans to read and for machines to compile and execute. LLMs are enabling people to just write the code and not have anyone read it. It's solving a problem that didn't really exist (we already had code generators before LLMs).

It’s such an intoxicating copyright-abuse slot machine that a buddy who is building an OCaml+htmx tree editor told me "I always get stuck and end up going to the llm to generate code. Usually when I get to the html part." I asked if he had used a debugger before that; he said "that's a good idea".

galbar an hour ago | parent [-]

This is something I've been wondering about...

If boilerplate was such a big issue, we should have worked on improving code generation. In fact, many tools and frameworks exist that did this already:

- Rails has fantastic code generation for CRUD use cases

- IntelliJ IDEs have long been able to do many types of refactors and class generation that cover some of the boilerplate

I haven't reached a conclusion on this train of thought yet, though.

foolfoolz 2 hours ago | parent | prev | next [-]

no one wants to believe this but there will be a point soon when an ai code review meets your compliance requirements to go to production. is that 2026? no. but it will come

righthand 2 hours ago | parent [-]

We already have specifications, though, so that's not different. What happens when the AI is wrong and won't let anyone deploy to production?

rademaker 2 hours ago | parent | prev [-]

In his latest essay, Leonardo de Moura makes a compelling case that if AI is going to write a significant portion of the world’s software, then verification must scale alongside generation. Testing and code review were never sufficient guarantees, even for human-written systems; with AI accelerating output, they become fundamentally inadequate. Leo argues that the only sustainable path forward is machine-checked formal verification — shifting effort from debugging to precise specification, and from informal reasoning to mathematical proof checked by a small, auditable kernel. This is precisely the vision behind Lean: a platform where programs and proofs coexist, enabling AI not just to generate code, but to generate code with correctness guarantees. Rather than slowing development, Lean-style verification enables trustworthy automation at scale.
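To make the "programs and proofs coexist" idea concrete, here is a toy Lean example (my own illustration, not taken from the essay): a function defined alongside a theorem about it, where the proof is checked by Lean's small trusted kernel rather than by testing.

```lean
-- A toy program together with a machine-checked correctness proof.
def double (n : Nat) : Nat := n + n

-- Holds by definitional unfolding; the kernel verifies it.
theorem double_unfold (n : Nat) : double n = n + n := rfl

-- A slightly less trivial property, discharged by linear arithmetic.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

The claim in the essay is that this pattern scales: an AI can emit both the `def` and the `theorem`, and a human only needs to trust the small proof checker plus the statement of the theorem, not the generated code itself.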