simianwords 2 hours ago

Honest question: why is this not enough?

If the code passes tests and also works at the functionality level, what difference does it make whether you've read the code or not?

You could come up with pathological cases, like: it passed the tests by deleting them, or the code it wrote is extremely messy.

But we know that LLMs are way smarter than this. There's a very, very low chance of this happening, and even if it does, a quick glance at the code can fix it.

kranner 2 hours ago | parent | next [-]

You can't test everything. The input space may be infinite. The app may feel janky. You can't even be sure you're testing all that can be tested.
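
A toy sketch of the point, with a hypothetical function and tests (Python): the suite is green, and the function is still wrong for almost the entire input space.

    import unittest

    def is_even(n: int) -> bool:
        # Buggy: only recognises a few final digits, but it happens
        # to satisfy every test anyone thought to write below.
        return str(n)[-1] in {"0", "2", "4"}

    class TestIsEven(unittest.TestCase):
        def test_examples(self):
            self.assertTrue(is_even(2))
            self.assertTrue(is_even(4))
            self.assertFalse(is_even(3))

    if __name__ == "__main__":
        unittest.main()  # passes -- yet is_even(6) and is_even(8) return False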

The code may seem to work functionally on day 1. Will it continue to seem to work on day 30? Most often it doesn't.

And in my experience, the chances of LLMs fucking up are hardly "very, very low". Maybe it's a skill issue on my part, but it's also the case that the spec is sometimes discovered as the app is being built. I'm sure this is not the case if you're essentially summoning up code that already exists in the training set, even if the LLM has to port it from another language, and LLMs can be useful in parts here and there. But turning the controls over to the infinite monkey machine has not worked out for me so far.

CuriouslyC 11 minutes ago | parent [-]

If you care about performance, test it (stress test).

If you care about security, test it (red teaming).

If you care about maintainability, test it (advanced code analysis).

Your eyeballs are super fallible; that's why bad engineers exist. Get rigorous.
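
A minimal sketch of what that rigor can look like, using property-based testing with the hypothesis library (the slugify function here is hypothetical, just something to test):

    # pip install hypothesis
    from hypothesis import given, strategies as st

    def slugify(text: str) -> str:
        # Hypothetical function under test.
        return "-".join(text.lower().split())

    @given(st.text())
    def test_no_whitespace_in_slug(s):
        # A property checked against hundreds of generated inputs,
        # not just the handful of examples a reviewer would eyeball.
        assert " " not in slugify(s)

    @given(st.text())
    def test_slugify_is_idempotent(s):
        once = slugify(s)
        assert slugify(once) == once

Run it with pytest and hypothesis will throw empty strings, unicode, and other edge cases at it automatically.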

throwup238 2 hours ago | parent | prev | next [-]

It depends on the scale of complexity you're working at and who your users are going to be. I've found that it's trivial to have Claude Code spit out so much functionality that even just properly verifying it by hand becomes a gargantuan task. I end up manually testing only the pieces I'm familiar with, which is fine if there's a QA department that can do a full run-through of the feature and is prepared to deal with vibe-coding pitfalls, but not so much on open source projects, where slop gets shipped and unfamiliar users get stuck with bugs they can't possibly troubleshoot. Writing the code from scratch The Old Way™ leaves a lot less room for shipping convincing but non-functional slop, because the dev has to work through it before shipping.

The most immediate example I can think of is the beans LLM workflow tracker. It's insane that it's measured in the hundreds of thousands of LoC, and getting that thing set up in a repo is a mess. I had to use GitHub Copilot to investigate the repo just to figure out the current setup method. This wouldn't fly at my employer, but a lot of projects are going to be a lot less scrupulous.

You can see the effects in popular consumer-facing apps too: Anthropic has drunk way too much of its own Kool-Aid, and now I get 10-50% failure rates on messages in their iOS app, depending on the day. Some of their devs have publicly said that Claude writes 100% of their code, and it's starting to show. Intermittent network failures and retries have been a solved problem for decades, ffs!
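
For reference, the decades-old remedy is a few lines of retry with exponential backoff and jitter (a generic standard-library sketch; the names are illustrative, not Anthropic's actual client code):

    import random
    import time
    import urllib.error
    import urllib.request

    def fetch_with_retry(url: str, attempts: int = 5) -> bytes:
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # out of retries; surface the real error
                # Back off exponentially, with jitter to avoid
                # synchronized retry storms.
                time.sleep(2 ** attempt + random.random())
        raise AssertionError("unreachable")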

jdjdjssh 2 hours ago | parent | prev [-]

> If the code passes tests, and also works at the functionality level

Why doesn’t outsourcing work if this is all that is needed?

jmathai 2 hours ago | parent | next [-]

We haven't fully proven that it is any different. Not at scale, anyway. It took a decade for the seams of outsourcing to show.

But I have a hypothesis.

The quality of the output, when the people writing the code don't own the long-term outcome or maintenance, is very poor.

This is not the case with AI in the same sense it is with human contractors: the developer driving the AI still owns the outcome and the maintenance.

simianwords 2 hours ago | parent | prev [-]

Why do we have managers if managers don’t have accountability?

jdjdjssh an hour ago | parent [-]

I’m not sure what you’re getting at. I’m saying there’s a lot more to creating useful software than “tests pass / limited functionality checks work” from a purely technical perspective.