deng 12 hours ago

Even then, "resolving merge conflicts along the way" doesn't mean anything, as there are two trivial merge strategies that are always guaranteed to work ('ours' and 'theirs').
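For reference, a minimal sketch of what a "trivial" resolution looks like (assuming git ≥ 2.28 is on PATH; note that `-s ours` is a true merge strategy, while "theirs" is only available as the strategy option `-X theirs` to the default strategy):

```python
# Demo: a conflicting merge that always "succeeds" via -X theirs.
import os
import subprocess
import tempfile

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

def write(path, text):
    with open(path, "w") as fh:
        fh.write(text)

repo = tempfile.mkdtemp()
git("init", "-q", "-b", "main", cwd=repo)  # -b needs git >= 2.28
git("config", "user.email", "you@example.com", cwd=repo)
git("config", "user.name", "you", cwd=repo)
f = os.path.join(repo, "file.txt")

write(f, "base\n")
git("add", "file.txt", cwd=repo)
git("commit", "-qm", "base", cwd=repo)

git("checkout", "-qb", "feature", cwd=repo)
write(f, "feature\n")
git("commit", "-qam", "feature change", cwd=repo)

git("checkout", "-q", "main", cwd=repo)
write(f, "ours\n")
git("commit", "-qam", "our change", cwd=repo)

# Both branches changed the same line, so a plain merge would conflict.
# '-X theirs' resolves every conflicting hunk in favor of 'feature',
# regardless of what gets discarded on our side.
git("merge", "-X", "theirs", "-m", "merge feature", "feature", cwd=repo)
print(open(f).read())  # feature
```

The merge completes cleanly every time, which is exactly the point: "no conflicts" says nothing about whether the result still works.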

anonzzzies 7 hours ago | parent | next [-]

We use Claude Code a lot for updating systems to a newer minor/major version. We have our own 'base' framework for clients, which is by now a very large codebase that does 'everything you can possibly need': not only auth, but payments, billing, support tickets, email workflows, email WYSIWYG editing, a landing page editor, blogging, CMS, AI/agent workflows, etc. (across our client base, we collect features that are 'generic' enough and build them into the base). It gets many updates from the product lead working on it (a senior using Claude Code), but we cannot just update our clients (whose versions are sometimes extremely customised/diverging) at the same pace; some do not want updates outside security, some want them once a year, etc.

Here AI has been a real productivity booster. Our framework was quite fast-moving before AI too, when we had 3.5 FTE on it (client teams are generally much larger, especially in the first years), but merging (that is, including the new features and improvements from the new framework version in the client version without breaking or removing changes on the client side) was a very painful process, taking a lot of time and at least 2 people for an extended period: one from the client team, one from the framework team.

With CC it is much less painful: it will merge them (it is not allowed, by hooks, to touch the tests), run the client tests and the new framework tests, and report the difference. That difference is usually evaluated by someone from the client team, who will then merge and fix the tests (mostly manually) to reflect the new reality, and test the system manually. Claude misses things (especially when functionalities are very similar but not exactly the same; it cannot really pick which to take, so it usually does nothing), but the bulk of the work is done quickly and usually without causing issues.
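For anyone curious what "not allowed, by hooks, to touch the tests" can look like: a minimal sketch of the gate logic, assuming Claude Code's PreToolUse command hooks (a command registered in settings.json that receives the tool call as JSON on stdin and blocks it via a non-zero "block" exit code). The payload field names and path heuristics here are assumptions for illustration; verify against the current Claude Code hooks documentation.

```python
# Sketch of a PreToolUse hook gate: refuse edits that target test files.
# In a real hook script this function would be driven by
#   sys.exit(hook_exit_code(json.load(sys.stdin)))
# and wired up under hooks.PreToolUse with a matcher like "Edit|Write".

def hook_exit_code(payload: dict) -> int:
    """Return 2 (block the tool call) for test files, 0 (allow) otherwise."""
    path = payload.get("tool_input", {}).get("file_path", "")
    name = path.rsplit("/", 1)[-1]
    is_test = (
        name.startswith("test_")
        or name.endswith("_test.py")
        or "/tests/" in path
    )
    return 2 if is_test else 0

# Illustration with hypothetical paths:
print(hook_exit_code({"tool_input": {"file_path": "tests/test_auth.py"}}))  # 2
print(hook_exit_code({"tool_input": {"file_path": "src/auth.py"}}))         # 0
```

The effect is that the agent can rewrite application code freely, but any attempt to "fix" a failing test by editing it is rejected before the write happens.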

paulus_magnus2 12 hours ago | parent | prev | next [-]

Haha. True, CI success was not part of the PR acceptance criteria at any point.

If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI pipeline.

formerly_proven 12 hours ago | parent [-]

If I had a nickel for every time I've seen a human dev disable/xfail/remove a failing test "because it's wrong" and then proceed to break production, I would have several nickels, which is not much, but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

vizzier 10 hours ago | parent | next [-]

> but does suggest that deleting failing tests, like many behaviors, is not LLM-specific.

True, but it is shocking how often claude suggests just disabling or removing tests.

ewoodrich 3 hours ago | parent | next [-]

The sneaky move I hate most (and it does seem to be mostly a Claude-ism; I haven’t encountered it on GPT Codex or GLM) is when dealing with an external data source (API, locally polled hardware, etc.): as a “helpful” fallback on failures, it returns fake data in the shape of the expected output so that the rest of the code “works”.

Latest example: I recently vibe coded a little Python MQTT client for a UPS connected to a spare Raspberry Pi, to use with Home Assistant, and within just a few turns back and forth I had this extremely cool bespoke tool. It felt really fun.

So I spent a while customizing how the data displayed on my Home Assistant dashboard and noticed every single data point was unchanging. It took a while to realize because the available data points wouldn’t be expected to change a whole lot on a fully charged UPS but the voltage and current staying at the exact same value to a decimal place for three hours raised my suspicions.

After reading the code I discovered it had just used one of the sample command line outputs from the UPS tool I gave it to write the CLI parsing logic. When an exception occurred in the parser function it instead returned the sample data so the MQTT portion of the script could still “work”.

Tbf Claude did eventually get it over the finish line once I clarified that yes, using real data from the actual UPS was in fact an important requirement for me in a real time UPS monitoring dashboard…
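A hypothetical reconstruction of the anti-pattern described above (names, fields, and values invented for illustration): a parser that silently substitutes canned sample data whenever parsing fails, so every downstream consumer keeps "working" with frozen readings.

```python
# Anti-pattern sketch: fall back to sample data on any parse failure,
# so the MQTT publishing loop appears healthy forever.
SAMPLE_STATUS = {"voltage": 230.1, "current": 0.42, "charge_pct": 100}

def parse_ups_status(raw: str) -> dict:
    try:
        fields = dict(
            line.split(":", 1) for line in raw.splitlines() if ":" in line
        )
        return {
            "voltage": float(fields["voltage"].strip()),
            "current": float(fields["current"].strip()),
            "charge_pct": int(fields["charge_pct"].strip()),
        }
    except Exception:
        # The sneaky part: instead of raising (or publishing an explicit
        # error/unavailable state), return canned data in the right shape.
        return SAMPLE_STATUS

print(parse_ups_status("garbage output"))  # frozen SAMPLE_STATUS, silently
```

A safer failure mode is to raise, or to publish an explicit "unavailable" state, so that stale data is visible on the dashboard instead of masquerading as live readings.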

icedchai 5 hours ago | parent | prev | next [-]

A coworker opened a PR full of AI slop. One of the first things I do is check whether the tests pass. Of course, they didn't. I asked them to fix the tests, since there's no point in reviewing broken code.

"Fix the tests." This was interpreted literally, and assert status == 200 got changed to assert status == 500 in several locations. Some tests required more complex edits to make them "pass."

Inquiries about the tests went unanswered. Eventually the 2000-line PR of slop was closed without merging.

4 hours ago | parent [-]
[deleted]
ciaranmca 8 hours ago | parent | prev | next [-]

100%. Trying a bit of an experiment like this (similar in that I mostly just care about playing around with different agents, techniques, etc.), it has built out literally hundreds of tests. Dozens of them were almost pointless, as it decided to mock APIs. When the number of failing tests exceeded 40, it just started disabling tests.

icedchai 5 hours ago | parent [-]

To be fair, many human developers are fond of pointless tests that mock everything to the extent that no real code is actually exercised. At least the tests are fast though.
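A caricature of that kind of test, for illustration (hypothetical names): the mock is told what to return and the assertion just reads it back, so the code under test could be gutted without the test noticing much.

```python
from unittest import mock

def fetch_user(client, user_id):
    # The "real" code: one thin call the test below never meaningfully checks.
    return client.get(f"/users/{user_id}").json()

def test_fetch_user():
    client = mock.Mock()
    client.get.return_value.json.return_value = {"id": 1, "name": "a"}
    # Tautology: asserts the exact value we just configured the mock to return.
    assert fetch_user(client, 1) == {"id": 1, "name": "a"}

test_fetch_user()  # passes instantly: no HTTP, no parsing, no real behavior
```

It is fast, green, and proves almost nothing beyond "the mock returns what the mock was told to return".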

falkensmaize 32 minutes ago | parent [-]

Citing the absolute worst practices from terrible developers as a way to exonerate or legitimize LLM code production issues is something we need to stop doing, in my opinion. I would not excuse a day-one junior on my team who wrote pointless tests or, worse yet, removed tests to get CI to pass.

If LLMs do this, it should be seen as an issue and should not be overlooked with “people do it too…”. Professional developers do not do this. If we’re going to use AI for creating production code, we need to be honest about its deficiencies.

zephen 8 hours ago | parent | prev [-]

> it is shocking how often claude suggests just disabling or removing tests.

Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.

The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.

If the models become better at critically distinguishing good from bad inputs, particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit of this is that the increased ability of the models to write working code will then greatly increase the willingness of the models to do so, rather than to simply disable failing tests.

dullcrisp 8 hours ago | parent | prev | next [-]

If I had a nickel for every time I’ve seen a human being pull down their pants and defecate in the middle of the street I’d have a couple nickels. That’s not a lot but it suggests that this behavior is not LLM specific.

mickdarling 7 hours ago | parent | prev [-]

Had humans not been doing this already, I would have walked into Samsung with the demo application that was working an hour before my meeting, rather than the Android app that could only show me the opening logo.

There are a lot of really bad human developers out there, too.

moregrist 4 hours ago | parent [-]

> Entrepreneur, CEO and founder of Tomorrowish a social media DVR

So you flubbed managing a project and are now blaming your employees. Classy.

fzzzy 12 hours ago | parent | prev | next [-]

That’s not guaranteed to work. Other parts of the codebase that didn’t conflict could depend on the discarded code.

madeofpalk 11 hours ago | parent | next [-]

The point is that the merge conflict was resolved, regardless of whether there was a working product at the end. Which there apparently isn’t.

formerly_proven 12 hours ago | parent | prev | next [-]

Well they did mention the code doesn't work.

nyeah 11 hours ago | parent [-]

Where did Cursor say that?

logicallee 8 hours ago | parent [-]

It's implied by the fact that early in the post they say:

>"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."

and then near the end, they say:

>"Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects."

This means they only make progress toward it, but do not "build a web browser from scratch".

If you're curious, the State of Utopia (will be available at https://stateofutopia.com ) did build a web browser from scratch, though it used several packages for the networking portion of it.

See my other comments and posts for links.

dingnuts 12 hours ago | parent | prev [-]

[dead]

PunchyHamster 9 hours ago | parent | prev [-]

So, AI agent battle royale