honeycrispy 3 days ago

A couple weeks ago I had Opus 4.5 go over my project and improve anything it could find. It "worked," but the architecture decisions it made were baffling and the result had many, many bugs. I had to rewrite half of the code. I'm not an AI hater; I love AI for tests, finding bugs, and small chores. Opus is great for specific, targeted tasks. But don't ask it to do any general architecture, because you'll soon regret it.

tda 3 days ago

Instead, you should prompt it to come up with suggestions, look for inconsistencies, etc. Then you get a list, and you pick the ones you find promising. Then you ask Claude to explain the what, why, and how of each idea. And only then do you let it implement something.
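
Roughly the loop I mean, as a sketch only (ask_model() is a made-up stand-in; here it shells out to an agent CLI in non-interactive mode, but swap in whatever tool or API you actually use):

    import subprocess

    def ask_model(prompt: str) -> str:
        # Hypothetical wrapper: adjust to whatever agent CLI/API you actually use.
        return subprocess.run(["claude", "-p", prompt],
                              capture_output=True, text=True).stdout

    listing = ask_model("Review this repo and return a numbered list of "
                        "inconsistencies or possible improvements. Do not change any code.")
    for item in [line for line in listing.splitlines() if line.strip()]:
        if input(f"Pursue {item!r}? [y/N] ").lower() != "y":
            continue  # drop the suggestions you don't find promising
        print(ask_model(f"Explain the what, why, and how of: {item}"))
        if input("Let it implement this one? [y/N] ").lower() == "y":
            ask_model(f"Implement only this change, nothing else: {item}")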

hollowturtle 2 days ago

And waste a lot of time reviewing and babysitting.

thousand_nights 3 days ago

These models work best when you know what you want to achieve and they help you get there while you guide them. "Improve anything you can find" sounds like you didn't really know what you wanted.

mcv 3 days ago

As a tool to help developers I think it's really useful. It's great at stuff people are bad at, and bad at stuff people are good at. Use it as a tool, not a replacement.

suzzer99 3 days ago

"Improve anything you can find" is like going to your mechanic and saying "I'm going on a long road trip, can you tell me anything that needs to be fixed?"

They're going to find a lot of stuff to fix.

blub 2 days ago

Doing a vehicle check-up is a pretty normal thing to do, although in my case the mandatory periodic inspections (EU law) happen often enough that I generally don't have to schedule anything extra.

The few times I did go to a shop and ask for a check-up they didn’t find anything. Just an anecdote.

oncallthrow 3 days ago

In my experience these models (including Opus) aren't very good at "improving" existing code. I'm not exactly sure why, because the code they write from scratch is generally excellent.

sothatsit 3 days ago

I like these examples that predictably show the weaknesses of current models.

This reminds me of that example where someone asked an agent to improve a codebase in a loop overnight and woke up to 100,000 lines of garbage [0]. Similarly, you see people doing side-by-side comparisons of their own implementation and what an AI did, which can also show quite effectively how poor the AI's architecture decisions can be.

This is why I think "plan modes" and spec-driven development are so effective for agents: they help avoid one of their main weaknesses.

[0] https://gricha.dev/blog/the-highest-quality-codebase

pugworthy 3 days ago

To me, this doesn't show the weakness of current models; it shows the variability of prompts and their influence on responses. Without seeing the prompt, it's hard to tell what influenced the outcome.

I had a long discussion today with a co-worker about the merits of detailed queries with lots of guidance .md documents, versus just asking fairly open-ended questions: spelling out in great detail what you want, versus generally describing the outcomes you want and then working from there.

His approach was to write a lot of agent files spelling out all kinds of things like code formatting style, well-defined personas, etc. And here's me asking vague questions like, "I'm thinking of splitting off parts of this codebase into a separate service; what do you think in general? Are there parts that might benefit from this?"

sothatsit 3 days ago

It is definitely a weakness of current models. The fact that people find ways around those weaknesses does not mean the weaknesses do not exist.

Your approach is also very similar to spec-driven development. Your spec is just a conversation instead of a planning document. Both approaches get ideas from your brain into the context window.

OccamsMirror 3 days ago

So which approach worked better?

pugworthy 2 days ago

Challenging to answer, because we're at different levels of programming. I'm a senior/architect type with many years of programming experience, and he's an ME using code to help him with data processing and analysis.

I have a hunch that if you guessed which approach each of us took based on our backgrounds, you'd think I was the one using the detailed prompts and he was the one asking the vague questions.

enraged_camel 3 days ago

>> A couple weeks ago I had Opus 4.5 go over my project and improve anything it could find. It "worked," but the architecture decisions it made were baffling and the result had many, many bugs.

So you gave it a poorly defined task, and it failed?

NewsaHackO 2 days ago

Exactly. Imagine if someone gave you a 100k LOC project and said "improve anything you can."

vbezhenar 3 days ago

I'm using AI tools to find issues in my code. 9/10 of their suggestions are utter nonsense, and fixing them would make my code worse. That said, they do find real issues, so it's worth it.

I wouldn't be surprised to find out that they'd keep finding issues indefinitely if you looped them on finding and fixing.

rleigh 3 days ago

I've found it to be terrible when you allow it to be creative. Constrain it, and it does much better.

Have you tried the planning mode? Ask it to review the codebase and identify defects, but don't let it make any changes until you've discussed each one (or each category) and planned out how to correct them. I've had it refactor code perfectly, but only when I gave it examples of exactly what I wanted it to do, or clear direction on what to do (or not do).
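
To make "constrain it" concrete, here's a made-up sketch of the kind of instruction I mean: a plan-first step, an explicit before/after example, and hard limits on scope. The wording and the names (open_conn, do_work) are purely illustrative, not a recipe:

    # Made-up illustration of a constrained, example-driven refactor request.
    # The point is the shape of the instruction, not the exact wording.
    PLAN_FIRST_PROMPT = """
    Step 1 (no edits): scan the codebase and list every call site that still
    uses the old pattern below, then stop and wait for my approval of the list.

    Old pattern:
        conn = open_conn()
        do_work(conn)
        conn.close()

    Wanted pattern:
        with open_conn() as conn:
            do_work(conn)

    Rules: no renames, no new abstractions, no edits outside the approved list.
    """
    # Step 2 (the actual refactor) only happens after the plan is reviewed,
    # ideally category by category through the tool's plan/review mode.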