Remix.run Logo
minimaxir 3 days ago

> When I upgraded to Sonnet 4.5, it became less often that it gave phantom answers. But it still sometimes wasn’t able to handle some complex problems. We’d go back and forth and I’d try to give it hints. But with Opus 4.5, that happens much less often now.

The real annoying thing about Opus 4.5 is that it's impossible to tell most people "Opus 4.5 is an order of magnitude better than coding LLMs released just months before it" without sounding like a AI hype booster clickbaiting, but it's the counterintuitive truth. To my continual personal frustration.

sielakis 3 days ago | parent | next [-]

The thing is, it still feels like a mixed bag for me.

It's good enough for things I can define well and write okay code for.

But it is far from perfect.

It does too much, like any LLM. For example, I had some test cases for deleted methods, and I was being lazy and didn't want to read a huge test file, so I asked it to fix it.

It did. Tests were green because it mocked non-existing methods, while it should have just deleted the test cases as they were no longer needed.

Luckily, I read the code it produced.

The same thing happened with a bit of decorators I asked it to write in Python. It produced working code, tests were fine, but I reworked the code manually to 1/10 of the size proposed by Opus.

It seems magical, even thinking, but like all LLMs, it is not. It is just a trap.

j16sdiz 3 days ago | parent | next [-]

Small tips:

When LLMs try to do the wrong thing, don't correct it with new instruction. Instead, edit your last prompt and give more details there.

LLM have limited context length, and they love stuck to their previous error. Just edit the previous prompt. Don't let the failed attempt pollute your context.

sielakis 2 days ago | parent [-]

I know. It was just me being too lazy to write proper prompt.

And code size thing is not fixed by better prompt.

It also likes to even ignore reasonable plan it writen itself just to add more code.

risyachka 2 days ago | parent | prev [-]

>> but I reworked the code manually to 1/10 of the size proposed by Opus.

yeah it writes so much code its crazy - where it can be solved, like you mentioned, with 1/10th

I mean they are in the token business, so this is expected to continue as long as they possibly can as long as they are a bit better than competition.

This is what 99% of devs that praise Claude Code don't notice. The real productivity gains are much lower than 10x.

Maybe they are like 2x tops.

The real gains is that you can be lazy now.

In reality most tasks you do with LLM (not talking about greenfield projects, those are vanity metrics) can be completed by human in mostly same time with 1/10th of code - but the catch here is you need to actually think and work instead of talking to chat or watching YouTube while prompt is running, which becomes 100x harder after you use LLM extensively for a week or so.

maccard 3 days ago | parent | prev | next [-]

> The real annoying thing about Opus 4.5 is that it's impossible to tell most people "Opus 4.5 is an order of magnitude better than coding LLMs released just months before it" without sounding like a AI hype booster clickbaiting, but it's the counterintuitive truth. To my continual personal frustration.

The problem is that these increases in model performance are like the boy who cried wolf. There's only so many times you can say "this model is so much better, and does X/Y/Z more/less" and have it _still_ not be good enough for general use.

robrain 2 days ago | parent [-]

Indeed - it’s like the last hundred years of detergent marketing: “the whitest whites ever, the gentlest wash you’ve ever experienced”. Then six months later another advance from the boffins in their lab coats. All the time it’s just soap.

virtualritz 3 days ago | parent | prev | next [-]

What people do not understand is that this really depends on what language you target. So if I write Rust then you sound like an AI hype booster but if I write TS or Python maybe not so much.

From my experience Opus is only good at writing Rust. But it's great at something like TS because the amount of code it has been trained on is probably orders of magnitude bigger for the latter language.

I still use Codex high/xhigh for planning and once the plan is sound I give it to Opus (also planning). That plan I feed back to Codex for sign-off. It takes an average additional 1-2 rounds of this before Opus makes a plan that Codex says _really_ ticks all the boxes of the plan it made itself and which we gave to Opus to start with ...

That tells you something.

Also when Opus is "done" and claims so I let Codex check. Usually it has skipped the last 20% (stubs/todos/logic bugs) so Codex makes a fixup plan that then again goes to through the Codex<->Opus loop of back and forth 2-3 rounds before Codex gives the thumbs up. Only after that has Opus managed to do what the inital plan said that Codex made in the first place.

When I have Opus write TS code (or Python) I do not have to jump through those hoops. Sometimes one round of back and forth is needed but never three, as with Rust.

throwaway2027 3 days ago | parent | prev | next [-]

First impressions matter. I felt the same reading comments suggesting that people who praised GPT-5.2-Codex recently were shilling for OpenAI when it has actually gotten much better, faster and the most important one, more time before you reach your weekly limit.

inferiorhuman 3 days ago | parent | prev [-]

  To my continual personal frustration.
That's not the fault of Opus 4.5 because like all AI nonsense it's still not worth the cost. The privacy given up by having to authenticate with services like Github that used to be publicly available before getting constantly DDoSed by AI bots. The reliability and freedom that evaporated into the ether as folks run to the shelter of Cloudflare to mitigate the endless DDoS attacks at the hands of AI data scrapers. The emotional and social development stunted by having AI chatbots pretend to be a significant other and only say what folks want to hear. Whether Opus "can" code is immaterial.