yandie 2 hours ago

I've been running Opus 4.8 for agentic coding and, as far as I can tell, it isn't significantly better than Sonnet 4.5. I find that pairing Google Gemini and Claude (having Gemini review Claude's code) seems to yield better results. Curious if this jump to an 80.3% score in agentic coding will make a big difference in actual usage.

testfrequency 2 hours ago | parent | next [-]

I do the same, and have excellent results. Today Gemini 3.1 Pro high diagnosed and solved, in one shot, 3 complex issues that Opus Max had been stumbling on for a few hours. This was even after I started new chats and tried debugging with Claude using Ultracode instead.

As much as people on HN like to dunk on Gemini, I’ve always found it to be better than Claude at understanding a code base.

FailMore an hour ago | parent [-]

What harness do you use Gemini in?

testfrequency 19 minutes ago | parent [-]

agy cli. It’s been rock solid.

vorticalbox 2 hours ago | parent | prev | next [-]

For the last few weeks I have been using Composer 2.5 (Cursor's fine-tune of Kimi 2.5) and honestly I don't think it's worth the price to use 5.5, Opus or Sonnet any more. For almost all the tasks I have given it, it has handled them perfectly well and is a lot cheaper.

If I get a harder challenge for it I'll jump up a model for planning, but until then it's been solid.

yandie 2 hours ago | parent | next [-]

Agree. Deepseek has also been pretty good for my personal use.

I'm struggling to see the moat for these models. What's stopping a competitor or a Chinese lab from releasing a comparable one?

qingcharles 2 hours ago | parent | prev [-]

I use Composer 2.5 because it comes free with Grok, and it's obviously better than using Grok, but it is far worse than GPT5.5 in my daily usage :(

yaodub 2 hours ago | parent | prev | next [-]

SWE-Bench measures single tasks in isolation. In a real loop the model usually loses track of what I was trying to do long before code quality becomes the issue.

jp0001 2 hours ago | parent | prev | next [-]

You should throw GPT into the mix for UX/UI and call it the three stooges.

thisisnotclear an hour ago | parent | prev | next [-]

I don't find much difference between Sonnet 4.6 and the Opus models either for most tasks that I need. Maybe my needs are not demanding enough for frontier models.

mzhaase 2 hours ago | parent | prev | next [-]

I now chat with Opus about architecture, let it make an implementation plan, and then it calls codewhale with DeepSeek in parallel on all tasks and reviews the output. Works pretty well.

yandie 2 hours ago | parent [-]

I use spec-driven development heavily (generate architecture docs + specs first). Opus still gets lost often and has to be nudged constantly. It can get super detailed on something like a deep SQL optimization, but it just can't keep hold of the bigger picture.

jansan an hour ago | parent | prev [-]

After having worked with Opus 4.7 for a while I accidentally continued a session that was using Sonnet 4.5 and it felt very dumb. The replies were much shallower than what I was used to, context was ignored, and mistakes were made. I don't think there is a big difference between Opus 4.6 and 4.8, but compared to Sonnet 4.5 the difference is palpable.