anonyfox 7 hours ago

even sonnet has degraded for me right now to roughly the level of ChatGPT 3.5 back then. it took ~5 hours to get a playwright e2e test fixed that was waiting on a wrong css selector. literally, dumb as fuck. and it had still been better than opus for the last week or so... I did roughly comparable work for the last 2 weeks and it all got increasingly worse: burning more and more thinking tokens circling around nonsense and just not making 1-line changes that a junior dev would spot immediately. I'm too used to vibing now to do it by hand (yeah i know), so I kept watching, and meanwhile discovered that codex had fleshed out a nontrivial app with correct financial data flows in the same time without any fuss. I really don't get why Anthropic is losing their edge so hard recently; in my head they'd be aiming for increasing hype leading up to the IPO, not disappointment crashes from their power user base.

solenoid0937 7 hours ago | parent [-]

You are operating purely on vibes, https://marginlab.ai/trackers/claude-code-historical-perform...

anonyfox 7 hours ago | parent [-]

not rejecting reality, but I have increasing doubts about the effectiveness of these tests. and yes, it's subjective n=1, but for many months now I've been creating and shipping projects that are always forked from the same github template repository, and I essentially do the same steps with a few different brand touches and nearly muscle-memory prompting to do just the right next steps mechanically, over and over again. the amount of things getting done per step got worse and the quality degraded too, with basic things forgotten a few prompts in. as I said, n=1, but the very repetitive nature of my current work days, always doing a new thing from the exact same starting point that hasn't changed in half a year, is kind of my personal benchmark. YMMV, but on my end the effects are real, specifically when tracking hours spent on this stuff.

deaux 5 hours ago | parent [-]

You use Claude Code? Then harness changes will have had much more impact than any model "stealth nerfing".

anonyfox 5 hours ago | parent [-]

Both: CC, but also cursor with raw api calls.