turnsout 3 hours ago

This is probably entirely down to subtle changes to CC prompts/tools. I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.

Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
|
FfejL 2 hours ago

Honest, good-faith question: is CC getting better, or are you getting better at using it? And how do you know the difference? I'm an occasional user, and I can definitely see improvements in my prompts over the past couple of months.
rob 2 hours ago

I agree with you; it's personally hard to tell. For me, it has seemed to get nothing but better over the past couple of months, but I've also been working on my workflows and tooling. For example, I used to use plan mode, put everything in a single file, and then ask it to implement that file in a new session. Switching to the 'superpowers' plugin, with its own skills to brainstorm, write plans, and execute them in batches of tasks, seems to have made a big improvement and helps catch things I wouldn't have caught before. There's a similar "get shit done" plugin that I want to explore as well. The code output always looks good to me for the most part, and I've never thought it was getting dumber or anything, so I feel like a lot of the improvement I see comes down to a skill issue on my part as I try to keep up with everything. Obviously it doesn't help that there's a new way to do things every two weeks.

turnsout 2 hours ago

Good-faith answer: I can't be certain. But I've been using CC since its release, and Cursor before that (and going all the way back to GPT-3 for codegen in the Playground). After getting used to the CC workflow, the way I use it has been pretty consistent. To be specific, I use basically the same AGENTS.md with small modifications for each project, and I live almost exclusively in Plan mode with the best model (currently Opus 4.5). My initial prompting is boilerplate at this point, and looks like this:

(Explain the overall objective / problem without jumping to a solution)
(Provide all the detail / file references / past work I can think of)
(Ask it "what questions do you have for me before we build a plan?")

Then we go back and forth until we have a plan. Compared to my work with CC six months ago, it's just much more capable, able to solve more nuanced bugs, and less likely to generate spaghetti code.
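To make that workflow concrete, here is a minimal sketch of what such a boilerplate planning prompt could look like as a small Python helper. The function name, wording, and example arguments are all invented for illustration and are not the commenter's actual template:

    # Hypothetical sketch of the three-part planning prompt described above.
    # Everything here (names, wording, example file paths) is invented.
    def planning_prompt(objective: str, context: str) -> str:
        return "\n\n".join([
            f"Objective / problem (no solution yet): {objective}",
            f"Relevant details, file references, and past work: {context}",
            "What questions do you have for me before we build a plan?",
        ])

    # Example usage with made-up project details:
    print(planning_prompt(
        "Intermittent 500s from the export endpoint under load",
        "See api/export.py and the retry wrapper in utils/http.py",
    ))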
|
|
arcanemachiner an hour ago

The easiest way would be to quantize the model and serve different quants based on current demand. Higher volume == worse quant == more customers served per GPU.
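As a minimal sketch of what that could look like in practice (purely illustrative; the thresholds, quant names, and load metric below are invented and say nothing about Anthropic's actual serving stack):

    # Illustrative sketch of demand-based quantization routing: pick a cheaper
    # quant as load rises, trading output quality for requests served per GPU.
    QUANT_BY_LOAD = [
        (100.0, "bf16"),          # low load: full-precision weights, best quality
        (500.0, "int8"),          # medium load: roughly 2x requests per GPU
        (float("inf"), "int4"),   # high load: roughly 4x requests per GPU
    ]

    def pick_quant(current_qps: float) -> str:
        """Return the quantization level to serve at the current requests/sec."""
        for threshold, quant in QUANT_BY_LOAD:
            if current_qps < threshold:
                return quant
        return "int4"

    print(pick_quant(50))   # bf16
    print(pick_quant(800))  # int4

If something like this were happening, the same prompt could be answered by a differently quantized variant of the same model depending on when it was sent, with no visible change to the advertised model.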
|
billylo 2 hours ago

That's why benchmarks are useful. We all suffer from the shortcomings of human perception.
gpm 2 hours ago

Benchmarks' shortcomings are no worse... they inevitably measure something that is only close to the thing you actually care about, not the thing itself. It's entirely plausible that this decreased benchmark score is because Anthropic's initial prompting of the model was overtuned to the benchmark, and as they gain more experience with real-world use they are changing the prompt to do better at that and consequently worse at the benchmark.

billylo 2 hours ago

I wonder how best we can measure the usefulness of models going forward. Thumbs up or down? (could be useful for trends)
Usage growth from the same user over time? (as an approximation)
Tone of user responses? ("Don't do this... this is the wrong path...", etc.)
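As a rough sketch of that last idea, here is what a crude tone-based proxy could look like, assuming transcripts are available as simple role/text records; the marker phrases and data shape are invented for illustration:

    # Crude proxy for the "tone of user responses" idea: the fraction of user
    # messages that contain a frustrated follow-up phrase. Markers are invented.
    FRUSTRATION_MARKERS = ("don't do this", "wrong path", "that's not what i asked")

    def frustration_rate(transcript):
        """Fraction of user messages containing a frustration marker."""
        user_msgs = [m["text"].lower() for m in transcript if m["role"] == "user"]
        if not user_msgs:
            return 0.0
        hits = sum(any(marker in msg for marker in FRUSTRATION_MARKERS)
                   for msg in user_msgs)
        return hits / len(user_msgs)

    transcript = [
        {"role": "user", "text": "Add a retry to the fetch call"},
        {"role": "assistant", "text": "Done."},
        {"role": "user", "text": "No, don't do this... this is the wrong path"},
    ]
    print(frustration_rate(transcript))  # 0.5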
turnsout 2 hours ago

Benchmarks measure what they measure. But your subjective experience also matters.
|
|
fragebogen 3 hours ago

I was going to ask: are all the other variables accounted for? Are we really comparing apples to apples here? Still worth doing, obviously, since it serves as a good e2e evaluation, if only for curiosity's sake.