| ▲ | Retr0id 5 hours ago |
| This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off. |
|
| ▲ | codessta 4 hours ago | parent | next [-] |
| I've always been a believer in the "post-honeymoon new-model phase" being a thing, but if you look at their analysis of how often the postEdit hooks fire, plus how Anthropic has started obfuscating thinking blocks, it seems fishy and not just vibes |
| |
| ▲ | robertfw 3 hours ago | parent [-] | | I was in this camp as well until recently; in the last 2-3 weeks I've been seeing problems I wasn't seeing before, largely in line with the issues highlighted in the ticket (ownership dodging, hacky fixes, not finishing tasks). |
|
|
| ▲ | rishabhaiover 5 hours ago | parent | prev | next [-] |
| Nope, there is a categorical degradation in output quality, especially on medium- to high-effort thinking tasks. |
|
| ▲ | gchamonlive 5 hours ago | parent | prev | next [-] |
| What about the evidence in the analysis? |
| |
| ▲ | Retr0id 4 hours ago | parent [-] | | You mean the Claude output? The same Claude that has "regressed to the point it cannot be trusted"? | | |
| ▲ | gchamonlive 3 hours ago | parent [-] | | Are you saying the OP fabricated/hallucinated the evidence? | | |
| ▲ | Retr0id 3 hours ago | parent [-] | | I'm just saying it's epistemically unrigorous to the point of being equivalent to anecdata. | | |
| ▲ | gchamonlive 3 hours ago | parent [-] | | How should one conduct such a rigorously reproducible experiment when LLMs are by nature nondeterministic and you don't have access to the model from months ago that you're comparing against? | | |
| ▲ | Retr0id 3 hours ago | parent [-] | | Something like this: https://marginlab.ai/trackers/claude-code/ (see methodology section) | | |
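[Editor's note: a minimal sketch of the longitudinal-tracking idea discussed in that methodology. A fixed prompt suite is run repeatedly against the model, and the pass rate is reported with a Wilson confidence interval so that day-to-day comparisons account for sampling noise from nondeterministic outputs. The `run_model` and `check` callables here are hypothetical stand-ins, not part of any real tracker's API.]

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate.

    Behaves better than the normal approximation when n is small or the
    rate is near 0 or 1, which is typical for short daily benchmark runs.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def benchmark(prompts, run_model, check, samples_per_prompt=5):
    """Run each fixed prompt several times against a nondeterministic model.

    Returns (pass_rate, (lo, hi)): comparing the intervals from two dates
    is more meaningful than comparing single-sample point estimates.
    """
    successes, trials = 0, 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            trials += 1
            if check(prompt, run_model(prompt)):
                successes += 1
    return successes / trials, wilson_interval(successes, trials)
```

A regression claim then becomes falsifiable: if today's interval and last month's interval don't overlap, the difference is unlikely to be sampling noise alone.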
| ▲ | gchamonlive 2 hours ago | parent | next [-] | | Kudos for the methodology. The only question I can come up with is whether the benchmarks are representative of daily use. Anecdotal or not, we see enough reports popping up to at least elicit some suspicion of service degradation that isn't shown in the charts. My hypothesis is that the degradation users experience, assuming there is merit to the anecdotes, isn't picked up by this kind of tracking strategy. | | | |
|
| ▲ | rzmmm 4 hours ago | parent | prev [-] |
| I suspect you might be right, but I don't really know. Wouldn't these proposed regressions be trivial to confirm with benchmarks? |