Both Codex and Claude got worse this week. Across every plan I retested (desktopcommander.app)
7 points by wonderwhyer 14 hours ago | 9 comments
wonderwhyer 14 hours ago
I'm building a tool to compare value across LLM provider options. Part of it tracks how many tokens you actually get from various subscriptions over time. Over the past week, multiple people asked me about it: they'd been hitting Claude and Codex limits faster than expected. So I ran the tests yesterday and reran them today. Here's what came back:

▸ ChatGPT Plus / GPT-5.5: 95M → 37M tokens/week (−61%)
▸ Claude Max 20× / Sonnet 4.6: 388M → 214M (−45%)
▸ Claude Max 20× / Opus 4.7: 248M → 162M (−35%)
▸ Claude Pro / Sonnet 4.6: 19.6M → 11.4M (−42%)
▸ Claude Pro / Opus 4.7: 15.6M → 10.2M (−35%)

All 5 of 5 retested plans dropped 35–61% in five days. None went up. Anyone else seeing similar in their own usage?
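For anyone who wants to double-check the arithmetic, here's a minimal sketch that recomputes the drops from the before/after figures quoted above. The figures are copied from the comment; the script itself is illustrative and not part of the tool:

    # Recompute the percentage drops from the quoted weekly token counts.
    # Figures are copied from the parent comment; only the arithmetic is mine.
    plans = [
        # (plan / model, tokens_before, tokens_after) -- tokens per week
        ("ChatGPT Plus / GPT-5.5",       95_000_000,  37_000_000),
        ("Claude Max 20x / Sonnet 4.6", 388_000_000, 214_000_000),
        ("Claude Max 20x / Opus 4.7",   248_000_000, 162_000_000),
        ("Claude Pro / Sonnet 4.6",      19_600_000,  11_400_000),
        ("Claude Pro / Opus 4.7",        15_600_000,  10_200_000),
    ]

    for name, before, after in plans:
        drop = (before - after) / before * 100  # percent decrease
        print(f"{name}: {before/1e6:.1f}M -> {after/1e6:.1f}M ({drop:.0f}% drop)")

    # Prints 61%, 45%, 35%, 42%, 35% -- matching the numbers above.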
| |||||||||||||||||
derbOac 13 hours ago
"Quality metrics" need much more discussion and attention, in my opinion. Not a criticism of this project — it's a good idea, it just highlights the central question of "how well is this model working?" I'm not sure it's so straightforward. | |||||||||||||||||
| |||||||||||||||||
jdw64 14 hours ago
AI always seems to perform best on the first day after release, and then its performance gradually declines. Is the AI itself degrading? Or is it because of product-policy changes, such as system prompt modifications and usage limits? Or is it both? I sometimes wonder whether degradation is simply an inherent property of LLMs themselves.

saidnooneever 14 hours ago
so happy clang's output is consistently great