| ▲ | mmaunder 8 hours ago |
| I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving a ton through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens. |
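[Ed.: the routing/budget advice above can be sketched roughly as below. The model names, thresholds, and "hard task" heuristic are hypothetical placeholders, not a real provider API.]

```python
# Rough sketch of cost-aware model routing. Model names, thresholds,
# and the "hard task" heuristic are hypothetical placeholders.
CHEAP_MODEL = "minimax-m2"
FRONTIER_MODEL = "opus-4.6"

HARD_KEYWORDS = ("refactor", "architecture", "debug", "prove")

def route(prompt: str) -> dict:
    """Pick a model plus reasoning/output budgets for one request."""
    hard = len(prompt) > 2000 or any(k in prompt.lower() for k in HARD_KEYWORDS)
    return {
        "model": FRONTIER_MODEL if hard else CHEAP_MODEL,
        "reasoning_effort": "high" if hard else "low",
        "max_tokens": 4096 if hard else 1024,  # cap output tokens to cap cost
    }

# Easy edits go to the cheap model with small budgets.
print(route("fix the typo in README.md"))
```

The point is that the routing decision is made once per request, before any tokens are spent, so the frontier model only burns its larger reasoning budget on prompts that plausibly need it.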
|
| ▲ | thefourthchime 6 hours ago | parent | next [-] |
| I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up often enough as it is. |
| |
| ▲ | overfeed 3 hours ago | parent | next [-] | | What were you using 6 months ago? | | |
| ▲ | withinboredom 3 hours ago | parent [-] | | Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6. | | |
| ▲ | hhh 39 minutes ago | parent [-] | | The models don’t change. | | |
| ▲ | esskay a minute ago | parent | next [-] | | Real-world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago, but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released. | |
| ▲ | tornikeo 33 minutes ago | parent | prev [-] | | On paper. There's a huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscribers. |
|
|
| |
| ▲ | rf15 3 hours ago | parent | prev [-] | | You cannot afford the SOTA. | | |
| ▲ | weird-eye-issue 3 hours ago | parent [-] | | Why is that? The $200 per month subscription comes with a ton of usage. Opus 4.6 is available on the $20 plan too | | |
| ▲ | aleph_minus_one 11 minutes ago | parent | next [-] | | > The $200 per month subscription comes with a ton of usage. 200 USD/month is a price that only really affluent programmers (e.g. in Silicon Valley) can pay easily. | | | |
| ▲ | komali2 2 hours ago | parent | prev [-] | | I'm starting to think that in these conversations we're often talking about two different things. You're talking about running an LLM service through its provided tooling (Codex, Claude, Cursor); others seem to be talking about token costs because they're integrating LLMs into software, or are using harness systems like opencode, pi, or openclaw and balancing tasks across models. | | |
| ▲ | weird-eye-issue 2 hours ago | parent | next [-] | | Fair enough, I read it quickly and assumed the person they replied to was talking about Claude Code. But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token-intensive as something like coding, so we are still able to offer it with a good profit margin. Also, you can run OpenClaw with your CC subscription. It's what I do. | |
| ▲ | BoorishBears an hour ago | parent | prev [-] | | I wrap Opus 4.5 in a consumer product with 0 economic utility, and people pay for it; I'm sure plenty of end users are willing to pay for it in their software. Edit: I'm not using the term of art; I mean it literally cannot make them money. | |
| ▲ | eru an hour ago | parent [-] | | > [...] in a consumer product with 0 economic utility and people pay for it, [...] Sorry, how do these two things go together? If people pay for it, it has economic utility, doesn't it? I mean, people pay to watch movies or play video games, too. |
|
|
|
|
|
|
| ▲ | XCSme 8 hours ago | parent | prev | next [-] |
| Yup, they do quite poorly on random non-coding tasks: https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo... |
| |
| ▲ | rmi_ an hour ago | parent | next [-] | | Wild benchmark. Opus 4.6 is ranked #29, and Gemini 3 Flash is #1, ahead of Pro. I'm not saying it's bad, but it's definitely different from the others. | | |
| ▲ | XCSme an hour ago | parent [-] | | The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page. | | |
| ▲ | BoorishBears an hour ago | parent [-] | | > It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly. Yuck. At that point, don't publish a benchmark; this explains why their results are useless too. - Edit, since I'm not able to reply to the comment below: "I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4. I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense. | | |
| ▲ | XCSme an hour ago | parent [-] | | Why not? I described this in more detail in other comments. Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, and parsing documents. Most models get this right. Also, this is just one failure mode of Claude. |
|
|
| |
| ▲ | usagisushi 5 hours ago | parent | prev | next [-] | | Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated. Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.) | |
| ▲ | wizee 6 hours ago | parent | prev | next [-] | | It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago. | | |
| ▲ | XCSme 5 hours ago | parent | next [-] | | I used Qwen 3.5 Plus in production; it was really good at instruction following and tool calling. |
| ▲ | 5 hours ago | parent | prev [-] | | [deleted] |
| |
| ▲ | comboy an hour ago | parent | prev [-] | | Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested. | |
| ▲ | XCSme an hour ago | parent [-] | | Oh, I didn't think about this; that's a good idea. I also feel that model performance generally changes over time (usually it gets worse). The problem with doing this is cost: constantly testing a lot of models on a large dataset can get really costly.
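[Ed.: the cost concern above can be mitigated by re-scoring only a small random sample of a fixed prompt set on each run and logging the trend. A rough sketch follows; the dataset and scoring function here are toy placeholders, where a real `score_fn` would call the model and grade its answer.]

```python
import random
from datetime import date

def sample_eval(dataset, score_fn, sample_size=25, seed=None):
    """Score a model on a random subset of a fixed eval set.

    Re-running ~25 of, say, 1000 prompts per day keeps API cost low
    while still exposing large drifts in model behavior over time.
    """
    rng = random.Random(seed)
    subset = rng.sample(dataset, min(sample_size, len(dataset)))
    scores = [score_fn(item) for item in subset]
    return {
        "date": date.today().isoformat(),
        "n": len(subset),
        "mean_score": sum(scores) / len(scores),
    }

# Toy stand-in: in practice score_fn would query the model under test.
dataset = [{"prompt": f"q{i}", "expected": i % 2} for i in range(1000)]
result = sample_eval(dataset, score_fn=lambda item: float(item["expected"]), seed=0)
print(result["n"], round(result["mean_score"], 2))
```

Appending each run's result to a log file gives a cheap time series per model, at roughly 1/40th the cost of re-running the full set every time.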
|
|
|
| ▲ | miroljub 29 minutes ago | parent | prev | next [-] |
> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence. I use MiniMax daily, mostly for coding tasks, mostly via pi-coding-agent. > The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. I don't care about token use; I pay per request on my cheap coding plan. I didn't notice slower outputs; it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models. > Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens. Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during intensive coding sessions. What I notice is that while Opus and Sonnet look better on synthetic benchmarks, it doesn't matter in the real world. I never put as much effort into coming up with a perfect problem spec as the ones in benchmarks have. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. But that's exactly what all those benchmarks are doing, and that's where Anthropic models shine in comparison to cheaper Chinese models. When it comes to the real world, where I put my half-baked thoughts in broken English into a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if it exists at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it. Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session: it gets stuck eventually, and a model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month. |
| |
| ▲ | tim-projects 8 minutes ago | parent [-] | | I've only been using free tokens for a year now: Gemini, until they dropped Pro, so I switched to MiniMax. Bit of a hurdle switching from gemini-cli to kilo-cli, but now I can't really see too much difference. If I were starting new projects I'd pay for a better model, but honestly I don't really know any different. I've never used Claude, and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good. When I hit issues with these lower models, I think hard about creating the right tooling, agnostic to the harness; maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works? |
|
|
| ▲ | victorbjorklund an hour ago | parent | prev | next [-] |
| Yeah, they are still useful, but not close to Claude or GPT. They work well for simple changes. I use a combo of MiniMax and Codex. |
|
| ▲ | m00x 3 hours ago | parent | prev | next [-] |
| Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend. They're all slop when the complexity is higher than a mid-tech intermediate engineer though. |
| |
| ▲ | Leynos 2 hours ago | parent | next [-] | | Kimi is surprisingly good at Rust. | |
| ▲ | dvt 3 hours ago | parent | prev [-] | | > They're all slop when the complexity is higher than a mid-tech intermediate engineer though. This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models. |
|
|
| ▲ | moffkalast an hour ago | parent | prev | next [-] |
| Kimi's been one of my go-to options lately, and it often outperforms both Claude and GPT at debugging, finding the actual problem immediately while the other two flail around drunkenly. It does have some kind of horrible context-consistency problem, though: if you ask it to rewrite something verbatim, it'll inject tiny random changes everywhere and potentially break it. That's something other SOTA models haven't done for at least two years now, and it's a real problem. I can't trust it to do a full rewrite, just diffs. |
| |
| ▲ | smokel 43 minutes ago | parent [-] | | And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings. | | |
| ▲ | moffkalast 40 minutes ago | parent [-] | | No tooling, just manual use. When doing these comparisons I give all the models all the data they need to figure out the problem, and paste the same thing into each, so it's a pretty even eval. I doubt Kimi would do well with most harnesses; its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there. |
|
|
|
| ▲ | mkw2000 4 hours ago | parent | prev | next [-] |
| I find Kimi to be very, very good; MiniMax, not so much. |
|
| ▲ | paulddraper 4 hours ago | parent | prev | next [-] |
| Agreed. They are the equivalent of frontier models from 8+ months ago. |
|
| ▲ | AbanoubRodolf 5 hours ago | parent | prev [-] |
| [dead] |