Opus 4.7 to 4.6 Inflation is ~45%

▲ Opus 4.7 to 4.6 Inflation is ~45% (tokens.billchambers.me)

284 points by anabranch 4 hours ago | 295 comments

▲ dakiol 2 hours ago | parent | next [-]

We dropped Claude. It's pretty clear this is a race to the bottom, and we don't want a hard dependency on another multi-billion dollar company just to write software

We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we can come up with something like the "linux/postgres/git/http/etc" of LLMs: something we can all benefit from without it being monopolized by a single billionaire company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.

▲ ahartmetz 2 hours ago | parent | next [-]

>we don't want a hard dependency on another multi-billion dollar company just to write software

One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.

▲

post-it 2 hours ago | parent | next [-]

I was worried about skill atrophy. I recently started a new job, and from day 1 I've been using Claude. 90+% of the code I've written has been with Claude. One of the earlier tickets I was given was to update the documentation for one of our pipelines. I used Claude entirely, starting with having it generate a very long and thorough document, then opening up new contexts and getting it to fact check until it stopped finding issues, and then having it cut out anything that was granular/one query away. And then I read what it had produced.

It was an experiment to see if I could enter a mature codebase I had zero knowledge of, look at it entirely through an AI, and come to understand it.

And it worked! Even though I've only worked on the codebase through Claude, whenever I pick up a ticket nowadays I know what file I'll be editing and how it relates to the rest of the code. If anything, I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding.

▲

estetlinus an hour ago | parent | next [-]

Yeah, +1. I will never be working on unsolved problems anyhow. Skill atrophy is not happening if you stay curious and responsible.

▲

stringfood an hour ago | parent | next [-]

I have never learned so quickly in my entire life as by posting a forum thread in its entirety into an extended-thinking LLM and then being allowed to ask free-form questions for 2 hours straight if I want to. Having my questions answered NOW is so important for me to learn. Back in the day, by the time I found the answer online I had forgotten the question.

	▲	lobf a few seconds ago | parent [-]
		Same. I work in the film industry, but I’ve always been interested in computers and have enjoyed tinkering with them since I was about 5. However, coding has always been this insurmountably complicated thing: every time I make an effort to learn, I’m confronted with concepts that are difficult for me to understand and process. I’ve been 90% vibe coding for a year or so now, and I’ve learned so much about networking just from spinning up a bunch of docker containers and helping GPT or Claude fix niggling issues. I essentially have an expert (well, maybe not an expert, but an entity far more capable than I am on my own) whose shoulder I can look over, whom I can ask as many questions as I want, and who will explain every step of the process to me if I want. I’m finally able to create things on my computer that I’ve been dreaming about for years.

▲

idopmstuff an hour ago | parent | prev | next [-]

Some people talk like skill atrophy is inevitable when you use LLMs, which strikes me as pretty absurd given that you are talking about a tool that will answer an infinite number of questions with infinite patience.

I usually learn way more by having Claude do a task and then quizzing it about what it did than by figuring out how to do it myself. When I have to figure out how to do the thing, it takes much more time, so when I'm done I have to move on immediately. When Claude does the task in ten minutes I now have several hours I can dedicate entirely to understanding.

▲

onemoresoop an hour ago | parent [-]

You lose some, you win some. The short-term win could be much higher; however, imagine that the new tool suddenly gets rug-pulled from under your feet. What do you do then? Do you still know how to handle it the old way, or do you run into skill atrophy issues? I’m using Claude/Codex as well, but I’m a little worried that the environment we work in will become a lot more bumpy and shifty.

▲

visarga an hour ago | parent [-]

> the new tool suddenly gets rug-pulled from under your feet

If that happened at this point, it would be after societal collapse.

	▲	onemoresoop 34 minutes ago | parent [-]
		I don’t even wanna think about that scenario; maybe it gets averted somehow.

▲

bdangubic 33 minutes ago | parent | prev [-]

I used to speak Russian like I was born in Russia. I stopped speaking Russian … every day I am curious and responsible, but I can hardly say 10 words in Russian today. If you don’t use it (not just be curious and responsible) you will lose it, period.

	▲	thih9 3 minutes ago | parent [-]
		A programming language is not just syntax, keywords and standard libraries, but also processes, best practices and design principles. The latter group, I guess, is more difficult to learn and harder to forget.

▲

SpicyLemonZest an hour ago | parent | prev [-]

Are you sure you would know if it didn't work? I use Claude extensively myself, so I'm not saying this from a "hater" angle, but I had 2 people last week who believe themselves to be in your shoes send me pull requests which made absolutely no sense in the context of the codebase.

	▲	therealdrag0 an hour ago | parent [-]
		That’s always been the case, AI or not.

▲

ljm 2 hours ago | parent | prev | next [-]

Not so much atrophy as apathy.

I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something. Might even push back. Be proud of their ignorance.

It's like, why even review that PR in the first place if you don't even know what you're working with?

▲

psygn89 2 hours ago | parent | next [-]

I cringed when I saw a dev literally copy and paste an AI's response to a concern. The concern was one that had layers and implications to it, but instead of getting an answer as to why it was done a certain way and to allay any potential issues, that dev got a two paragraph lecture on how something worked on the surface of it, wrapped in em dashes and joviality.

A good dev would've read deeper into the concern and maybe noticed potential flaws, and if he had his own doubts about what the concern was about, would have asked for more clarification. Not just feed a concern into AI and fling it back. Like please, in this day and age of AI, have the benefit of the doubt that someone with a concern would have checked with AI himself if he had any doubts of his own concern...

▲

oremj an hour ago | parent | prev | next [-]

Is this the same subset of people who copy/paste code directly from Stack Overflow without understanding it? I’m not sure this is a new problem.

	▲	pizza234 an hour ago | parent | next [-]
		In my experience, no - I think the ability to build more complete features with less/little/no effort, rather than isolated functions, is (more) appealing to (more) developers.
	▲	dingaling 10 minutes ago | parent | prev | next [-]
		It's difficult to copy & paste an entire app from Stack Overflow
	▲	foobarchu an hour ago | parent | prev | next [-]
		It's a new problem in the sense that now executive management at many (if not most) software companies is pushing for all employees to work this way as much as possible. Those same people probably don't know what Stack Overflow even is.
	▲	malnourish an hour ago | parent | prev | next [-]
		I don't think so. I'll spend a ton of time and effort thinking through, revising, and planning out the approach, but I let the agent take the wheel when it comes to transpiling that to code. I don't actually care about the code so long as it's secure and works. I spent years cultivating expertise in C++ and .NET. And I found that time both valuable and enjoyable. But that's because it was a path to solve problems for my team, give guidance, and do so with both breadth and depth. Now I focus on problems at a higher level of abstraction. I am certain there's still value in understanding ownership semantics and using reflection effectively, but they're broadly less relevant concerns.
	▲	sroussey 15 minutes ago | parent | prev | next [-]
		Copied and pasted without noting the license that Stack Overflow has on code published there, no doubt
	▲	trinsic2 an hour ago | parent | prev [-]
		Hey. I resemble that remark sometimes!! Quit being a hater (sarcasm) :P

▲

kilroy123 2 hours ago | parent | prev | next [-]

We've had such developers around, long before LLMs.

	▲	ohazi an hour ago | parent [-]
		They're so much louder now, though.

▲

RexM an hour ago | parent | prev | next [-]

It’s a lot like someone bragging that they’re bad at math tossing around equations.

▲

monkpit an hour ago | parent | prev | next [-]

If I wanted to know what the LLM says, I would have asked it myself, thanks…

▲

redanddead an hour ago | parent | prev [-]

What is it in the broader culture that's causing this?

▲

mattgreenrocks an hour ago | parent | next [-]

These people have always existed. Hell, they are here, too. Now they have a new thing to delegate responsibility to.

And no, I don't understand them at all. Taking responsibility for something, improving it, and stewarding it into production is a fantastic feeling, and much better than reading the comment section. :)

▲

groundzeros2015 an hour ago | parent | prev [-]

People who got into the job but don’t really like programming.

	▲	drivebyhooting an hour ago | parent [-]
		I like programming, but I don’t like the job.

▲

tossandthrow 2 hours ago | parent | prev | next [-]

You can argue that you will have skill atrophy by not using LLMs.

We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.

I am learning at an incredible rate with LLMs.

▲

mgambati 2 hours ago | parent | next [-]

I kind of feel the same. I’m learning things and doing things in areas that I would otherwise just skip due to lack of time or fear.

But I’m so much more detached from the code. I don’t feel that ‘deep neural connection’ that comes from actually spending days locked in a refactor or debugging a really complex issue.

I don’t know how I feel about it.

▲

Fire-Dragon-DoL 2 hours ago | parent | next [-]

I strongly agree on the refactor, but for debugging I have another perspective: I think debugging is changing for the better, so it looks different.

Sure, you don't know the code by heart, but people debugging code translated to assembly already do that.

The big difference is being able to unleash scripts that very quickly invalidate an enormous number of hypotheses and that can analyze the data.

Doing that by hand used to take hours, so it was a last-resort approach. Now it's very cheap, so validating many hypotheses is far cheaper!

I feel like my "debugging ability" in terms of value delivered has gone way up. As for skill, it's changing, and I cannot tell yet, but the value I am delivering in debugging sessions has gone way up.

▲

afzalive 2 hours ago | parent | prev [-]

Speaking as someone who switched from mobile to web dev professionally 6 months ago: if you care about code quality, you'll develop that neural connection after some time.

But if you don't and there's no PR process (side projects), the motivation to form that connection is quite low.

	▲	hombre_fatal 37 minutes ago | parent [-]
		> If you care about code quality, you'll develop that neural connection after some time. No, because you can get LLMs to produce high quality code that has gone through an infinite number of refinement/polish cycles and is far more exhaustive than the code you would have written yourself. Once you hit that point, you find yourself in a directional/steering position divorced from the code since no matter what direction you take, you'll get high quality code.

▲

ori_b 2 hours ago | parent | prev | next [-]

Yes, you certainly can argue that, but you'd be wrong. The primary selling point of LLMs is that they solve the problem of needing skill to get things done.

▲

tossandthrow 2 hours ago | parent | next [-]

That is not the entire selling point - so you are very wrong.

You very much decide how you employ LLMs.

Nobody is holding a gun to your head to make you use them in a certain way.

So if you use them in a way that increases your inherent risk, then you are incredibly wrong.

▲

ori_b 2 hours ago | parent [-]

I suggest you read the sales pitches that these products have been making. Again, when I say that this is the selling point, I mean it: This is why management is buying them.

▲

SpicyLemonZest 2 hours ago | parent | next [-]

I've read the sales pitches, and they're not about replacing the need for skill. The Claude Design announcement from yesterday (https://www.anthropic.com/news/claude-design-anthropic-labs) is pretty typical in my experience. The pitch is that this is good for designers, because it will allow them to explore a much broader range of ideas and collaborate on them with counterparties more easily. The tool will give you cool little sliders to set the city size and arc width, but it doesn't explain why you would want to adjust these parameters or how to determine the correct values; that's your job.

I understand why a designer might read this post and not be happy about it. If you don't think your management values or appreciates design skill, you'd worry they're going to glaze over the bullet points about design productivity, and jump straight to the one where PMs and marketers can build prototypes and ignore you. But that's not what the sales pitch is focused on.

	▲	ori_b an hour ago | parent [-]
		The majority of examples describe 'person without <skill> can do thing needing <skill>'.

▲

trinsic2 an hour ago | parent | prev [-]

Sales pitches don't mean jack. WTF are you talking about?

	▲	foobarchu 44 minutes ago | parent [-]
		Sales pitches are literally the same thing as "the selling point". Neither of those is necessarily a synonym for why you personally use them

▲

andy_ppp 2 hours ago | parent | prev | next [-]

I see it completely the opposite way, you use an LLM and correct all its mistakes and it allows you to deliver a rough solution very quickly and then refine it in combination with the AI but it still gets completely lost and stuck on basic things. It’s a very useful companion that you can’t trust, but it’s made me 4-5x more productive and certainly less frustrated by the legacy codebase I work on.

▲

trinsic2 an hour ago | parent | prev | next [-]

Yeah, I wholeheartedly disagree with this. Because I understand the basics of coding, I can understand where the model gets stuck and prompt it in other directions.

If you don't know what's going on through the whole process, good luck with the end product.

▲

Forgeties79 2 hours ago | parent | prev [-]

They purportedly solve the problem of needing skill to get things done. IME, this is usually repeated by VC backed LLM companies or people who haven’t knowingly had to deal with other people’s bad results.

This all bumps up against the fact that most people default to “you use the tool wrong” and/or “you should only use it to do things where you already have firm grasp or at least foundational knowledge.”

It also bumps up against the fact that the average person is using LLMs as a replacement for standard Google search.

▲

Wowfunhappy 30 minutes ago | parent | prev | next [-]

> We have gone multi cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.

That’s product atrophy, not skill atrophy.

▲

weego 37 minutes ago | parent | prev | next [-]

You're learning at your standard rate of learning, you're just feeding yourself over-confidence on how much you're absorbing vs what the LLM is facilitating you rolling out.

	▲	tossandthrow 12 minutes ago | parent [-]
		This is such a weird statement on so many levels. The latent assumption here is that learning is zero sum. That you can take a 30 year old from 1856, bring them into the present day, and they will learn whatever subject as fast as a present-day 20 year old. That teachers don't matter. That engagement doesn't matter. Learning is not zero sum. Some cultural backgrounds make learning easier, some mentoring makes it easier, and some techniques increase engagement in ways that increase learning speed.

▲

jjallen 2 hours ago | parent | prev | next [-]

Also AI could help you pick those skills up again faster, although you wouldn’t need to ever pick those skills up again unless AI ceased to exist.

What an interesting paradox-like situation.

	▲	estetlinus an hour ago | parent [-]
		I believe some professor warned us about being over-reliant on Google/Reddit etc.: the “how would you be productive if the internet went down” dilemma. Well, if the internet is down, so is our revenue, buddy. Engineering throughput would be the last of our concerns.

▲

deadbabe 2 hours ago | parent | prev | next [-]

Using LLMs as a learning tool isn’t what causes skill atrophy. It’s using them to solve entire problems without understanding what they’ve done.

And not even just understanding, but verifying that they’ve implemented the optimal solution.

▲

i_love_retros 2 hours ago | parent | prev | next [-]

>I am learning at an incredible rate with LLMs.

I don't believe it. Having something else do the work for you is not learning, no matter how much you tell yourself it is.

▲

margalabargala 2 hours ago | parent | next [-]

If you've seen further it's only because you've stood on the shoulders of giants.

Having other people do work for you is how people get to focus on things they actually care about.

Do you use a compiler you didn't write yourself? If so can you really say you've ever learned anything about computers?

	▲	butterisgood an hour ago | parent [-]
		You have to build a computer to learn about computers!

▲

tossandthrow 2 hours ago | parent | prev [-]

It is easy to not believe if you only apply an incredibly narrow world view.

Open your eyes, and you might become a believer.

▲

nothinkjustai 2 hours ago | parent [-]

What is this, some sort of cult?

▲

subscribed 11 minutes ago | parent | next [-]

You mean the cult of "I can't see the viruses, therefore they don't exist"? As in "I can't imagine something, so it means it's a lie"?

Indeed, quite weird and no imagination.

▲

tossandthrow an hour ago | parent | prev [-]

No, it is an equally snarky response to a person being snarky about the usefulness of AI agents.

It does seem like there is a cult of people who categorically see LLMs as being poor at anything, without that view being founded in any experience other than the afternoon in 2023 they spent playing around with one.

▲

nothinkjustai 32 minutes ago | parent [-]

Who cares? Why are people so invested in trying to “convert” others to see the light?

Can’t you be satisfied with outcompeting “non believers”? What motivates you to argue on the internet about it? Deep down are you insecure about your reliance on these tools or something, and want everyone else to be as well?

	▲	tossandthrow 10 minutes ago | parent [-]
		Why do people invest themselves so hard in interjecting themselves into conversations about AI to tell people it doesn't work? It feels so off to be rebuilding serious SaaS apps in days for production, only to be told it is not possible.

▲

bluefirebrand 2 hours ago | parent | prev [-]

> I am learning at an incredible rate with LLMs

Could you do it again without the help of an LLM?

If no, then can you really claim to have learned anything?

▲

danw1979 2 hours ago | parent | next [-]

I think this is a bit dismissive.

It’s quite possible to be deep into solving a problem with an LLM guiding you where you’re reading and learning from what it says. This is not really that different from googling random blogs and learning from Stack Overflow.

Assuming everyone just sits there dribbling whilst Claude is in YOLO mode isn’t always correct.

▲

subscribed an hour ago | parent | prev | next [-]

>> I am learning a new skill with instructor at an incredible rate

> Could you do it again on your own?

Can you see how nonsensical your stance is? You're either straight up accusing GP of lying about learning something at an increased rate, OR suggesting that if they couldn't learn it on their own, presumably at the same rate, they're not learning anything.

It's not very wise to project your own experiences onto others.

	▲	sroussey 3 minutes ago | parent [-]
		Actually, it’s much like taking a physics or engineering course, being fully able to explain that day's material right after the class, and yet realizing later, when you are doing the homework, that you did not actually understand it as fully as you thought you did.

▲

tossandthrow 2 hours ago | parent | prev | next [-]

I could definitely maintain the infrastructure without an llm. Albeit much slower.

And yes. If LLMs disappear, then we need to hire a lot of people to maintain the infrastructure.

Which naturally is a part of the risk modeling.

▲

bluefirebrand an hour ago | parent [-]

> I could definitely maintain the infrastructure without an llm

Not what I asked, but thanks for playing.

	▲	tossandthrow 9 minutes ago | parent [-]
		You literally asked that question: > Could you do it again without the help of an LLM?

▲

_blk 2 hours ago | parent | prev [-]

The challenge is not if you could do all of it without AI but any of it that you couldn't before.

Not everyone learns at the same pace and not everyone has the same fault tolerance threshold. In my experience some people are what I call "Japanese learners", perfecting by watching. They will learn with AI but would never do it themselves, out of fear of getting something wrong, even though they understand most of it. Others, whom I call "western learners", will start right away and "get their hands dirty" without much knowledge, and also get it wrong right away. Both are valid learning strategies fitting different personalities.

▲

solarengineer an hour ago | parent | prev [-]

https://hex.ooo/library/power.html

When future humans rediscover mathematics.

▲ leonidasv 21 minutes ago | parent | prev | next [-]

>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs

I fear that this may not be feasible in the long term. The open-model free ride is not guaranteed to continue forever; some labs offer them for free for publicity after receiving millions in VC grants now, but that's not a sustainable business model. Models cost millions/billions in infrastructure to train. It's not like open-source software where people can just volunteer their time for free; here we are talking about spending real money upfront, for something that will get obsolete in months.

Current AI model "production" is more akin to an industrial endeavor than open-source arrangements we saw in the past. Until we see some breakthrough, I'm bearish on "open models will eventually save us from reliance on big companies".

▲ dgellow 2 hours ago | parent | prev | next [-]

Another aspect I haven’t seen discussed much: suppose your competitor is 10x more productive with AI, and to stay relevant you also use AI and become 10x more productive. Does the business actually grow enough to justify the extra expense? Or are you pretty much in the same state as you were without AI, but you are both paying an AI tax to stay relevant?

▲

xixixao an hour ago | parent | next [-]

This is the “ad tax” reasoning, but ultimately I think the answer is greater efficiency. So there is a real value, even if all competitors use the tools.

It’s like saying clothing manufacturers are paying the “loom tax” when they could have been weaving by hand…

▲

SlinkyOnStairs an hour ago | parent | next [-]

Software development is not a production line, the relationship between code output and revenue is extremely non-linear.

Where producing 2x the t-shirts will get you ~2x the revenue, it's quite unlikely that 10x the code will get you even close to 2x revenue.

With how much of this industry operates on 'Vendor Lock-in' there's a very real chance the multiplier ends up 0x. AI doesn't add anything when you can already 10x the prices on the grounds of "Fuck you. What are you gonna do about it?"

	▲	groundzeros2015 an hour ago | parent [-]
		Yep, and in a vendor lock-in scenario, fixing deep bugs or making additions in surgical ways is where the value is. And Claude helps you do that, by giving you more information and analyzing options, but it doesn’t let you make that decision 10x faster.

▲

bigbadfeline 28 minutes ago | parent | prev [-]

We already know how to multiply the efficiency of human intelligence to produce better quality than LLMs and nearly match their productivity - open source - in fact coding LLMs wouldn't even exist without it.

Open source libraries and projects together with open source AI is the only way to avoid the existential risks of closed source AI.

▲

dakiol 13 minutes ago | parent | prev | next [-]

Where's the evidence of competitors being 10x more productive? So far, everyone is simply bragging about how much code they have shipped last week, but that has zero relevance when it comes to productivity

▲

redanddead an hour ago | parent | prev | next [-]

The alternative is probably also true. If your F500 competitor is also handicapped by AI somehow, then you're all stagnant, maybe at different levels. Meanwhile Anthropic is scooping up software engineers it supposedly made irrelevant with Mythos and moving into literally 2+ new categories per quarter

▲

JambalayaJimbo an hour ago | parent | prev | next [-]

If the business doesn’t grow then you shed costs like employees

▲

Lihh27 an hour ago | parent | prev | next [-]

it's worse than a tie. 10x everyone just floods the market and tanks per-unit price. you pay the AI tax and your output is worth less.

▲

senordevnyc an hour ago | parent | prev [-]

Either the business grows, or the market participants shed human headcount to find the optimal profit margin. Isn’t that the great unknown: what professions are going to see headcount reduction because demand can’t grow that fast (like we’ve seen in agriculture), and which will actually see headcount stay the same or even expand, because the market has enough demand to keep up with the productivity gains of AI? Increasingly I think software writ large is the latter, but individual segments in software probably are the former.

▲ somewhereoutth 2 minutes ago | parent | prev | next [-]

My understanding is that the major part of the cost of a given model is the training - so open models depend on the training that was done for frontier models? I'm finding it hard to imagine (e.g.) RLHF being fundable through a free software type arrangement.

▲ dewarrn1 2 hours ago | parent | prev | next [-]

I'm hopeful that new efficiencies in training (Deepseek et al.), the impressive performance of smaller models enhanced through distillation, and a glut of past-their-prime-but-functioning GPUs all converge to make good-enough open/libre models cheap, ubiquitous, and less resource-intensive to train and run.

▲ michaelje an hour ago | parent | prev | next [-]

Open models keep closing the eval gap for many tasks, and local inference continues to be increasingly viable. What's missing isn't technical capability, but productized convenience that makes the API path feel like the only realistic option.

Frontier labs are incentivized to keep it that way, and they're investing billions to make AI = API the default. But that's a business model, not a technical inevitability.

▲ tossandthrow 2 hours ago | parent | prev | next [-]

The lock-in is so incredibly poor. I could switch to whatever provider in minutes.

But it requires that one does not do something stupid.

E.g., for recurring tasks: keep the task specification in the source code and just ask Claude to execute it.

The same with all documentation, etc.

▲ aliljet 2 hours ago | parent | prev | next [-]

What open models are truly competing with both Claude Code and Opus 4.7 (xhigh) at this stage?

▲

parinporecha an hour ago | parent | next [-]

I've had a good experience with GLM-5.1. Sure it doesn't match xhigh but comes close to 4.6 at 1/3rd the cost

▲

esafak an hour ago | parent | prev | next [-]

GLM 5.1 competes with Sonnet. I'm not confident about Opus, though they claim it matches that too.

	▲	ojosilva 42 minutes ago | parent [-]
		I have it as failover to Opus 4.6 in a Claude proxy internally. People don't notice a thing when it triggers, maybe a failed tool call here and there (harness remains CC not OC) or a context window that has gone over 200k tokens or an image attachment that GLM does not handle, otherwise hunky-dory all the way. I would also use it as permanent replacement for haiku at this proxy to lower Claude costs but have not tried it yet. Opus 4.7 has shaken our setup badly and we might look into moving to Codex 100% (GLM could remain useful there too).
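		For anyone curious what that kind of failover routing might look like, here is a minimal sketch in Python. It is not the commenter's actual proxy: the names `Request`, `fallback_can_handle` and `route` are hypothetical, and the only details taken from the comment are the two cases where the fallback struggles (context over 200k tokens, image attachments).

```python
from dataclasses import dataclass

# Hypothetical threshold, taken from the "over 200k tokens" remark above.
FALLBACK_CONTEXT_LIMIT = 200_000


@dataclass
class Request:
    """A simplified request; real proxies carry far more state."""
    prompt_tokens: int
    has_image: bool = False


def fallback_can_handle(req: Request) -> bool:
    """True if the fallback model could plausibly serve this request."""
    return req.prompt_tokens <= FALLBACK_CONTEXT_LIMIT and not req.has_image


def route(req: Request, primary_healthy: bool) -> str:
    """Choose a backend label: primary when healthy, else fallback if viable."""
    if primary_healthy:
        return "primary"
    if fallback_can_handle(req):
        return "fallback"
    # Neither backend can serve the request; surface an error to the caller.
    return "reject"
```

		With these assumptions, `route(Request(prompt_tokens=1_000), primary_healthy=False)` picks the fallback, while an oversized or image-bearing request is rejected instead of being silently degraded.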

▲

Someone1234 2 hours ago | parent | prev [-]

That's a lame attitude. There are local models that are last year's SOTA, but that's not good enough because this year's SOTA is even better yet still...

I've said it before and I'll say it again, local models are "there" in terms of true productive usage for complex coding tasks. Like, for real, there.

The issue right now is that buying the compute to run the top end local models is absurdly unaffordable. Both in general but also because you're outbidding LLM companies for limited hardware resources.

You have a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't, nor does it make cost effective sense Vs. currently subsidized API costs.

▲

HWR_14 3 minutes ago | parent | next [-]

$10k is a lot of tokens.

▲

gbro3n an hour ago | parent | prev | next [-]

I completely see your point, but when my / developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can. I would love to be able to say I have X technique for compensating for the model shortfall, but my experience so far has been that bigger, later models outperform older, smaller ones. I genuinely hope this changes, though. I understand the investment that it has taken to get us to this point, but intelligence doesn't seem like something that should be gated.

▲

Someone1234 an hour ago | parent | next [-]

Right; but every major generation has had diminishing returns on the last. Two years ago the difference was HUGE between major releases, and now we're discussing Opus 4.6 Vs. 4.7 and people cannot seem to agree if it is an improvement or regression (and even their data in the card shows regressions).

So my point is: if you have the attitude that unless it is the bleeding edge it may as well not exist, then local models are never going to be good enough. But the truth is they now well exceed what they need to be to serve as huge productivity tools, and they would have been bleeding edge fairly recently.

	▲	gbro3n an hour ago | parent [-]
		I feel like I'm going to have to try the next model. For a few cycles yet. My opinion is that Opus 4.7 is performing worse for my current work flow, but 4.6 was a significant step up, and I'd be getting worse results and shipping slower if I'd stuck with 4.5. The providers are always going to swear that the latest is the greatest. Demis Hassabis recently said in an interview that he thinks the better funded projects will continue to find significant gains through advanced techniques, but that open source models figure out what was changed after about 6 months or so. We'll see I guess. Don't get me wrong, I'd love to settle down with one model and I'd love it to be something I could self host for free.

▲

dakiol 11 minutes ago | parent | prev [-]

> I completely see your point, but when my / developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can.

Don't you understand that by choosing the best model we can, we are, collectively, step by step devaluing what our time is worth? Do you really think we can all keep our fancy paychecks while we keep using AI?

▲

aliljet an hour ago | parent | prev | next [-]

First, making sure to offer an upvote here. I happen to be VERY enthusiastic about local models, but I've found them to be incredibly hard to host, incredibly hard to harness, and, despite everything, remarkably powerful if you are willing to suffer really poor token/second performance...

▲

wellthisisgreat 36 minutes ago | parent | prev [-]

> that are last year's SOTA

Early last year or late last year?

opus 4.5 was quite a leap

▲ GaryBluto 2 hours ago | parent | prev | next [-]

> open models

Google just released Gemma 4, perhaps that'd be worth a try?

▲ giancarlostoro 33 minutes ago | parent | prev | next [-]

> I think that's the way forward. Actually it would be great if everybody would put more focus on open models,

I'm still surprised top CS schools are not investing in having their students build models, I know some are, but like, when's the last time we talked about a model not made by some company, versus a model made by some college or university, which is maintained by the university and useful for all.

It's disgusting that OpenAI still calls itself "Open AI" when they aren't truly open.

▲ ben8bit 2 hours ago | parent | prev | next [-]

Any recommendations on good open ones? What are you using primarily?

▲ culi 2 hours ago | parent | next [-]

LMArena actually has a nice Pareto frontier plot of Elo vs. price for this

  model                        elo   $/M
  ---------------------------------------
  glm-5.1                      1538  2.60
  glm-4.7                      1440  1.41
  minimax-m2.7                 1422  0.97
  minimax-m2.1-preview         1392  0.78
  minimax-m2.5                 1386  0.77
  deepseek-v3.2-thinking       1369  0.38
  mimo-v2-flash (non-thinking) 1337  0.24

https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...
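
For anyone who wants to check which rows of a score/price table like the one above actually sit on the Pareto frontier (no other model is both at least as strong and at least as cheap), here is a minimal Python sketch. The rows are copied from the table; the function name `pareto_frontier` is an illustrative choice and nothing here comes from LMArena's own tooling.

```python
# Rows copied from the table above: (model, Elo, price in $ per million tokens).
ROWS = [
    ("glm-5.1", 1538, 2.60),
    ("glm-4.7", 1440, 1.41),
    ("minimax-m2.7", 1422, 0.97),
    ("minimax-m2.1-preview", 1392, 0.78),
    ("minimax-m2.5", 1386, 0.77),
    ("deepseek-v3.2-thinking", 1369, 0.38),
    ("mimo-v2-flash (non-thinking)", 1337, 0.24),
]


def pareto_frontier(rows):
    """Return the rows that no other row dominates, sorted by price.

    A row is dominated when some other row has Elo >= its Elo and
    price <= its price, with at least one of the two strictly better.
    """
    frontier = []
    for name, elo, price in rows:
        dominated = any(
            (e >= elo and p <= price) and (e > elo or p < price)
            for _, e, p in rows
        )
        if not dominated:
            frontier.append((name, elo, price))
    return sorted(frontier, key=lambda row: row[2])


if __name__ == "__main__":
    for name, elo, price in pareto_frontier(ROWS):
        print(f"{name:30s} {elo:5d} {price:5.2f}")
```

In this table Elo and price fall together row by row, so every listed model is on the frontier; a model that scored lower than a cheaper one would be filtered out.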

	▲	logicprog 12 minutes ago | parent [-]
		LMArena isn't very useful as a benchmark, however I can vouch for the fact that GLM 5.1 is astonishingly good. Several people I know who have a $100/mo Claude Code subscription are considering cancelling it and going all in on GLM, because it's finally gotten (for them) comparable to Opus 4.5/6. I don't use Opus myself, but I can definitely say that the jump from the (imvho) previous best open weight model Kimi K2.5 to this is otherworldly — and K2.5 was already a huge jump itself!

▲ blahblaher 2 hours ago | parent | prev | next [-]

qwen3.5/3.6 (30B) works well locally with opencode

▲

zozbot234 2 hours ago | parent | next [-]

Mind you, a 30B model (3B active) is not going to be comparable to Opus. There are open models that are near-SOTA but they are ~750B-1T total params. That's going to require substantial infrastructure if you want to use them agentically, scaled up even further if you expect quick real-time response for at least some fraction of that work. (Your only hope of getting reasonable utilization out of local hardware in single-user or few-users scenarios is to always have something useful cranking in the background during downtime.)

▲

pitched 2 hours ago | parent | next [-]

For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.

	▲	zozbot234 2 hours ago | parent | next [-]
		It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.
	▲	DeathArrow an hour ago | parent | prev [-]
		Since you need at least a few H100-class GPUs, I guess you need at least a few tens of coders to justify the costs.

▲

wuschel an hour ago | parent | prev | next [-]

What near SOTA open models are you referring to?

▲

cyberax an hour ago | parent | prev [-]

I'm backing up a big dataset onto tapes, so I wanted to automate it. I have an idle 64GB VRAM setup in my basement, so I decided to experiment and tasked it with writing an LTFS implementation. LTFS is an open standard for filesystems for tapes, and there's an implementation in C that can be used as the baseline.

So far, Qwen 3.6 created a functionally equivalent Golang implementation that works against the flat file backend within the last 2 days. I'm extremely impressed.

▲

pitched 2 hours ago | parent | prev | next [-]

I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.

	▲	zozbot234 2 hours ago | parent | next [-]
		The Codex TUI harness is also open source and you can use open models with it, so you can stay in even more familiar territory.
	▲	pwython 2 hours ago | parent | prev [-]
		pi-coding-agent (pi.dev) is also great. I've been using it with Gemma 4 and Qwen 3.6.

▲

jherdman 2 hours ago | parent | prev | next [-]

Is this sort of setup tenable on a consumer MBP or similar?

▲

danw1979 2 hours ago | parent | next [-]

Qwen’s 30B models run great on my MBP (M4, 48GB) but the issue I have is cooling - the fan exhaust is straight onto the screen, which I can’t help thinking will eventually degrade it, given the thermal cycling it would go through. A Mac Studio makes far more sense for local inference just for this reason alone.

▲

pitched 2 hours ago | parent | prev [-]

For a 30B model, you want at least 20GB of VRAM and a 24GB MBP can’t quite allocate that much of it to VRAM. So you’d want at least a 32GB MBP.
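
A rough back-of-the-envelope version of that sizing, as a sketch only. The 4-bit quantization, the 5 GB allowance for KV cache and runtime overhead, and the 70% share of unified memory usable by the GPU are all assumptions picked for illustration, not measured values or documented limits.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(
    unified_memory_gb: float,
    params_billion: float,
    bits_per_weight: float = 4,   # assumption: 4-bit quantized weights
    overhead_gb: float = 5.0,     # assumption: KV cache + runtime overhead
    gpu_fraction: float = 0.7,    # assumption: share of unified memory the GPU may use
) -> bool:
    """Return True if the model plausibly fits in GPU-addressable memory."""
    needed = weight_memory_gb(params_billion, bits_per_weight) + overhead_gb
    available = unified_memory_gb * gpu_fraction
    return needed <= available


if __name__ == "__main__":
    for machine_gb in (24, 32, 48):
        print(machine_gb, "GB:", "fits" if fits(machine_gb, 30) else "does not fit")
```

Under these assumptions a 30B model needs about 15 GB for weights plus overhead, which lands near the 20 GB figure above; changing the quantization or context length moves the answer considerably.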

▲

richardfey an hour ago | parent | next [-]

I have 24GB VRAM available and haven't yet found a decent model or combination. Last one I tried is Qwen with continue, I guess I need to spend more time on this.

▲

zozbot234 2 hours ago | parent | prev | next [-]

It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.

	▲	pitched 2 hours ago | parent [-]
		I haven’t tried this myself yet but you would still need enough non-vram ram available to the cpu to offload to cpu, right? This is a fully novice question, I have not ever tried it.

▲

_blk an hour ago | parent | prev [-]

Is there any model that practically compares to Sonnet 4.6 in code and vision and runs on home-grade (12G-24G) cards?

▲

cpursley 2 hours ago | parent | prev [-]

How are you running it with opencode, any tips/pointers on the setup?

▲ DeathArrow an hour ago | parent | prev | next [-]

I am using GLM 5.1 and MiniMax 2.7.

▲ cmrdporcupine 2 hours ago | parent | prev [-]

GLM 5.1 via an infra provider. Running a competent coding capable model yourself isn't viable unless your standards are quite low.

▲

myaccountonhn an hour ago | parent [-]

What infra providers are there?

	▲	elbear an hour ago | parent [-]
		There's DeepInfra. There's also OpenRouter where you can find several providers.

▲ Frannky an hour ago | parent | prev | next [-]

Opencode go with open models is pretty good

▲ sergiotapia an hour ago | parent | prev | next [-]

I can recommend this stack. It works well with the existing Claude skills I had in my code repos:

1. Opencode

2. Fireworks AI: GLM 5.1

And it is SIGNIFICANTLY cheaper than Claude. I'm waiting eagerly for something new from Deepseek. They are going to really show us magic.

▲

dirasieb an hour ago | parent [-]

it is also significantly less capable than claude

	▲	dakiol 8 minutes ago | parent [-]
		That's fine. When the "best of the best" is offered only by a couple of companies that are not looking into our best interests, then we can discard them

▲ i_love_retros 2 hours ago | parent | prev | next [-]

> we don't want a hard dependency on another multi-billion dollar company just to write software

My manager doesn't even want us to use copilot locally. Now we are supposed to only use the GitHub copilot cloud agent. One shot from prompt to PR. With people like that selling vendor lock in for them these companies like GitHub, OpenAI, Anthropic etc don't even need sales and marketing departments!

▲

tossandthrow 2 hours ago | parent [-]

You are aware that using eg. Github copilot is not one shot? It will start an agentic loop.

▲

dgellow 2 hours ago | parent [-]

Unnecessary nitpicking

	▲	tossandthrow an hour ago | parent [-]
		Why? One-shotting has a very specific meaning, and agentic workflows are not it? What is the implied meaning I should understand from them using one shot? They might refer to the lack of humans in the loop.

▲ DeathArrow an hour ago | parent | prev | next [-]

>perhaps we can come up with something like the "linux/postgres/git/http/etc" of the LLMs: something we all can benefit from while it not being monopolized by a single billionarie company

Training and inference cost money, so we would have to pay for them.

	▲	groundzeros2015 an hour ago | parent [-]
		Developing linux/postgres/git also costs, and so do the computers and electricity they use.

▲ SilverElfin an hour ago | parent | prev [-]

Is that why they are racing to release so many products? It feels to me like they want to suck up the profits from every software vertical.

	▲	Bridged7756 an hour ago | parent [-]
		Yeah it seems so. Anthropic has entered the enshittification phase. They got people hooked onto their SOTAs so it's now time to keep releasing marginal performance increase models at 40% higher token price. The problem is that both Anthropic and OpenAI have no income other than AI. Can't Google just drown them out with cheaper prices over the long run? It seems like an attrition battle to me.

▲ hgoel 2 hours ago | parent | prev | next [-]

The bump from 4.6 to 4.7 is not very noticeable to me in improved capabilities so far, but the faster consumption of limits is very noticeable.

I hit my 5 hour limit within 2 hours yesterday, initially I was trying the batched mode for a refactor but cancelled after seeing it take 30% of the limit within 5 minutes. Had to cancel and try a serial approach, consumed less (took ~50 minutes, xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly consumed much faster than with 4.6.

It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.

For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.

▲

_blk 2 hours ago | parent [-]

From what I understand you shouldn't wait more than 5min between prompts without compacting or clearing or you'll pay for reinitializing the cache. With compaction you still pay but it's less input tokens. (Is compaction itself free?)

	▲	conception an hour ago | parent | next [-]
		Yeah the caching change is probably 90% of “I run out of usage so fast now!” issues.
	▲	hgoel an hour ago | parent | prev [-]
		Ah I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other; the timings I mentioned were what Claude spent working.
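
To make the caching point in this subthread concrete, here is a hedged sketch of how a cold cache can change the input cost of a single turn. Everything numeric is an assumption for illustration: the 5-minute lifetime comes from the comment above, the read and write multipliers (0.1x and 1.25x of the base input price) are assumed values, and the $5 per million tokens base price and 200k-token context are hypothetical. It also simplifies by treating the whole context as either cached or not.

```python
CACHE_TTL_SECONDS = 5 * 60  # assumption: 5-minute cache lifetime, as described above


def turn_input_cost_usd(
    context_tokens: int,
    seconds_since_last_turn: float,
    base_price_per_mtok: float,
    read_mult: float = 0.1,    # assumption: cached reads billed at 0.1x base price
    write_mult: float = 1.25,  # assumption: cache (re)writes billed at 1.25x base price
) -> float:
    """Input-side cost of one turn, with the context fully cached or fully cold."""
    cache_hit = seconds_since_last_turn < CACHE_TTL_SECONDS
    mult = read_mult if cache_hit else write_mult
    return context_tokens * base_price_per_mtok * mult / 1_000_000


if __name__ == "__main__":
    warm = turn_input_cost_usd(200_000, 60, 5.0)    # prompt sent 1 minute later
    cold = turn_input_cost_usd(200_000, 600, 5.0)   # prompt sent 10 minutes later
    print(f"warm: ${warm:.2f}  cold: ${cold:.2f}  ratio: {cold / warm:.1f}x")
```

Under these assumed multipliers the same turn costs about an order of magnitude more on the input side once the cache has expired, which is why pauses between prompts can show up as faster limit consumption.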

▲ andai an hour ago | parent | prev | next [-]

For a fair comparison you need to look at the total cost, because 4.7 produces significantly fewer output tokens than 4.6, and seems to cost significantly less on the reasoning side as well.

Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:

https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...

Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.

I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.

On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)

▲ bertil 28 minutes ago | parent | prev | next [-]

My impression is that the quality of the conversation is unexpectedly better: more self-critical, the suggestions are always critical, the default choices consistently the best. I might not have as many harnesses as most people here, so I suspect it’s less obvious but I would expect this to make it far more valuable for people who haven’t invested as much.

After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.

▲ cooldk a minute ago | parent | prev | next [-]

Anthropic may have its biases, but its product is undeniably excellent.

▲ kalkin 3 hours ago | parent | prev | next [-]

AFAICT this uses a token-counting API so that it counts how many tokens are in the prompt, in two ways, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper, it might still be more expensive, but this comparison isn't really very useful.

▲

h14h 2 hours ago | parent | next [-]

For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted whether output offsets input will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.

	▲	theptip 10 minutes ago | parent [-]
		This is the right way of thinking end-to-end. Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
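
A small sketch of the $/task framing from this subthread. The per-token prices and token counts in the example task are hypothetical placeholders, not actual Opus pricing or measured usage; only the +$800 and -$1400 deltas come from the comment above.

```python
def cost_per_task_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Total cost of one task: input and output tokens at their own prices."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


# Hypothetical task: the newer model sees 30% more input tokens (tokenizer
# change) but emits fewer output tokens. Prices are placeholders.
old_cost = cost_per_task_usd(1_000_000, 400_000, 5.0, 25.0)
new_cost = cost_per_task_usd(1_300_000, 250_000, 5.0, 25.0)

# Deltas reported in the comment above: input +$800, output -$1400.
net_change_usd = 800 - 1400

if __name__ == "__main__":
    print(f"old: ${old_cost:.2f}  new: ${new_cost:.2f}")
    print(f"net change on the benchmark run: ${net_change_usd}")
```

The point of the example is only that a higher input bill and a lower total bill can coexist; which way it nets out depends on the input/output mix of the workload.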

▲

SkyPuncher 2 hours ago | parent | prev | next [-]

Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster with fewer steps because thinking corrected itself before it cycled.

I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.

▲

the_gipsy 2 hours ago | parent | prev | next [-]

With AIs, it seems like there never is a comparison that is useful.

	▲	jascha_eng an hour ago | parent [-]
		yup it's all vibes. And anthropic is winning on those in my book still

▲

manmal 3 hours ago | parent | prev [-]

Why is it not useful? Input token pricing is the same for 4.7. The same prompt costs roughly 30% more now, for input.

▲

dktp 2 hours ago | parent | next [-]

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall

▲

manmal 2 hours ago | parent [-]

Well, you’ll need the same prompt for input tokens?

▲

httgbgg an hour ago | parent [-]

Only the first one. Ideally now there is no second prompt.

	▲	manmal an hour ago | parent [-]
		Are you aware that every tool call produces output which also counts as input to the LLM?

▲

kalkin 2 hours ago | parent | prev [-]

That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

▲ someuser54541 3 hours ago | parent | prev | next [-]

Should the title here be 4.6 to 4.7 instead of the other way around?

▲

freak42 3 hours ago | parent | next [-]

absolutely!

▲

UltraSane 3 hours ago | parent | prev [-]

Writing Opus 4.6 to 4.7 does make more sense for people who read left to right.

▲

pixelatedindex 3 hours ago | parent | next [-]

I’m impressed with anyone who can read English right to left.

▲

jlongman 2 hours ago | parent | next [-]

You might like https://en.wikipedia.org/wiki/Boustrophedon

	▲	amulyabaral an hour ago | parent [-]
		Whoa! TIL! I struggled a bit to read this style at first, but felt it get easier after a few tries.

▲

einpoklum 2 hours ago | parent | prev [-]

Right to Left English - read can, who? Anyone with [which] impressed am I.

	▲	y1n0 2 hours ago | parent [-]
		Yoda, you that is?

▲

embedding-shape 3 hours ago | parent | prev | next [-]

But the page is not in a language that should be read right to left, doesn't that make that kind of confusing?

▲

usrnm 2 hours ago | parent [-]

Did you mean "right to left"?

▲

embedding-shape 2 hours ago | parent [-]

I very much did, it got too confusing even for me. Thanks!

	▲	UltraSane 19 minutes ago | parent [-]
		I kept mentally verifying that English is written left to right.

▲

bee_rider 2 hours ago | parent | prev [-]

Err, how so?

▲ gsleblanc 2 hours ago | parent | prev | next [-]

It's increasingly looking naive to assume scaling LLMs is all you need to get to full white-collar worker replacement. The attention mechanism / Hopfield network is fundamentally modeling only a small subset of the full human brain, and all the increasing sustained hype around bolted-on solutions for "agentic memory" is, in my opinion, glaring evidence that these SOTA transformers alone aren't sufficient even when you just limit the space to text. Maybe I'm just parroting Yann LeCun.

▲

aerhardt an hour ago | parent | next [-]

> you just limit the space to text

And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?

Every book's in the training corpus...

Is it just a matter of someone not having spent a hundred grand in tokens to do it?

▲

zozbot234 2 minutes ago | parent | next [-]

Never mind novels, it can't even write a good Reddit-style or HN-style comment. agentalcove.ai has an archive of AI models chatting to one another in "forum" style and even though it's a good show of the models' overall knowledge the AIisms are quite glaring.

▲

voxl an hour ago | parent | prev | next [-]

I know someone spending basically every day writing personal fan fiction stories using every model you can find. She doesn't want to share it, and does complain about it a lot, seems like maintaining consistency for something say 100 pages long is difficult

▲

conception an hour ago | parent | prev | next [-]

I don’t understand - there are hundreds/thousands of AI written books available now.

	▲	aerhardt an hour ago | parent [-]
		I've glossed over a few and one can immediately tell they don't meet the average writing level you'd see in a local workshop for writers, and much less that of Mann or Capote.

▲

colechristensen an hour ago | parent | prev [-]

Who says they can't? What's your bar that needs to be passed in order for "written a novella" to be achieved?

There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.

▲

aerhardt an hour ago | parent [-]

> What's your bar that needs to be passed

I provide four examples in my comment...

▲

colechristensen an hour ago | parent [-]

Your qualification for if an LLM can write a novella is it has to be as good as The Metamorphosis?

Yes, those are examples of novellas, surely you believe an LLM could write a bad novella? I'm not sure what your point is. Either you think it can't string the words together in that length or your standard is it can't write a foundational piece of literature that stays relevant for generations... I'm not sure which.

▲

aerhardt an hour ago | parent [-]

I don't think it can write something that's of a fraction of the quality of Kafka.

But GP's argument ("limit the space to text") could be taken to imply - and it seems to be a common implication these days - that LLMs have mastered the text medium, or that they will very soon.

> it can't write a foundational piece of literature

Why not, if this is a pure textual medium, the corpus includes all the great stories ever written, and possibly many writing workshops and great literature courses?

▲

colechristensen 34 minutes ago | parent [-]

I don't know what to tell you. It's more than a little absurd to make the qualification of being able to do something to be that the output has to be considered a great work of art for generations.

	▲	aerhardt 15 minutes ago | parent [-]
		I agree that the argument starts from a reduction to the absurd. So at least we can agree that AI hasn't mastered the text medium, without further qualification? And what about my argument, further qualified, which is that I don't think it could even write as well as a good professional writer - not necessarily a generational one?

▲

ACCount37 38 minutes ago | parent | prev [-]

You probably are.

The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.

The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs and more. It's not a 1:1 copy of the human brain, but that doesn't mean that it's incapable of reaching functional equivalence. Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

LeCun's arguments about "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, which is a benchmark specifically designed to be adversarial to LLMs and target the weakest capabilities of off the shelf LLMs, there is no AI class that beats LLMs.

▲

bigyabai 9 minutes ago | parent [-]

> Human brain isn't the only way to implement general intelligence - just the one that was the easiest for evolution to put together out of what it had.

The human brain is not a pretrained system. It's objectively more flexible than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).

▲

ACCount37 5 minutes ago | parent [-]

Human brain's "pre-training" is evolution cramming way too much structure into it. It "learns from scratch" the way it does because it doesn't actually learn from scratch.

I've seen plenty of wacky test-time training things used in ML nowadays, which is probably the closest to how the human brain learns. None are stable enough to go into the frontier LLMs, where in-context learning still reigns supreme. In-context learning is a "good enough" continuous learning approximation, it seems.

	▲	bigyabai a few seconds ago | parent [-]
		> In-context learning is a "good enough" continuous learning approximation, it seems.
		"it seems" is doing a herculean effort holding your argument up, in this statement. Say, how many "R"s are in Strawberry?

▲ glerk 2 hours ago | parent | prev | next [-]

I'd be ok with paying more if results were good, but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.

These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.

▲

Bridged7756 an hour ago | parent | next [-]

Mirrors my sentiment. Those tools seem mostly useful for a Google alternative, scaffolding tedious things, code reviewing, and acting as a fancy search.

It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.

	▲	danny_codes an hour ago | parent [-]
		I just don’t see how they’ll be able to make a profit. Open models have the same performance on coding tasks now. The incentives are all wrong. Why pay more for a model that’s no better and also isn’t open? It’s nonsense

▲

xpe an hour ago | parent | prev [-]

> ... but it seems like Anthropic is going for the Tinder/casino intermittent reinforcement strategy: optimized to keep you spending tokens instead of achieving results.

This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.

My prior: it is 10X to 20X more likely Anthropic has done something other than shift to a short-term "squeeze their customers" strategy (which I think is only around ~5%)

What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.

I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?

How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones?" strategy? Think on this. I'm not necessarily "pro" Anthropic. They could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case.

There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.

Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.

I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.

	▲	glerk 18 minutes ago | parent [-]
		> To claim to know a company's strategy as an outsider is messy stuff.
		I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not and it could as well be a side effect of those things that you mentioned. Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).

▲ rectang 2 hours ago | parent | prev | next [-]

For now, I'm planning to stick with Opus 4.5 as a driver in VSCode Copilot.

My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.

Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.

Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.

But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.

▲ tiffanyh 2 hours ago | parent | prev | next [-]

I was using Opus 4.7 just yesterday to help implement best practices on a single page website.

After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.

The entire HTML/CSS/JS was less than 300 lines of code.

I was shocked how fast it exhausted my usage limits.

▲

zaptrem an hour ago | parent | next [-]

What's your reasoning effort set to? Max now uses way more tokens and isn't suggested for most usecases. Even the new default (xhigh) uses more than the old default (medium).

▲

hirako2000 2 hours ago | parent | prev | next [-]

I haven't used Claude, because I suspected this sort of thing would come.

With an enterprise subscription, the bill gets bigger, but it's not like a VP can easily send a memo to all their staff that a migration is coming.

Individuals may end their subscription, that would appease the DC usage, and turn profits up.

▲

sync 2 hours ago | parent | prev | next [-]

Which plan are you on? I could see that happening with Pro (which I think defaults to Sonnet?), would be surprised with Max…

	▲	templar_snow 2 hours ago | parent | next [-]
		It eats even the Max plan like crazy.
	▲	tiffanyh 2 hours ago | parent | prev [-]
		Pro. It even gave me $20 free credits, and exhausted free credits nearly instantly.

▲

tomtomistaken 2 hours ago | parent | prev [-]

Are you using Claude subscription? Because that's not how it works there.

▲ couchdb_ouchdb an hour ago | parent | prev | next [-]

Comments here overall do not reflect my experience -- i'm puzzled how the vast majority are using this technology day to day. 4.7 is absolute fire and an upgrade on 4.6.

▲ hereme888 an hour ago | parent | prev | next [-]

> Opus 4.7 (Adaptive Reasoning, Max Effort) cost ~$4,406 to run the Artificial Analysis Intelligence Index, ~11% less than Opus 4.6 (Adaptive Reasoning, Max Effort, ~$4,970) despite scoring 4 points higher. This is driven by lower output token usage, even after accounting for Opus 4.7's new tokenizer. This metric does not account for cached input token discounts, which we will be incorporating into our cost calculations in the near future.
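
A quick check of the quoted percentage, using only the two dollar figures in the quote above; the helper name `relative_change` is an illustrative choice.

```python
def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


# Dollar figures as quoted above for the full benchmark run.
opus_46_cost_usd = 4970
opus_47_cost_usd = 4406

cost_change = relative_change(opus_46_cost_usd, opus_47_cost_usd)

if __name__ == "__main__":
    print(f"absolute saving: ${opus_46_cost_usd - opus_47_cost_usd}")
    print(f"relative change: {cost_change:.1%}")
```

The difference is $564, or a little over 11% of the 4.6 figure, which matches the "~11% less" in the quote.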

▲ autoconfig 2 hours ago | parent | prev | next [-]

My initial experience with Opus 4.7 has been pretty bad and I'm sticking to Codex. But these results are meaningless without comparing outcome. Whether the extra token burn is bad or not depends on whether it improves some quality / task completion metric. Am I missing something?

▲ zuzululu 2 hours ago | parent [-]

Same. I was excited about 4.7, but I'm seeing more anecdotes concluding it's not a big enough boost to justify the extra tokenflation. Sticking with Codex. Also, GPT 5.5 is set to come next week.

▲ templar_snow 2 hours ago | parent | prev | next [-]

Brutal. I've been noticing that 4.7 eats my Max Subscription like crazy even when I do my best to juggle tasks (or tell 4.7 to use subagents) with Sonnet 4.6 Medium and Haiku. Would love to know if anybody's found ideal token-saving approaches.

▲

copperx 2 hours ago | parent [-]

I haven't seen a noticeable difference BUT I've always been using the context mode plugin.

▲

templar_snow 25 minutes ago | parent | next [-]

You mean this? https://github.com/mksglu/context-mode Is it actually good or is this an ad?

▲

FireBeyond an hour ago | parent | prev [-]

What plugin is this?

▲ vidarh an hour ago | parent [-]

I assume they mean: https://github.com/mksglu/context-mode

▲ throwatdem12311 an hour ago | parent | prev | next [-]

Price is now getting to be more in line with the actual cost. The models are dumber, slower and more expensive than what we’ve been paying up until now. OpenAI will do it too, maybe a bit less to avoid pissing people off after seeing backlash to Anthropic’s move here. Or maybe they won’t make it dumber but they’ll increase the price while making a dumber mode the baseline so you’re encouraged to pay more. Free ride is over. Hope you have 30k burning a hole in your pocket to buy a beefy machine to run your own model. I hear Mac Studios are good for local inference.

▲ tailscaler2026 3 hours ago | parent | prev | next [-]

Subsidies don't last forever.

▲

pitched 2 hours ago | parent | next [-]

Running an open model like Kimi constantly for an entire month will cost around $100-200, roughly equal to a pro-tier subscription. This is not my estimate so I’m more than open to hearing refutations. Kimi isn’t at all Opus-level intelligent but the models are roughly evenly sized from the guesses I’ve seen. So I don’t think it’s the infra being subsidized as much as it’s the training.

▲ nothinkjustai 2 hours ago | parent | next [-]

Kimi costs 0.3/$1.72 on OpenRouter, $200 for that gives you way more than you would get out of a $200 Claude subscription. There are also various subscription plans you can use to spend even less.

▲ varispeed an hour ago | parent | prev | next [-]

How do you get anything sensible out of Kimi?

▲ senordevnyc an hour ago | parent | prev [-]

I’m using Composer 2, Cursor’s model they built on top of Kimi, and it’s great. Not Opus level, but I’m finding many things don’t need Opus level.

▲

smt88 2 hours ago | parent | prev | next [-]

Tell that to oil and defense companies.

If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.

▲

andai 2 hours ago | parent [-]

Yeah, USA winning on AI is a national security issue. The bubble is unpoppable.

And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)

▲ danny_codes 43 minutes ago | parent [-]

It’s considered a national security concern by this administration. Will the next be a clown show like this one? Unclear.

▲

gadflyinyoureye 2 hours ago | parent | prev [-]

I've been assuming this for a while. If I have a complex feature, I use Opus 4.6 in Copilot to plan (3 units of my monthly limit). Then I have Grok or Gemini (0.25-0.33 of my monthly units) implement and verify the work. 80% of the time it works every time. Leaves me plenty of usage over the month.

▲

andai 2 hours ago | parent [-]

Yeah I've been arriving at the same thing. The other models give me way more usage but they don't seem to have enough common sense to be worth using as the main driver.

If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.

(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)

▲ zozbot234 10 minutes ago | parent [-]

I don't think there's any ToS violation involved? AIUI you can use GPT models with any harness, at least at present. You could nonetheless have Codex write up the plan to an .md file for Claude (perhaps Sonnet or even Haiku?) to execute.

▲ fathermarz an hour ago | parent | prev | next [-]

I have been seeing this messaging everywhere and I have not noticed this. I have had the inverse with 4.7 over 4.6.

I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.

Did y’all forget Opus 4? That was not that long ago that Claude was essentially unusable then. We are peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.

▲ anabranch 4 hours ago | parent | prev | next [-]

I wanted to better understand the potential impact of the tokenizer change from 4.6 to 4.7.

I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.

▲ pawelduda 3 hours ago | parent [-]

Not very encouraging for longer use, especially given that the longer the conversation, the higher the chance the agent will go off the rails.

▲ bobjordan 2 hours ago | parent | prev | next [-]

I've spent the past 4+ months building an internal multi-agent orchestrator for coding teams. Agents communicate through a coordination protocol we built, and all inter-agent messages plus runtime metrics are logged to a database.

Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.

I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.

Opus was launched with:

`export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 && claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`

Codex was launched with:

`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`

What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.

I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.

So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.

▲ pitched 2 hours ago | parent [-]

I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.

▲ razodactyl 2 hours ago | parent | prev | next [-]

If anyone's had 4.7 update any documents so far - notice how concise it is at getting straight to the point. It rewrote some of my existing documentation (using Windsurf as the harness), not sure I liked the decrease in verbosity (removed columns and combined / compressed concepts) but it makes sense in respect to the model outputting less to save cost.

To me this seems more that it's trained to be concise by default which I guess can be countered with preference instructions if required.

What's interesting to me is that they're using a new tokeniser. Does it mean they trained a new model from scratch? Used an existing model and further trained it with a swapped out tokeniser?

The looped model research / speculation is also quite interesting - if done right there's significant speed up / resource savings.

▲ andai 2 hours ago | parent [-]

Interesting. In conversational use, it's noticeably more verbose.

▲ KellyCriterion 2 hours ago | parent | prev | next [-]

Yesterday, I killed my weekly limit with just three prompts and went into extra usage for ~18USD on top

▲ nmeofthestate an hour ago | parent | prev | next [-]

Is this a weird way of saying Opus got "cheaper" somehow from 4.6 to 4.7?

▲ QuadrupleA an hour ago | parent | prev | next [-]

One thing I don't see often mentioned - OpenAI API's auto token caching approach results in MASSIVE cost savings on agent stuff. Anthropic's deliberate caching is a pain in comparison. Wish they'd just keep the KV cache hot for 60 seconds or so, so we don't have to pay the input costs over and over again, for every growing conversation turn.
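To make the cost of re-sending a growing conversation concrete, here is a minimal sketch. It assumes every turn adds the same number of tokens and that a cached prefix is billed at a fixed fraction of the normal input price; the `cached_fraction=0.1` value and the workload numbers are illustrative assumptions, not any provider's published rates.

```python
def input_tokens_billed(turns: int, tokens_per_turn: int,
                        cached_fraction: float = 1.0) -> float:
    """Full-price-equivalent input tokens billed across a whole conversation.

    On turn k the entire history of k * tokens_per_turn tokens is sent.
    The first (k - 1) * tokens_per_turn tokens were already sent on the
    previous turn and are billed at `cached_fraction` of the normal price
    (1.0 means no caching). Only the newest tokens_per_turn are billed in full.
    """
    total = 0.0
    for k in range(1, turns + 1):
        prefix = (k - 1) * tokens_per_turn
        total += prefix * cached_fraction + tokens_per_turn
    return total


# Hypothetical agent session: 50 turns, 2,000 new tokens per turn.
no_cache = input_tokens_billed(turns=50, tokens_per_turn=2_000)
with_cache = input_tokens_billed(turns=50, tokens_per_turn=2_000,
                                 cached_fraction=0.1)

print(f"no cache:   {no_cache:,.0f} token-equivalents")
print(f"with cache: {with_cache:,.0f} token-equivalents")
```

Without caching the billed input grows quadratically with the number of turns, which is why prefix caching matters so much for agent-style workloads.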

▲ monkpit 2 hours ago | parent | prev | next [-]

Does this have anything to do with the default xhigh effort?

▲ ausbah 3 hours ago | parent | prev | next [-]

is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money

▲

zozbot234 2 hours ago | parent | next [-]

> is it really unthinkable that another oss/local model will be released by deepseek, alibaba, or even meta that once again give these companies a run for their money

Plenty of OSS models being released as of late, with GLM and Kimi arguably being the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.

▲

rectang 2 hours ago | parent | prev | next [-]

For my working style (fine-grained instructions to the agent), Opus 4.5 is basically ideal. Opus 4.6 and 4.7 seem optimized for more long-running tasks with less back and forth between human and agent; but for me Opus 4.6 was a regression, and it seems like Opus 4.7 will be another.

This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less-and-less appropriate for my style, that a competitor can build a model akin to Opus 4.5 which is suitable for my workflow, optimizing for other factors like cost.

▲

amelius 3 hours ago | parent | prev | next [-]

I'm betting on a company like Taalas making a model that is perhaps less capable but 100x as fast, where you could have dozens of agents looking at your problem from all different angles simultaneously, and so still have better results and faster.

▲

andai 2 hours ago | parent [-]

Yeah, it's a search problem. When verification is cheap, reducing success rate in exchange for massively reducing cost and runtime is the right approach.
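A rough way to see the trade-off is to compute the expected spend until the first verified success. This sketch assumes attempts are independent, every attempt is followed by one verification, and all the numbers are hypothetical. It ignores indirect costs such as the quality of the code that passes verification.

```python
def expected_cost(success_rate: float, cost_per_attempt: float,
                  cost_per_check: float) -> float:
    """Expected spend until the first verified success.

    With independent attempts, the number of attempts is geometric with
    mean 1 / success_rate, and each attempt incurs one verification.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (cost_per_attempt + cost_per_check) / success_rate


# Hypothetical figures: a strong, expensive model vs. a weak, cheap one.
strong = expected_cost(success_rate=0.9, cost_per_attempt=1.00, cost_per_check=0.01)
weak = expected_cost(success_rate=0.3, cost_per_attempt=0.05, cost_per_check=0.01)

print(f"strong model: ${strong:.2f} expected per solved task")
print(f"weak model:   ${weak:.2f} expected per solved task")
```

The cheaper model only wins while `cost_per_check` stays small; as verification gets more expensive, the lower success rate multiplies that cost too.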

▲ never_inline 2 hours ago | parent [-]

You're underestimating the algorithmic complexity of such brute forcing, and the indirect cost of brittle code that's produced by inferior models.

▲

casey2 14 minutes ago | parent | prev | next [-]

This regression put Anthropic behind Chinese models actually.

▲

embedding-shape 3 hours ago | parent | prev | next [-]

Nothing is unthinkable. I could think of a Transformers.V2 that might look completely different, maybe iterations on Mamba turn out fruitful, or countless other scenarios.

▲

pitched 3 hours ago | parent | prev | next [-]

Now that Anthropic have started hiding the chain of thought tokens, it will be a lot harder for them

▲ zozbot234 2 hours ago | parent [-]

Anthropic and OpenAI never showed the true chain of thought tokens. Ironically, that's something you only get from local models.

▲

slowmovintarget 3 hours ago | parent | prev [-]

Qwen released a new model the same day (3.6). The headline was kind of buried by Anthropic's release, though.

https://news.ycombinator.com/item?id=47792764

▲ jimkleiber 2 hours ago | parent | prev | next [-]

I wonder if this is like when a restaurant introduces a new menu to increase prices.

Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?

I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.

▲ hopfenspergerj 2 hours ago | parent [-]

You can't accidentally retrain a model to use a different tokenizer. It changes the input vectors to the model.

▲ napolux 2 hours ago | parent | prev | next [-]

Token consumption is huge compared to 4.6 even for smaller tasks. Just by "reasoning" after my first prompt this morning I went over 50% of the 5-hour quota.

▲ eezing 43 minutes ago | parent | prev | next [-]

Not sure if this equates to more spend. Smarter models make fewer mistakes and thus fewer round trips.

▲ alphabettsy 2 hours ago | parent | prev | next [-]

I’m trying to understand how this is useful information on its own?

Maybe I missed it, but it doesn’t tell you if it’s more successful for less overall cost?

I can easily make Sonnet 4.6 cost way more than any Opus model because while it’s cheaper per prompt it might take 10x more rounds to solve a problem (or never solve it).

▲ senordevnyc an hour ago | parent [-]

Everything in AI moves super quickly, including the hivemind. Anthropic was the darling a few weeks ago after the confrontation with the DoD, but now we hate them because they raised their prices a little. Join us!

▲ aray07 2 hours ago | parent | prev | next [-]

Came to a similar conclusion after running a bunch of tests on the new tokenizer

It was on the higher end of Anthropic's range - closer to 30-40% more tokens

https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...

▲ coldtea 3 hours ago | parent | prev | next [-]

This, the push towards per-token API charging, and the rest are just a sign of things to come when they finally establish a moat and full monopoly/duopoly, which is also what all the specialized tools like Designer and integrations are about.

It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be like if we reversed the democratization of compilers and coding tooling, done in the 90s and 00s, and the polished more capable tools are again all proprietary.

▲

danny_codes 37 minutes ago | parent | next [-]

I doubt that’s the case. My guess is we’ll hit asymptotic returns from transformers, but price-to-train will fall at the pace of Moore’s law.

So over time older models will be less valuable, but new models will only be slightly better. Frontier players, therefore, are in a losing business. They need to charge high margins to recoup their high training costs. But latecomers can simply train for a fraction of the cost.

Since performance is asymptotic, eventually the first-mover advantage is entirely negligible and LLMs become a simple commodity.

The only moat I can see is data, but distillation proves that this is easy to subvert.

There will probably be a window though where insiders get very wealthy by offloading onto retail investors, who will be left with the bag.

▲

quux 3 hours ago | parent | prev | next [-]

If only there were an Open AI company whose mandate, built into the structure of the company, were to make frontier models available to everyone for the good of humanity.

Oh well

▲ slowmovintarget 2 hours ago | parent [-]

Things used to be better... really. OpenAI was built as you say. Google had a corporate motto of "Don't be evil" which they removed so they could, um, do evil stuff without cognitive dissonance, I guess. This is the other kind of enshittification where the businesses turn into power accumulators.

▲

throwaway041207 3 hours ago | parent | prev [-]

Yep, between this and the pricing for the code review tool that was released a couple weeks ago (15-25 a review), and the usage pricing and very expensive cost of Claude Design, I do wonder if Anthropic is making a conscious, incremental effort to raise the baseline for AI engineering tasks, especially for enterprise customers.

You could call it a rug pull, but they may just be doing the math and realizing this is where pricing needs to shift to before going public.

▲ zozbot234 2 hours ago | parent [-]

There's been speculation that the code review might actually be Mythos. It would seem to explain the cost.

▲ ivanfioravanti 2 hours ago | parent | prev | next [-]

Probably due to the new tokenizer: https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...

▲ ben8bit 2 hours ago | parent | prev | next [-]

Makes me think the model could actually not even be smarter necessarily, just more token dependent.

▲ hirako2000 2 hours ago | parent [-]

Asking a seller to sell less. That's an incentive difficult to reconcile with the user's benefit. To keep this business running they do need to invest to make the best model, period. It happens to be exactly what Anthropic's strategy is. That and great tooling.

▲ l5870uoo9y 2 hours ago | parent | prev | next [-]

My impression is that the reverse is true when upgrading to GPT-5.4 from GPT-5; it uses fewer tokens(?).

▲ andai 2 hours ago | parent [-]

But with the same tokenizer, right?

The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)

> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.

> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.

ArtificialAnalysis reports 4.7 significantly reduced output tokens though, and overall ~10% cheaper to run the evals.

I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.
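As a worked example of what the quoted figures imply, the sketch below applies the quoted prices ($5 and $25 per million tokens) and the quoted upper bound of 1.35x to a made-up request. The workload size is hypothetical, and applying the same factor to output assumes the model writes the same text as before, which the ArtificialAnalysis numbers suggest may not hold.

```python
# Prices quoted above: $5 per million input tokens, $25 per million output tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def inflated_cost(input_tokens: int, output_tokens: int, factor: float) -> float:
    """Cost of the same text if the tokenizer emits `factor` times as many tokens."""
    return request_cost(round(input_tokens * factor), round(output_tokens * factor))


# Hypothetical request: 200k input tokens and 10k output tokens on the old tokenizer.
base = request_cost(200_000, 10_000)
worst = inflated_cost(200_000, 10_000, factor=1.35)

print(f"old tokenizer:          ${base:.4f}")
print(f"new tokenizer at 1.35x: ${worst:.4f}")
```

Because the per-token price is unchanged, the bill for identical text scales linearly with the tokenizer factor.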

▲ silverwind 2 hours ago | parent | prev | next [-]

Still worth it imho for important code, but it shows that they are hitting a ceiling while trying to improve the model which they try to solve by making it more token-inefficient.

▲ matt3210 2 hours ago | parent | prev | next [-]

Did anyone expect the price to go down? The point of new models is to raise prices

▲ operatingthetan 2 hours ago | parent | next [-]

The long-term pitch of these AI companies is that the AI will essentially replace workers for low cost. If the models don't get to a higher level of 'intelligence' and still struggle with certain basic tasks at the SOTA while also getting more expensive, then the pitch is misleading and unlikely to happen. So yes, I expect the price to go down.

▲ ant6n 2 hours ago | parent | prev [-]

I thought it would be to get better, to stay competitive with the competitors and free models.

▲ blahblaher 2 hours ago | parent | prev | next [-]

Conspiracy time: they released a new version just so they could increase the price so that people wouldn't complain so much (along the lines of "see, this is a new version model, so we NEED to increase the price"), similar to how SaaS companies tack on some shit to the product so that they can increase prices.

▲ willis936 2 hours ago | parent [-]

The result is the same: they lose their brand of producing quality output. However the more clever the maneuver they try to pull off the more clear it is to their customers that they are not earning trust. That's what will matter at the end of this. Poor leadership at Claude.

▲ axeldunkel 2 hours ago | parent | prev | next [-]

The better the tokenizer maps text to its internal representation, the better the model understands what you are saying - or coding! But 4.7 is much more verbose in my experience, and this probably drives cost/limits a lot.

▲ gverrilla 33 minutes ago | parent | prev | next [-]

Yeah I'm seriously considering dropping my Max subscription, unless they do something in the next few days - something like dropping Sonnet 4.7 cheap and powerful.

▲ dackdel 2 hours ago | parent | prev | next [-]

releases 4.8 and deletes everything else. and now 4.8 costs 500% more than 4.7. i wonder what it would take for people to start using kimi or qwen or other such.

▲ Shailendra_S 2 hours ago | parent | prev | next [-]

45% is brutal if you're building on top of these models as a bootstrapped founder. The unit economics just don't work anymore at that price point for most indie products.

What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
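A dual-model setup like this can be as simple as a routing function. The sketch below is illustrative only: the model identifiers, the `Task` type and the `customer_facing` flag are hypothetical names, not anything from a real provider's API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    customer_facing: bool


# Hypothetical model identifiers; substitute whatever your provider offers.
CHEAP_MODEL = "cheap-fast-model"
PREMIUM_MODEL = "premium-model"


def pick_model(task: Task) -> str:
    """Send customer-facing work to the premium model, everything else to the cheap one."""
    return PREMIUM_MODEL if task.customer_facing else CHEAP_MODEL


tasks = [
    Task("Summarize raw logs for internal triage", customer_facing=False),
    Task("Draft the reply shown to the user", customer_facing=True),
]
routes = [pick_model(t) for t in tasks]

for task, model in zip(tasks, routes):
    print(f"{model:>16}  <-  {task.prompt}")
```

In practice the routing signal could be anything measurable (task type, input length, a retry after a failed check); a boolean flag is just the simplest version.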

The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.

▲ OptionOfT 2 hours ago | parent | next [-]

That's the risk you take on. There are 2 things to consider:

* Time to market.
* Building a house on someone else's land.

You're balancing the 2, hoping that you win the time to market, making the second point obsolete from a cost perspective, or you have money to pivot to DIY.

▲ c0balt 2 hours ago | parent | prev | next [-]

One could reconsider whether building your business on top of a model without owning the core skills to make your product is viable regardless. A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.

▲ duped 2 hours ago | parent | prev [-]

> if you're building on top of these models as a bootstrapped founder

This is going to be blunt, but this business model is fundamentally unsustainable and "founders" don't get to complain their prospecting costs went up. These businesses are setting themselves up to get Sherlocked. The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.

▲ mvkel 2 hours ago | parent | prev | next [-]

The cope is real with this model. Needing an instruction manual to learn how to prompt it "properly" is a glaring regression.

The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how perfectly I articulated it.

Now Anthropic says that needing to explicitly define instructions is a "feature"?!

▲ varispeed an hour ago | parent | prev | next [-]

I spent one day with Opus 4.7 to fix a bug. It just ran in circles despite having the problem "in front of its eyes" with all supporting data, thorough description of the system, test harness that reproduces the bug etc. While I still believe 4.7 is much "smarter" than GPT-5.4, I decided to give it a go. It was giving me dumb answers and going off the rails. After accusing it many times of being a fraud and doing it on purpose so that I spend more money, it fixed the bug in one shot.

Having a taste of unnerfed Opus 4.6 I think that they have a conflict of interest - if they let models give the right answer first time, person will spend less time with it, spend less money, but if they make model artificially dumber (progressive reasoning if you will), people get frustrated but will spend more money.

It is likely happening because economics doesn't work. Running comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users - something gotta give.

▲ DeathArrow an hour ago | parent | prev | next [-]

We (my wallet and I) are pretty happy with GLM 5.1 and MiniMax 2.7.

▲ ai_slop_hater 3 hours ago | parent | prev | next [-]

Does anyone know what changed in the tokenizer? Does it output multiple tokens for things that were previously one token?

▲ quux 3 hours ago | parent [-]

It must, if it now outputs more tokens than 4.6's tokenizer for the same input. I think the announcement and model cards provide a little more detail as to what exactly is different.

▲ therobots927 3 hours ago | parent | prev | next [-]

Wow this is pretty spectacular. And with the losses anthro and OAI are running, don’t expect this trend to change. You will get incremental output improvements for a dramatically more expensive subscription plan.

▲

falcor84 3 hours ago | parent [-]

Indeed, and if we accept the argument of this tech approaching AGI, we should expect that within x years, the subscription cost may exceed the salary cost of a junior dev.

To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.

▲ dgellow 2 hours ago | parent | next [-]

If LLMs do reach AGI (assuming we have an actual agreed upon definition), it would make sense to pay way more than a junior salary. But also, LLMs won’t give us AGI (again, assuming we have an actual, meaningful definition)

▲ therobots927 an hour ago | parent | prev [-]

I absolutely do not accept that argument. It’s clear models hit a plateau roughly a year ago and all incremental improvements come at an increasingly higher cost. And junior devs have never added much value. The first two years of any engineer’s career is essentially an apprenticeship. There’s no value add from having a perpetually junior “employee”.

▲ QuadrupleA an hour ago | parent | prev | next [-]

Definitely seems like AI money got tight the last month or two - that the free beer is running out and enshittification has begun.

▲ justindotdev 3 hours ago | parent | prev | next [-]

i think it is quite clear that staying with opus 4.6 is the way to go, on top of the inflation, 4.7 is quite... dumb. i think they have lobotomized this model while they were prioritizing cybersecurity and blocking people from performing potentially harmful security related tasks.

▲

bcherny 3 hours ago | parent | next [-]

Hey, Boris from the Claude Code team here. People were getting extra cyber warnings when using old versions of Claude Code with Opus 4.7. To fix it, just run claude update to make sure you're on the latest.

Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs them. When we showed these reminders to 4.7 it tended to over-fixate on them. The fix was to stop adding cyber reminders.

More here: https://x.com/ClaudeDevs/status/2045238786339299431

▲

bakugo 2 hours ago | parent [-]

How do you justify the API and web UI versions of 4.7 refusing to solve NYT Connections puzzles due to "safety"?

https://x.com/LechMazur/status/2044945702682309086

▲

templar_snow 2 hours ago | parent [-]

To be fair, reading the New York Times is a safety risk for any intelligent life form these days. But still.

▲ maleldil 2 hours ago | parent [-]

You don't need to subscribe to the NYT to play the games. There's a separate subscription.

▲

vessenes 3 hours ago | parent | prev [-]

4.7 is super variable in my one day experience - it occasionally just nails a task. Then I'm back to arguing with it like it's 2023.

▲ aenis 2 hours ago | parent | next [-]

My experience as well, unfortunately. I am really looking forward to reading, in a few years, a proper history of the wild west years of AI scaling. What is happening in those companies at the moment must be truly fascinating. How is it possible, for instance, that I never, ever, had an instance of not being able to use Claude despite the runaway success it had, and - I'd guess - exponential increase in infra needs. When I run production workloads on Vertex or Bedrock I am routinely confronted with quotas; here, it always works.

▲ dgellow 2 hours ago | parent | prev [-]

That has been my Friday experience as well… very frustrating to go back to the arguing, I forgot how tense that makes me feel

▲ micromacrofoot 2 hours ago | parent | prev | next [-]

The latest qwen actually performs a little better for some tasks, in my experience

latest claude still fails the car wash test

▲ fny 3 hours ago | parent | prev | next [-]

I'm going to suggest what's going on here is Hanlon's Razor for models: "Never attribute to malice that which is adequately explained by a model's stupidity."

In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given all providers are still competing for customers and a 50% token increase drives infra costs up dramatically too.

▲ willis936 2 hours ago | parent [-]

Never attribute to incompetence what is sufficiently explained by greed.

▲ bparsons 2 hours ago | parent | prev | next [-]

Had a pretty heavy workload yesterday, and never hit the limit on Claude Code. Perhaps they allowed for more tokens for the launch?

Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.

▲ alekseyrozh an hour ago | parent | prev | next [-]

Is it just me? I don't feel a difference between 4.6 and 4.7

▲ monkeydust 2 hours ago | parent | prev [-]

'sixxxx, seeeeven'....sorry have little kids, couldn't resist but perhaps that explains what's going on!