Bolwin 3 hours ago

Yeah, that's a massive assumption they're making. I remember Musk revealed Grok was multiple trillion parameters. I find it likely Opus is larger.

I'm sure Anthropic is making money off the API but I highly doubt it's 90% profit margins.

jychang 2 hours ago | parent | next [-]

> I find it likely Opus is larger.

Unlikely. Amazon Bedrock serves Opus at 120 tokens/sec.

If you want to estimate the actual price to serve Opus, a good rough estimate is to take max(price of DeepSeek, Qwen, Kimi, GLM) and multiply it by 2-3. That gets you pretty close to the actual inference cost for Opus.

It's impossible for Opus to have something like 10x the active params of the Chinese models. My guess is somewhere around 50-100B active params, 800-1600B total params. I could be off by a factor of ~2, but I know I'm not off by a factor of 10.
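The heuristic above is simple enough to sketch: take the priciest comparable open model and multiply by 2-3. All prices here are hypothetical placeholders, not real quotes.

```python
# Sketch of the "max(open model prices) x 2-3" heuristic.
# Every price is an assumed placeholder, not a real published rate.
open_model_price_per_mtok = {  # $/1M output tokens (assumed figures)
    "DeepSeek": 1.10,
    "Qwen": 1.20,
    "Kimi": 1.00,
    "GLM": 0.90,
}
baseline = max(open_model_price_per_mtok.values())
low, high = 2 * baseline, 3 * baseline
print(f"Rough Opus serving cost: ${low:.2f}-${high:.2f} per 1M output tokens")
```

With these made-up inputs the estimate lands at 2-3x the most expensive open model; plug in real price sheets to get an actual number.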

simianwords 2 hours ago | parent [-]

Are you sure you can use tps as a proxy?

jychang an hour ago | parent [-]

In practice, tps is a reflection of VRAM memory bandwidth during inference, so the tps tells you a lot about the hardware you're running on.

Comparing tps ratios (saying a model is roughly 2x faster or slower than another) can tell you a lot about the active param count.

I won't say it tells you everything; I have no clue what optimizations Opus may have, which could range from native FP4 experts to speculative decoding with MTP to whatever. But considering that Chinese models like DeepSeek and GLM have MTP layers (no clue if Qwen 3.5 has MTP; I haven't checked since its release), and Kimi is native int4, I'm pretty confident there is not a 10x difference between Opus and the Chinese models. I'd say there's roughly a 2x-3x difference between Opus 4.5/4.6 and the Chinese models at most.
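The bandwidth argument reduces to a toy model: at batch size 1, each decoded token streams every active weight from HBM once, so tps is roughly effective bandwidth divided by active bytes. All numbers below (param counts, bandwidth, utilization) are assumptions for illustration, not measurements of any real deployment.

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound MoE model.
# Every number here is an assumed illustration, not a measured figure.
def decode_tps(active_params_b, bytes_per_param, hbm_bw_tbps, utilization=0.5):
    """Tokens/sec at batch size 1: each token reads all active weights once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    effective_bw = hbm_bw_tbps * 1e12 * utilization
    return effective_bw / bytes_per_token

# Two hypothetical models on the same (assumed) hardware:
small = decode_tps(active_params_b=37, bytes_per_param=1, hbm_bw_tbps=3.35)
big = decode_tps(active_params_b=111, bytes_per_param=1, hbm_bw_tbps=3.35)
ratio = small / big  # tps ratio tracks the inverse active-param ratio
```

On identical hardware the unknowns (bandwidth, utilization, quantization width) cancel out of the ratio, which is why tps comparisons at the same provider say something about relative active param counts.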

fc417fc802 40 minutes ago | parent [-]

> In practice, tps is a reflection of vram memory bandwidth during inference.

> Comparing tps ratios- by saying a model is roughly 2x faster or slower than another model- can tell you a lot about the active param count.

You sure about that? I thought you could shard between GPUs along layer boundaries during inference (but obviously not during training). You just end up with an increasingly deep pipeline, so time to first token increases, but aggregate tps also increases as you add additional hardware.

jychang 27 minutes ago | parent [-]

That doesn't work. Think about it a bit more.

Hint: what's in the kv cache for the 2nd token?

And that's called layer parallelism (as opposed to tensor parallelism). It lets you run larger models by pooling VRAM across GPUs, but it does not make them run faster.

Tensor parallelism DOES let you run models faster across multiple GPUs, but you're limited by how fast you can synchronize the all-reduce. And in general, models get the same boost on the same hardware, so the Chinese models would have the same perf multiplier as Opus.

Note that providers generally use tensor parallelism as much as they can, for all models. That usually means 8x or so.

In reality, tps ends up being a pretty good proxy for active param size when comparing different models at the same inference provider.
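The layer-vs-tensor distinction above can be sketched with a toy latency model. The key point is the KV-cache hint: token t+1 can't start until token t has finished every layer, so pipeline stages sit idle within a single sequence. Layer count, per-layer time, and all-reduce cost below are all assumed illustrative values.

```python
# Toy per-token latency model for single-stream autoregressive decode.
# All numbers are illustrative assumptions, not measurements.
LAYERS = 80
PER_LAYER_MS = 0.5  # time for one layer on one GPU (assumed)

def decode_latency_pipeline(n_gpus):
    # Layer (pipeline) parallelism: token t+1 depends on token t's output
    # and its KV cache, so stages cannot overlap within one sequence.
    # Each token still traverses all layers sequentially: no speedup.
    return LAYERS * PER_LAYER_MS

def decode_latency_tensor(n_gpus, allreduce_ms=0.05):
    # Tensor parallelism: each layer's matmuls are split across GPUs,
    # at the cost of one all-reduce synchronization per layer.
    return LAYERS * (PER_LAYER_MS / n_gpus + allreduce_ms)

pipeline_ms = decode_latency_pipeline(8)  # same at any GPU count
tensor_ms = decode_latency_tensor(8)      # faster, bounded by all-reduce
```

Pipeline parallelism still helps aggregate throughput (different sequences occupy different stages), which is why time to first token and batch tps improve even though single-stream tps does not.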

nbardy an hour ago | parent | prev | next [-]

You can estimate from tok/sec.

The trillions-of-parameters claim is about pretraining.

It's most efficient in pretraining to train the biggest models possible: you get a sample-efficiency increase for each parameter increase.

However, those models end up very sparse and incredibly distillable.

And it's way too expensive and slow to serve models that size, so they are distilled down a lot.

2 hours ago | parent | prev | next [-]
[deleted]
aurareturn 2 hours ago | parent | prev [-]

Anthropic's CEO said 50%+ margins in an interview. I'm guessing 50-60% right now.