sberens 4 hours ago

For comparison, openrouter says opus 4.8 is ~55 tokens/s and fast mode is ~102.

750 tokens/s for their largest model is going to be nuts

windexh8er 2 hours ago | parent | next [-]

What about 15k tokens per second? [0] I remember looking at this earlier in the year and it being so fast that it feels fake. And, yes, this model is old - but still awesome for what it is.

[0] https://chatjimmy.ai/

Kirby64 an hour ago | parent | next [-]

It’s not just old, it’s also tiny and quantized. It’s llama 3.1 8b at 3/6-bit quant. This is the type of thing you can run on almost any device…

partsch an hour ago | parent | prev [-]

They baked the LLM into a CPU

comboy 2 hours ago | parent | prev | next [-]

But it seems that there is some queuing/load balancing on their side. I mean, when opus is actually outputting at 55t/s it feels fast, but apart from its internal reasoning I think there's sometimes just waiting.

fragmede an hour ago | parent [-]

Oh wait yeah good point. At 750 tokens a second and the same amount of human patience, they can set it to think for the same amount of time but with four or five times as many thinking tokens, which may improve the quality of the eventual output.
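A rough sketch of the arithmetic behind that point: at a fixed wait, the token budget scales linearly with throughput. The rates are the ones quoted upthread (55, 102 and 750 tokens/s); the 60-second wait and the function names are assumptions for illustration, and this ignores queuing and time to first token.

```python
# Tokens a model can emit in a fixed wall-clock wait, at several throughputs.
# Rates are the figures quoted in this thread; WAIT_SECONDS is hypothetical.

RATES = {
    "opus (standard)": 55,
    "opus (fast mode)": 102,
    "750 tok/s": 750,
}
WAIT_SECONDS = 60  # assumed amount of "human patience"


def tokens_in_budget(tokens_per_second: float, seconds: float) -> float:
    """Tokens produced in `seconds` at a steady `tokens_per_second`."""
    return tokens_per_second * seconds


def budget_table(rates: dict, seconds: float) -> dict:
    """Map each name to (tokens in the wait, multiple of the slowest rate)."""
    baseline = min(rates.values())
    return {
        name: (tokens_in_budget(rate, seconds), rate / baseline)
        for name, rate in rates.items()
    }


for name, (tokens, ratio) in budget_table(RATES, WAIT_SECONDS).items():
    print(f"{name:>18}: {tokens:>7.0f} tokens ({ratio:.1f}x the slowest)")
```

With these inputs the multiple depends heavily on which baseline is used, so the exact "times as many thinking tokens" figure is only as good as the quoted rates.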

gandreani 3 hours ago | parent | prev | next [-]

Using gpt-5.4-mini in off-peak hours already feels like super-speed to me. That's probably no more than 100-150 tk/s. I can't imagine 750!

I've always eyed Cerebras but never had a use for it that would justify paying for the API directly. Although now that I think about it, trying out the API would probably cost less than a subscription for a month...

jasonjmcghee 2 hours ago | parent | next [-]

Try gpt-5.3-codex-spark - it's 1000 TPS and from my experience more capable than 5.4 mini.

If you have a subscription it's a different pool of usage.

small_model 2 hours ago | parent [-]

Used it, very fast but tiny context window and doesn't have good reasoning. (good for quick simple code changes)

beering an hour ago | parent [-]

Agreed, 1000tok/s just fills up the context window (which is big by 2004 standards) super fast. But seems like 5.3-spark was just a taste of what’s to come.
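To put a number on "fills up the context window super fast", here is a back-of-the-envelope sketch. The window size is not given in this thread, so the 128,000-token figure is purely hypothetical, and the calculation counts only generated tokens (prompt and tool output would fill the window sooner).

```python
# Time for generated output alone to fill a context window at a steady rate.
# CONTEXT_TOKENS is a hypothetical window size, not a figure from the thread.

CONTEXT_TOKENS = 128_000
RATES = (55, 102, 750, 1000)  # tokens/s figures mentioned in the thread


def seconds_to_fill(context_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit `context_tokens` at `tokens_per_second`."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return context_tokens / tokens_per_second


for rate in RATES:
    minutes = seconds_to_fill(CONTEXT_TOKENS, rate) / 60
    print(f"{rate:>5} tok/s: {minutes:5.1f} min to fill {CONTEXT_TOKENS:,} tokens")
```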

taneq an hour ago | parent [-]

2004 standards? O.o

partsch an hour ago | parent [-]

1904

embedding-shape 3 hours ago | parent | prev | next [-]

The ChatGPT subscription gives you access to the -spark model(s) in Codex, which are blazing fast (but pretty dumb) and which I think run on Cerebras hardware too.

kegs_ 3 hours ago | parent | prev [-]

I have a pretty good use case for gpt-oss. The amount of time savings has actually been wild. Definitely worth a try. Just to be clear, it gets like 2000tok/s

order-matters 3 hours ago | parent | prev [-]

the more advanced models also utilize a lot more tokens, and a lot of these extra tokens may go towards safeguards at a higher rate than prior models as well.

not to say a speed boost isn't there, but if they didn't increase tokens/s at all you'd likely see things slow down a lot with the new model compared to the current one

beering an hour ago | parent [-]

I think regular users will still have the old speed, so it should be easy to tell whether it is thinkier than 5.5.