zmmmmm 4 hours ago

What can it actually run? The fact that their benchmark plot refers to Llama 3.1 8B suggests to me that it's hand-implemented for that model and likely can't run newer or larger models. Why else would you benchmark such an outdated model? Show me a benchmark for gpt-oss-120b or something similar.

sanxiyn 4 hours ago | parent | next [-]

Looking at their blog, they in fact ran gpt-oss-120b: https://furiosa.ai/blog/serving-gpt-oss-120b-at-5-8-ms-tpot-...

I think the Llama 3 focus mostly reflects demand. It may be hard to believe, but many people aren't even aware gpt-oss exists.

reactordev 3 hours ago | parent | next [-]

Many are aware, just can’t offload it onto their hardware.

The 8B models are easier to run on an RTX, which makes for a direct comparison with local inference. What llama.cpp does on an RTX 5080 at 40 t/s, Furiosa should do at 40,000 t/s or whatever… it's an easy way to get a flat comparison across all the different hardware llama.cpp runs on.

nl 3 hours ago | parent | prev | next [-]

> we demonstrated running gpt-oss-120b on two RNGD chips [snip] at 5.8 ms per output token

That's ~86 tokens/second/chip.

By comparison, an H100 will do 2,390 tokens/second/GPU [1].

Am I comparing the wrong things somehow?

[1] https://inferencemax.semianalysis.com/
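
Rough arithmetic behind that per-chip figure, as a sketch (only the 5.8 ms TPOT and the two-chip setup come from their blog post):

    # Back-of-the-envelope check of the per-chip figure above.
    TPOT_S = 5.8e-3   # 5.8 ms per output token, as claimed for gpt-oss-120b
    NUM_CHIPS = 2     # the demo ran on two RNGD chips

    total_tok_per_s = 1 / TPOT_S                      # ~172 tokens/second for the pair
    per_chip_tok_per_s = total_tok_per_s / NUM_CHIPS

    print(f"~{per_chip_tok_per_s:.0f} tokens/second/chip")  # ~86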

sanxiyn 2 hours ago | parent | next [-]

I think you are comparing latency with throughput. You can't take the inverse of latency to get throughput because the concurrency is unknown. But then, the RNGD result is probably at concurrency=1.
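
To make that concrete, a minimal sketch (the concurrency values below are illustrative assumptions, not measurements; only the 5.8 ms TPOT comes from the blog post, and in practice TPOT grows as concurrency goes up):

    # Idealized model: aggregate throughput scales with how many requests are
    # decoded concurrently, not just with per-token latency.
    TPOT_S = 5.8e-3  # 5.8 ms per output token (per-request decode latency)

    for concurrency in (1, 8, 32, 128):  # hypothetical values for illustration
        # Upper bound assuming TPOT stays flat as concurrency increases.
        throughput = concurrency / TPOT_S
        print(f"concurrency={concurrency:>3}: ~{throughput:,.0f} tokens/second")

At concurrency=1 the inverse of TPOT caps the two-chip setup at roughly 172 tokens/second, which is why it can't be compared directly against a batched per-GPU throughput number.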

binary132 3 hours ago | parent | prev [-]

I thought they were saying it was more efficient, as in tokens per watt. I didn't see a direct comparison on that metric, but maybe I didn't look hard enough.

nl 2 hours ago | parent [-]

Probably. Companies sell on efficiency when they know they lose on performance.

zmmmmm 3 hours ago | parent | prev [-]

Now I'm interested ...

It still kind of makes the point that you're stuck with a very limited range of models that they hand-implement. But at least it's a model I would actually use. Give me that in a box I can put in a standard data center with a normal power supply and I'm definitely interested.

But I want to know the cost :-)

rjzzleep 3 hours ago | parent | prev [-]

The fact that so many people are focusing solely on massive LLMs is an oversight: it means narrowly focusing on a tiny (but very lucrative) subdomain of AI applications.