windexh8er 3 hours ago

What about 15k tokens per second? [0] I remember looking at this earlier in the year and it was so fast that it felt fake. And, yes, this model is old - but still awesome for what it is.

[0] https://chatjimmy.ai/

Kirby64 2 hours ago | parent | next [-]

It’s not just old, it’s also tiny and quantized. It’s Llama 3.1 8B at 3/6-bit quantization. This is the type of thing you can run on almost any device…

windexh8er an hour ago | parent [-]

I get that, but not at 15k tokens/s.

Kirby64 an hour ago | parent [-]

But it’s irrelevant. 750 tokens/s on a full frontier model is useful. 15,000 poor-quality tokens/s is much less useful, no matter how much scaffolding you put around it.

Legend2440 21 minutes ago | parent | next [-]

You are missing the point. This is a technology demonstration on prototype hardware, and no one intends it to be seriously useful.

Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
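As a rough illustration of why layer-wise splitting need not cost throughput, here is a toy pipeline model in Python. It is only a sketch, not Taalas's actual design: the function name, the per-chip stage times, and the assumption that enough independent sequences are in flight to keep every chip busy are all hypothetical, and inter-chip transfer time is ignored. In that idealized model, steady-state throughput is set by the slowest stage, while per-token latency grows with the number of stages.

```python
def pipeline_stats(stage_times_s):
    """Toy model of a layer-wise pipeline across chips (hypothetical helper).

    stage_times_s: seconds each chip (stage) spends on one token.
    Assumes every stage is kept busy by independent sequences and that
    moving activations between chips costs nothing.
    Returns (steady_state_tokens_per_s, per_token_latency_s).
    """
    if not stage_times_s or any(t <= 0 for t in stage_times_s):
        raise ValueError("stage_times_s must be a non-empty list of positive times")
    # Throughput is limited by the slowest stage, not by the stage count.
    throughput = 1.0 / max(stage_times_s)
    # A single token still has to pass through every stage in turn.
    latency = sum(stage_times_s)
    return throughput, latency


# One chip vs. a 4x larger model spread over four equally fast chips.
one_chip = pipeline_stats([1 / 15_000])
four_chips = pipeline_stats([1 / 15_000] * 4)
print(one_chip, four_chips)
```

In this model the four-chip pipeline keeps the same aggregate tokens/s as the single chip but has four times the per-token latency, and any real interconnect overhead would eat into both.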

Kirby64 14 minutes ago | parent [-]

> They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.

I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver, it’s nearly vaporware in my view.

I’d be happy to be proven wrong, but I think they’re going to run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.

windexh8er an hour ago | parent | prev [-]

I think you missed the point and either don't understand or aren't considering SLM utility.

Kirby64 40 minutes ago | parent [-]

But I’m not missing the point. If you can run one frontier model at 750 tokens/s, then you can probably run many, many instances of an SLM in parallel at a combined rate that exceeds 15k tokens/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than Llama 3.1.

windexh8er 32 minutes ago | parent [-]

Yes, you are missing the point. 1) It's a demo. [0] 2) It hasn't been updated for 4+ months.

You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Pointing out that you can run a SotA model at "fast" speeds, again, severely misses the point.

And no, you can't run anything from Anthropic or OAI on-prem, so until you can, there's really no comparison. If people want to continue down the path of gate-kept models with no other options, then we'll all follow you off the cliff.

[0] https://taalas.com/products/

Kirby64 8 minutes ago | parent [-]

Why are you presenting this as such a binary? For SLMs we don’t need the Taalas stuff at all. Just run it locally on your own device if it’s truly a small model. And there are plenty of larger models that can be run on-premise just fine.

I think it’s impressive that a frontier model can achieve 750 tokens/s. That’s all. You can get similarly insane token speeds from other open-weight models too.

partsch 2 hours ago | parent | prev [-]

They baked the LLM into a chip.