By perf I mean how much it costs to serve a 1T model to 1M users at 50 tokens/sec.

Not all 1T models are equal. E.g. how many active parameters? What's the native quantization? How long is the max context? Also, it's quite likely that some models in common use are smaller still, below 1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much and you can enjoy the lightning-fast speed.
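To put rough numbers on the question, here is a back-of-envelope sketch in Python. It assumes every user is decoding at the same time (a worst case), and it only counts weight memory, ignoring KV cache and context length. The function names and the `device_tps` figure are hypothetical placeholders, not measurements of any real hardware.

```python
import math


def aggregate_tokens_per_sec(users: int, tokens_per_sec_per_user: float) -> float:
    """Total decode throughput if all users generate simultaneously (worst case)."""
    return users * tokens_per_sec_per_user


def weight_bytes(total_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights at a given quantization."""
    return total_params * bits_per_param / 8


def devices_needed(aggregate_tps: float, device_tps: float) -> int:
    """Devices required if one device sustains device_tps tokens/sec.

    device_tps is an assumption you must supply per hardware and per model.
    """
    return math.ceil(aggregate_tps / device_tps)


if __name__ == "__main__":
    total_tps = aggregate_tokens_per_sec(1_000_000, 50)
    print(f"aggregate throughput: {total_tps:,.0f} tokens/sec")
    for bits in (16, 8, 4):
        terabytes = weight_bytes(1e12, bits) / 1e12
        print(f"1T params at {bits}-bit: {terabytes:.1f} TB of weights")
    # Placeholder per-device throughput, purely illustrative.
    print("devices:", devices_needed(total_tps, device_tps=10_000))
```

The point of the sketch is that the answer swings with the inputs: halving the bits per parameter halves the weight footprint, and the device count scales inversely with whatever per-device throughput a given model actually achieves on a given chip.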

p1esk 4 hours ago | parent | next [-]

Just pick some reasonable values. Also, keep in mind that this hardware must still be useful 3 years from now. What’s going to happen to cerebras in 3 years? What about nvidia? Which one is a safer bet?

On the other hand, competition is good - nvidia can’t have the whole pie forever.

zozbot234 4 hours ago | parent [-]

> Just pick some reasonable values.

And that's the point - what's "reasonable" depends on the hardware and is far from fixed. Some users here are saying that this model is "blazing fast" but a bit weaker than expected, and one might've guessed as much.

> On the other hand, competition is good - nvidia can’t have the whole pie forever.

Sure, but arguably the closest thing to competition for nVidia is TPUs and future custom ASICs that will likely save a lot on energy used per model inference, while not focusing all that much on being super fast.

latchkey 3 hours ago | parent [-]

AMD
