aurareturn 5 hours ago
It uses 10 chips for an 8B model, so at the same density an 80B model would need roughly 100 chips. Each chip is the size of an H100, so that's on the order of 100 H100-sized chips to run at this speed. And you can't change the model after the chips are manufactured, since the weights are etched into the silicon.
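
Quick back-of-envelope for that claim, assuming chip count scales linearly with parameter count (the 10-chips-for-8B figure is from this thread, not something the vendor has confirmed):

    # Hypothetical linear scaling: chips grow in proportion to parameters.
    CHIPS_FOR_8B = 10  # chip count claimed above for the 8B model (unconfirmed)

    def chips_needed(params_billions: float) -> float:
        # Scale linearly from the 8B baseline.
        return CHIPS_FOR_8B * (params_billions / 8.0)

    print(chips_needed(80))  # -> 100.0 chips for an 80B model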

9cb14c1ec0 4 hours ago
As many others in this thread have asked, can we get some sources for the idea that the model is spread across chips? You keep making the claim, but no one else (myself included) has any idea where that information comes from or whether it is correct.

grzracz 5 hours ago
I'm sure there is plenty of optimization paths left for them if they're a startup. And imho smaller models will keep getting better. And a great business model for people having to buy your chips for each new LLM release :) | ||||||||

ubercore 5 hours ago
Do we know that it needs 10 chips to run the model? Or are the servers for the API and chatbot just specced with 10 boards to distribute user load? | ||||||||