vineethy 20 hours ago

I think it's important to note that there's nothing forbidding LPU-style determinism from being used in training. They just didn't make that choice.

Also, Tenstorrent could be a viable challenger in this space. It seems to me that their NoC and their chips could be mostly deterministic as long as you don't start adding branches.

ossa-ma 20 hours ago | parent | next [-]

You're right, but my understanding is that Groq's LPU architecture makes it inference-only in practice.

For one, Groq's chips only have 230 MB of SRAM per chip vs 80 GB on an H100, and training is memory-hungry: you need to hold model weights + gradients + optimizer states + intermediate activations.
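For a rough sense of scale, here's a back-of-envelope sketch of the usual mixed-precision Adam memory accounting (the per-parameter byte counts are the common convention, not anything Groq-specific, and activations are excluded since they depend on batch size and sequence length):

```python
def training_bytes(n_params: int) -> int:
    """Rough bytes needed to train n_params with Adam in mixed precision."""
    weights_fp16 = 2 * n_params   # fp16 model weights
    grads_fp16 = 2 * n_params     # fp16 gradients
    master_fp32 = 4 * n_params    # fp32 master copy of weights
    adam_m_fp32 = 4 * n_params    # Adam first-moment state
    adam_v_fp32 = 4 * n_params    # Adam second-moment state
    return weights_fp16 + grads_fp16 + master_fp32 + adam_m_fp32 + adam_v_fp32

# ~16 bytes/param: a 7B-parameter model needs ~112 GB before activations,
# which dwarfs 230 MB of on-chip SRAM.
print(training_bytes(7_000_000_000) / 1e9)  # ~112.0
```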

refibrillator 19 hours ago | parent [-]

H100 has 80 GB of HBM3. There’s only like 37 MB of SRAM on a single chip.

bionhoward 20 hours ago | parent | prev [-]

Would SRAM make weight updates prohibitive vs DRAM?