embedding-shape 6 hours ago
I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
zozbot234 6 hours ago
Prompt processing/prefill could most likely even get some speedup from local NPU use: when you're ultimately limited by thermal/power throttling, having more efficient compute available means more headroom.
Barathkanna 5 hours ago
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41 s

• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55 s

These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
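For anyone who wants to play with the arithmetic, a minimal sketch of the estimate (the throughput ranges are the rough figures quoted above, not measured benchmarks):

    # Back-of-envelope prefill time: prompt_tokens / prefill_throughput.
    # Throughput ranges below are assumed, per the rough estimates above.
    PROMPT_TOKENS = 8_192

    setups = {
        "16x H100":               (20_000, 80_000),  # tokens/sec (assumed)
        "2x Mac Studio (M3 Max)": (150, 700),        # tokens/sec (assumed)
    }

    for name, (low, high) in setups.items():
        best = PROMPT_TOKENS / high   # fastest case: highest throughput
        worst = PROMPT_TOKENS / low   # slowest case: lowest throughput
        print(f"{name}: {best:.2f}s to {worst:.2f}s")

Running it reproduces the ~0.10 to 0.41 s and ~12 to 55 s ranges above; everything hinges on the assumed tokens/sec, so real benchmarks could easily shift the ratio.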