embedding-shape 6 hours ago
I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
zozbot234 6 hours ago
Prompt processing/prefill could most likely even get some speedup from local NPU use: when you're ultimately limited by thermal/power throttling, having more efficient compute available means more headroom.
Barathkanna 5 hours ago
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41 s

• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55 s

These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
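For anyone who wants to play with the arithmetic, a minimal sketch of the estimate (the throughput ranges are the rough figures quoted above, not measured benchmarks):

    # Back-of-envelope prefill time: prompt_tokens / prefill_throughput.
    # Throughput ranges below are assumed, per the rough estimates above.
    PROMPT_TOKENS = 8_192

    setups = {
        "16x H100":               (20_000, 80_000),  # tokens/sec (assumed)
        "2x Mac Studio (M3 Max)": (150, 700),        # tokens/sec (assumed)
    }

    for name, (low, high) in setups.items():
        best = PROMPT_TOKENS / high   # fastest case: highest throughput
        worst = PROMPT_TOKENS / low   # slowest case: lowest throughput
        print(f"{name}: {best:.2f}s to {worst:.2f}s")

Running it reproduces the ~0.10 to 0.41 s and ~12 to 55 s ranges above; everything hinges on the assumed tokens/sec, so real benchmarks could easily shift the ratio.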