colordrops 4 hours ago

I assume they didn't fix the memory bandwidth pain point though.

llm_nerd 4 hours ago | parent | next [-]

The memory bandwidth limitation is baked into the GB10, and every vendor is going to be very similar there.

I'm really curious to see how things shift when the M5 Ultra, with "tensor" matmul functionality in the GPU cores, rolls out. That should be a multiple-fold speedup for that platform.

storus 4 hours ago | parent [-]

My guess is the M5 Ultra will be like a DGX Spark for token prefill and an M3 Ultra for token generation, i.e. the best of both worlds, at FP4. Right now you can combine a Spark with an M3U, the former handling the compute-heavy prefill and lowering TTFT, the latter doing the token generation; with the M5U that should no longer be necessary. However, given the current RAM price situation, I wonder whether the M5U will ever get close to the price/performance of the Spark + M3U combo we have right now.
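The intuition behind splitting prefill and decode across two boxes can be sketched with a back-of-envelope arithmetic-intensity calculation (all numbers below are hypothetical, not from the thread): a forward pass streams the weights once, so prefill amortizes that cost over the whole prompt while decode pays it per token.

```python
# Rough sketch of why prefill is compute-bound and decode is
# bandwidth-bound. Hypothetical model size; FP4 weights = 0.5 byte/param.

PARAMS = 120e9                  # resident parameters (hypothetical)
BYTES_PER_PARAM = 0.5           # FP4: 4 bits per weight
FLOPS_PER_TOKEN = 2 * PARAMS    # ~2 FLOPs per parameter per token

def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte of weights read in one forward pass."""
    flops = FLOPS_PER_TOKEN * tokens_per_pass
    bytes_read = PARAMS * BYTES_PER_PARAM  # weights streamed once per pass
    return flops / bytes_read

prefill = arithmetic_intensity(2048)  # long prompt: big batched matmuls
decode = arithmetic_intensity(1)      # generation: one token at a time

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

With these numbers prefill does thousands of FLOPs per byte moved (so a compute-strong box like the Spark helps TTFT), while decode does only a handful (so the M3U's memory bandwidth dominates tokens/sec).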

echion 23 minutes ago | parent [-]

> you can combine Spark with M3U, the former streaming the compute, lowering TTFT, the latter doing the token generation part

Are you doing this with vLLM, or some other model-running library/setup?

coder543 20 minutes ago | parent [-]

They're probably referencing this article: https://blog.exolabs.net/nvidia-dgx-spark/

cat_plus_plus 2 hours ago | parent | prev [-]

At least for transformers, this can be partly worked around with MoE + NVFP4: the active working set per token stays small even though the resident model size is large.
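The MoE point can be made concrete with some rough numbers (hypothetical, not from the thread): only the routed experts' weights have to cross the memory bus per token, while the full model merely has to fit in RAM, and NVFP4 halves the bytes again versus FP8.

```python
# Sketch of how MoE + 4-bit weights shrink the per-token working set
# relative to the resident model size. All figures are hypothetical.

TOTAL_PARAMS = 120e9   # all experts resident in memory
ACTIVE_PARAMS = 5e9    # params actually routed to per token
FP4_BYTES = 0.5        # NVFP4-style 4-bit weights

resident_gb = TOTAL_PARAMS * FP4_BYTES / 1e9    # what RAM must hold
per_token_gb = ACTIVE_PARAMS * FP4_BYTES / 1e9  # what bandwidth moves per token

print(f"resident: {resident_gb:.0f} GB, streamed per token: {per_token_gb:.1f} GB")
```

So a dense 120B model at FP4 would stream ~60 GB per generated token, but this MoE configuration streams ~2.5 GB, which is why token generation can stay usable even on bandwidth-limited hardware like the GB10.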