ekropotin 17 hours ago:
I’m not very well versed in this domain, but I think it’s not going to be “VRAM” (GDDR) memory, but rather “unified memory”, which is essentially RAM (some flavour of DDR5, I assume). These two types of memory have vastly different bandwidth. I’d be curious to see any benchmarks of inference on VRAM vs UM.
banana_giraffe 13 hours ago:
A quick benchmark using float32 copies (torch cuda->cuda copies), comparing some random machines:
This is an "eh, it works" benchmark, but it should give you a feel for the relative performance of the different systems. In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
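The commenter's actual torch cuda->cuda numbers aren't reproduced here, but the idea is simple: time large float32 buffer copies and convert to GB/s. A minimal host-memory sketch of the same idea using NumPy (function name `copy_bandwidth_gbs` is my own, not from the thread):

```python
import time
import numpy as np

def copy_bandwidth_gbs(n_bytes=256 * 1024 * 1024, repeats=5):
    """Estimate memory-copy bandwidth (GB/s) via repeated float32 buffer copies."""
    src = np.ones(n_bytes // 4, dtype=np.float32)
    dst = np.empty_like(src)
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)
        dt = time.perf_counter() - t0
        # A copy reads the source and writes the destination: 2 * n_bytes moved.
        best = max(best, 2 * n_bytes / dt / 1e9)
    return best

print(f"{copy_bandwidth_gbs():.1f} GB/s")
```

On a GPU you'd do the same thing with `torch.Tensor.copy_` between two CUDA tensors (with `torch.cuda.synchronize()` around the timing), which measures device-memory bandwidth instead of host RAM.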
oofbey 17 hours ago:
I’m using VRAM as shorthand for “memory which the AI chip can use”, which I think is fairly common shorthand these days. On the Spark it is unified, and it has lower bandwidth than almost any modern GPU (about 300 GB/s, comparable to an RTX 3060). So LLM inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
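The bandwidth-bound argument can be made concrete: during decode, every generated token has to stream the active weights from memory once, so tokens/sec is roughly bandwidth divided by bytes read per token. A hedged back-of-the-envelope sketch (the ~5 B active-parameter figure assumes gpt-oss-120b's published MoE spec; the exact Spark bandwidth and quantization overheads are approximations):

```python
def decode_tokens_per_sec(bandwidth_gbs, active_params_billions, bytes_per_param):
    """Rough upper bound on decode speed for a memory-bandwidth-bound LLM:
    each token requires streaming the active weights from memory once."""
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# ~273 GB/s unified memory, ~5.1 B active params at Q8 (~1 byte/param).
print(round(decode_tokens_per_sec(273, 5.1, 1.0), 1))  # roughly mid-50s tok/s
```

That lands in the same ballpark as the ~55 tokens/sec reported above, which is why the MoE model's small active-parameter count matters more than its 120 B total size.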