The largest nodes in his cluster each have 512GB RAM. DeepSeek V3.1 is a 671B parameter model whose weights take up 700GB RAM: https://huggingface.co/deepseek-ai/DeepSeek-V3.1

I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).

With no constraint on RAM (4 nodes) the inference speed is less than 50% faster than with only 512GB.

Am I missing something?

▲

elorant 16 hours ago | parent | next [-]

You only get 80Gbps network bandwidth. There's your bottleneck right there. Infiniband in comparison can give you up to x10 times that.

▲

zozbot234 16 hours ago | parent | prev | next [-]

Weights are read-only data so they can just be memory mapped and reside on SSD (only a small fraction will be needed in VRAM at any given time), the real constraint is activations. MoE architecture should help quite a bit here.

▲

Dylan16807 15 hours ago | parent | next [-]

You need all the weights every token, so even with optimal splitting the fraction of the weights you can farm out to an SSD is proportional to how fast your SSD is compared to your RAM.

You'd need to be in a weirdly compute-limited situation before you can replace significant amounts of RAM with SSD, unless I'm missing something big.

> MoE architecture should help quite a bit here.

In that you're actually using a smaller model and swapping between them less frequently, sure.

▲

rahimnathwani 15 hours ago | parent | prev | next [-]

Even with MoE you still need enough memory to load all experts. For each token, only 8 experts (out of 256) are activated, but which experts are chosen changes dynamically based on the input. This means you'll be constantly loading and unloading experts from disk.

MoEs is great for distributed deployments, because you can maintain a distribution of experts that matches your workload, and you can try to saturate each expert and thereby saturate each node.

▲

zozbot234 15 hours ago | parent [-]

Loading and unloading data from disk is highly preferable to sending the same amount of data over a bottlenecked Thunderbolt 5 connection.

	▲	rahimnathwani 15 hours ago \| parent [-]
		No it's not. With a cluster of two 512GB nodes, you have to send half the weights (350GB) over a TB5 connection. But you have to do this exactly once on startup. With a single 512GB node, you'll be loading weights from disk each time you need a different expert, potentially for each token. Depending on how many experts you're loading, you might be loading 2GB to 20GB from disk each time. Unless you're going to shut down your computer after generating a couple of hundred tokens, the cluster wins.

▲

hu3 16 hours ago | parent | prev [-]

> only a small fraction will be needed in VRAM at any given time

I don't think that's true. At least not without heavy performance loss in which case "just be memory mapped" is doing a lot of work here.

By that logic GPUs could run models much larger than their VRAM would otherwise allow, which doesn't seem to be the case unless heavy quantization is involved.

	▲	zozbot234 15 hours ago \| parent [-]
		Existing GPU API's are sadly not conducive to this kind of memory mapping with automated swap-in. The closest thing you get AIUI is "sparse" allocations in VRAM, such that only a small fraction of your "virtual address space" equivalent is mapped to real data, and the mapping can be dynamic.

▲

zeusk 16 hours ago | parent | prev [-]

the TB5 link (RDMA) is much slower than direct access to system memory