rahimnathwani 15 hours ago

Even with MoE you still need enough memory to load all experts. For each token, only 8 experts (out of 256) are activated, but which experts are chosen changes dynamically based on the input. This means you'll be constantly loading and unloading experts from disk.
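The per-token routing described above can be sketched with a toy top-k gate. This is a minimal illustration, not any particular model's router: the linear scoring layer, dimensions, and random inputs are all assumptions; only the 8-of-256 numbers come from the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 256   # total experts in the MoE layer (from the comment)
TOP_K = 8           # experts activated per token (from the comment)
D_MODEL = 64        # hypothetical embedding width

# Hypothetical router: a single linear layer scoring every expert per token.
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))

def route(token_embedding):
    """Return the set of top-k expert indices chosen for one token."""
    scores = token_embedding @ router_w
    return set(np.argsort(scores)[-TOP_K:].tolist())

# Two different tokens generally select different experts, which is why
# you can't keep just a fixed 8-expert subset resident in memory.
a = route(rng.standard_normal(D_MODEL))
b = route(rng.standard_normal(D_MODEL))
print(len(a), len(b), len(a & b))
```

Because the chosen subset shifts token by token, a machine that can't hold all 256 experts ends up paging experts in and out on the fly.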

MoE is great for distributed deployments, because you can maintain a distribution of experts that matches your workload, and you can try to saturate each expert and thereby saturate each node.

zozbot234 15 hours ago

Loading and unloading data from disk is highly preferable to sending the same amount of data over a bottlenecked Thunderbolt 5 connection.

rahimnathwani 15 hours ago

No it's not.

With a cluster of two 512GB nodes, you have to send half the weights (350GB) over a TB5 connection. But you have to do this exactly once on startup.

With a single 512GB node, you'll be loading weights from disk each time the router picks experts that aren't resident, potentially for every token. Depending on how many experts change between tokens, that could mean reading 2GB to 20GB from disk each time.
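The break-even point can be sketched with back-of-envelope arithmetic. The 350GB one-time transfer and the 2GB-per-token disk read come from the comments above; the bandwidth figures (roughly 10 GB/s usable over TB5, roughly 5 GB/s sequential from a fast NVMe SSD) are my assumptions, not from the thread.

```python
# Back-of-envelope: one-time TB5 transfer vs per-token disk paging.
WEIGHTS_GB = 700                     # total model weights (2 x 350GB halves)
ONE_TIME_TB5_GB = WEIGHTS_GB / 2     # half the weights, sent once at startup
TB5_GBPS = 10                        # assumed usable Thunderbolt 5 bandwidth
SSD_GBPS = 5                         # assumed NVMe sequential read bandwidth
PER_TOKEN_DISK_GB = 2                # low end of the 2GB-20GB estimate

startup_cost_s = ONE_TIME_TB5_GB / TB5_GBPS       # paid exactly once
per_token_disk_s = PER_TOKEN_DISK_GB / SSD_GBPS   # paid on every token

# Number of tokens after which the one-time TB5 transfer is cheaper than
# repeatedly paging experts in from disk:
break_even_tokens = startup_cost_s / per_token_disk_s
print(f"startup: {startup_cost_s:.0f}s, per token: {per_token_disk_s:.1f}s, "
      f"break-even after ~{break_even_tokens:.0f} tokens")
```

Even at the charitable 2GB-per-token end, the cluster's one-time cost is amortized within roughly a hundred tokens; at 20GB per token the break-even arrives ten times sooner.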

Unless you're going to shut down your computer after generating a couple of hundred tokens, the cluster wins.