| ▲ | cpldcpu 2 hours ago | |
I wonder how well this works with MoE architectures? For dense LLMs like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware. With MoE, it is more like a memory lookup: instead of a 1:1 pairing of MACs to stored weights, you are suddenly forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain from using a highly optimized memory process for the memory instead of mask ROM. At that point we are back to a chiplet approach...
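To make the mismatch concrete, here is a toy NumPy sketch (my own made-up dimensions, nothing from the article): a dense layer keeps every stored weight busy on every token, while an MoE layer stores many experts but only pulls the top-k of them per token, so most of the silicon budget becomes storage rather than MACs.

    # Toy sketch: compute/storage balance of a dense layer vs. an MoE layer.
    # Dimensions are illustrative, not from any real model or chip.
    import numpy as np

    d_model, d_ff, n_experts, top_k = 512, 2048, 16, 2

    # Dense layer: every stored weight participates in every token's MACs,
    # so a 1:1 pairing of stored weights to MAC hardware is fully utilized.
    W_dense = np.random.randn(d_model, d_ff).astype(np.float32)

    # MoE layer: 16x the stored weights, but each token only touches top_k experts.
    W_experts = np.random.randn(n_experts, d_model, d_ff).astype(np.float32)

    def moe_layer(x, router_logits):
        chosen = np.argsort(router_logits)[-top_k:]   # router picks experts
        fetched = W_experts[chosen]                   # the "memory lookup" part
        # simplified: a real router also weights the expert outputs by its
        # softmax probabilities; here we just sum them
        return sum(x @ W for W in fetched)            # a small amount of MACs

    x = np.random.randn(d_model).astype(np.float32)
    y = moe_layer(x, np.random.randn(n_experts))

    touched = top_k * d_model * d_ff
    print(f"dense: stored {W_dense.size:,}, touched per token {W_dense.size:,} (1:1)")
    print(f"MoE:   stored {W_experts.size:,}, touched per token {touched:,} "
          f"({W_experts.size // touched}:1)")

With these numbers the MoE layer touches only 1 in 8 of its stored weights per token, which is why pairing each weight with dedicated MAC hardware (as in the mask-ROM approach) stops paying off as expert count grows.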
| ▲ | brainless 15 minutes ago | parent | next [-] | |
If each of the expert models were etched in silicon, it would still give a massive speed boost, wouldn't it? I feel printing the ASIC is the main blocker here.
| ▲ | pests an hour ago | parent | prev [-] | |
For comparison, I wanted to write about how Google handles MoE architectures with its TPUv4. It uses Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models. The 3D torus connects 64-chip cubes, with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses. Of course this is a DC-level system, not something on a chip for your PC, but I just want to express the scale here.
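If it helps picture the cube part, here is a minimal sketch (my own, assuming a 4x4x4 arrangement, which is what 64 chips with 6 links each suggests) of how a 3D torus wires up neighbors with wraparound links:

    # Neighbor links in a 3D torus: each chip connects +/-1 along each of the
    # three axes, wrapping around at the edges, giving exactly 6 neighbors.
    from itertools import product

    DIMS = (4, 4, 4)   # 4 * 4 * 4 = 64 chips per cube (assumed layout)

    def torus_neighbors(coord, dims=DIMS):
        nbrs = []
        for axis, size in enumerate(dims):
            for step in (-1, 1):
                n = list(coord)
                n[axis] = (n[axis] + step) % size   # wraparound link
                nbrs.append(tuple(n))
        return nbrs

    chips = list(product(*(range(d) for d in DIMS)))
    assert len(chips) == 64
    assert all(len(set(torus_neighbors(c))) == 6 for c in chips)
    print(torus_neighbors((0, 0, 0)))   # the 6 links of the corner chip

The OCS layer then sits above this: it rewires how the cubes themselves are stitched together, which is what lets the pod topology adapt to a given MoE model's communication pattern.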