I tried a few things and checked CPU usage in Task Manager to see how much work the CPU is doing.
KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.
KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.
KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.
I believe you that it doesn't make sense to do it this way and that it is slower, but the CPU doesn't appear to be doing much of anything.
You say gigabytes of weights PER TOKEN. Is that true? I think an expert is about 2 GB, so loading a new expert costs 2 GB, sure, but I might already have all the experts for the token in memory, no?
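
To make the question concrete, here is a rough sketch of the arithmetic as I understand it. It separates two things: weights being resident in memory, and weights being read from that memory to compute one token. The 2 GB expert size is my estimate from above; the number of active experts per token and the memory bandwidth are placeholder assumptions, not measured values, and the function names are made up for illustration.

```python
def weights_read_per_token_gb(expert_size_gb: float, active_experts: int) -> float:
    """Weight data the processor has to read to compute one token.

    This counts reads from memory that already holds the experts,
    not loads from disk. Only the experts routed to for this token count.
    """
    return expert_size_gb * active_experts


def min_seconds_per_token(read_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on time per token if memory bandwidth is the only limit."""
    return read_gb / bandwidth_gb_per_s


# Assumed values for illustration only.
EXPERT_SIZE_GB = 2.0        # my rough estimate of one expert
ACTIVE_EXPERTS = 2          # hypothetical: depends on the model's routing
BANDWIDTH_GB_PER_S = 50.0   # hypothetical: depends on the hardware

read_gb = weights_read_per_token_gb(EXPERT_SIZE_GB, ACTIVE_EXPERTS)
seconds = min_seconds_per_token(read_gb, BANDWIDTH_GB_PER_S)
print(f"read per token: {read_gb} GB, at least {seconds:.3f} s per token")
```

If this is the right way to look at it, "gigabytes per token" would be about memory reads rather than loading new experts, so having the experts already in memory would not make that cost go away.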