How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?

Or maybe the model itself only runs at gpus, and the cpu memory only store the weights for experts not corrently activated? If so, then what's the 32 or 64 cpu cores for?

I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software

▲

xrd 38 minutes ago | parent | next [-]

This is a good place to start reading about dual gpus.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

	▲	nextaccountic 11 minutes ago \| parent [-]
		But in this case he used a cpu too

▲

nodja an hour ago | parent | prev | next [-]

Pipeline parallelism. Instead of splitting layers by row/column. You split at the layer edges. So instead of having this huge bottleneck of bandwidth you only need to transfer about 4KB per token when changing devices on a model like Qwen 3 30BA3.

▲

segmondy 2 hours ago | parent | prev [-]

checkout llama.cpp, the entire point of the project is for us mere mortals and GPU poor.