EnPissant 4 days ago
FWIW, that's an 80GB model, and you also need room for the KV cache. You'd need roughly 96GB to run it entirely on the GPU.
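For a rough sense of where the extra memory goes, here is a back-of-the-envelope sketch of the usual KV cache size formula. The layer count, KV head count, head dimension, and context length below are hypothetical placeholders, not the actual configuration of the model being discussed, so plug in the real values from the model's config before trusting the number.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes for a standard attention stack.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    assumes an fp16/bf16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


def total_gib(weights_gib, kv_bytes):
    """Weights plus KV cache, ignoring activations and runtime overhead."""
    return weights_gib + kv_bytes / GIB


if __name__ == "__main__":
    # Hypothetical architecture values, chosen only for illustration.
    kv = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=64,
                        context_len=131072)
    print(f"KV cache: {kv / GIB:.1f} GiB")
    print(f"Weights + KV cache: {total_gib(80, kv):.1f} GiB")
```

Activations, scratch buffers, and framework overhead come on top of this, which is why a figure somewhat above weights plus cache is a reasonable target.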
furyofantares 4 days ago
Do you know if it's doing what was described earlier when I run it with all layers on the GPU: paging an expert in every time the expert changes? Each expert is only 5.1B parameters.