▲ | NitpickLawyer 5 days ago
Correct. You want all the weights loaded in memory, but on each forward pass only some of the experts get activated, so there is less computation than in a dense model of the same size. That being said, there are libraries that can load a model layer by layer (say, from an SSD) and technically perform inference with ~8 GB of RAM, but it'd be really, really slow.
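To make the "everything loaded, only some experts activated" point concrete, here is a rough toy sketch of top-k expert routing in plain Python. It is not how any particular model or library implements it: the class name `ToyMoELayer`, the sizes, the random weights, and the softmax-over-the-chosen-experts gating are all assumptions made up for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyMoELayer:
    """Hypothetical mixture-of-experts layer: all experts stay in memory,
    but only the top_k highest-scoring ones run for a given input."""

    def __init__(self, num_experts, dim, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Every expert's weights are allocated up front (the memory cost).
        self.experts = [
            [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # The router produces one score per expert.
        self.router = [
            [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)
        ]
        self.expert_calls = 0  # counts how many experts actually computed

    def forward(self, x):
        scores = matvec(self.router, x)
        # Pick the indices of the top_k experts by router score.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = chosen[: self.top_k]
        weights = softmax([scores[i] for i in chosen])
        out = [0.0] * len(x)
        for w, i in zip(weights, chosen):
            self.expert_calls += 1  # only the chosen experts do any work
            y = matvec(self.experts[i], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out, chosen


layer = ToyMoELayer(num_experts=8, dim=4, top_k=2)
output, chosen = layer.forward([1.0, 0.5, -0.5, 2.0])
print("experts held in memory:", len(layer.experts))
print("experts that ran this pass:", layer.expert_calls)
```

With these made-up sizes, all 8 experts sit in memory but only 2 of them do a matrix multiply per input, which is where the compute saving over a dense layer comes from.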
▲ | theanonymousone 5 days ago
Can you name one of those libraries, please? Is that distributed llama or something else?