bertili · 9 hours ago
Very impressive! I wonder if there is a similar path for Linux using system memory instead of an SSD? Hell, maybe there's even a case for the return of some kind of ROMs for weights?
daemonologist · 7 hours ago
Most definitely - the popular engines have extensive support for this, including control over exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this one): https://docs.sglang.io/advanced_features/server_arguments.ht...). Even with a MoE model, which only has to move a relatively small portion of the weights per token, you still end up quite bandwidth-constrained, though.
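For example, in llama.cpp the usual pattern is to push all layers to the GPU and then pin just the MoE expert tensors back to system RAM with a tensor-placement override. A sketch, not a tested command line - the model filename and the exact tensor-name regex are illustrative and vary by model:

```shell
# Offload all layers to the GPU (-ngl 99), then override placement of the
# expert FFN tensors so they stay in system RAM. The -ot/--override-tensor
# flag takes a "regex=backend" pair; the regex here is illustrative and
# the model filename is hypothetical.
llama-server -m some-moe-model-q4_k_m.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```

vLLM takes a different approach with a size-based knob (`--cpu-offload-gb`) rather than per-tensor placement.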
zozbot234 · 8 hours ago
Loading experts into system memory is supported by most local-AI frameworks. But you don't gain much by running that part of decode on the GPU, since decode is not compute-limited and the CPU-GPU transfer adds its own overhead. It's best to use the GPU to speed up the shared part of the model.
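A back-of-envelope sketch of why this split pays off. Decode is roughly memory-bandwidth-bound, so time per token is about bytes-read-per-token divided by bandwidth for each part of the model; moving the dense shared portion to fast GPU memory shrinks its term while the expert term stays the same. All figures below are illustrative assumptions, not measurements:

```python
# Rough decode-time model: time ≈ bytes read per token / memory bandwidth.
# All numbers are illustrative assumptions, not benchmarks.
GB = 1e9

shared_bytes = 2e9 * 0.55   # ~2B shared params at ~4.4 bits/param (Q4-ish)
expert_bytes = 3e9 * 0.55   # ~3B *active* expert params per token

gpu_bw = 900 * GB           # e.g. a 3090-class GPU
cpu_bw = 60 * GB            # e.g. dual-channel DDR5

# Everything in system RAM vs. shared part on the GPU:
t_all_cpu = (shared_bytes + expert_bytes) / cpu_bw
t_split   = shared_bytes / gpu_bw + expert_bytes / cpu_bw

print(f"all-CPU : {1 / t_all_cpu:6.1f} tok/s")
print(f"split   : {1 / t_split:6.1f} tok/s")
```

The expert term dominates either way, which is why the win from the GPU is real but bounded - and why CPU bandwidth ends up being the constraint the other comments mention.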
Aurornis · 7 hours ago
Using system memory and CPU compute for the layers that don't fit into GPU memory is already supported by common tools. It's workable for mixture-of-experts models, but performance falls off a cliff as soon as the model overflows out of the GPU and into system RAM. There is another performance cliff when the model has to be fetched from disk on every pass.
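The cliffs are roughly proportional to the bandwidth of whatever tier the active weights spill into. A sketch with ballpark bandwidth figures (assumptions, not measurements):

```python
# Decode is memory-bandwidth bound: tokens/sec ≈ bandwidth / bytes read
# per token. Bandwidth figures are ballpark assumptions for each tier.
GB = 1e9
active_bytes = 4e9          # e.g. ~7B active params at ~4-bit ≈ 4 GB/token

tiers = {
    "GPU VRAM (~1000 GB/s)": 1000 * GB,
    "system RAM (~60 GB/s)":   60 * GB,
    "NVMe SSD (~5 GB/s)":       5 * GB,
}
for name, bw in tiers.items():
    print(f"{name:24s} ~{bw / active_bytes:7.2f} tok/s")
```

Each tier is an order of magnitude or more slower than the one above it, which is exactly the cliff shape described above.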
K0balt · 8 hours ago
My thoughts exactly. Something like this could make it so that modest GPU capacity, like a pair of 3090s, plus lots of RAM makes big inference more practical for personal labs.