dev_l1x_be (2 hours ago):
How do you split the model between multiple GPUs?
evilduck (2 hours ago), replying:
With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000. But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo... | ||