dev_l1x_be (2 hours ago):
How do you split the model between multiple GPUs?
evilduck (2 hours ago), replying:
With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000. But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo... | ||