daemonologist 7 hours ago
Most definitely - the popular engines have extensive support for doing this, including controlling exactly which weights end up where (llama.cpp: https://github.com/ggml-org/llama.cpp/blob/master/tools/cli/... , vllm: https://docs.vllm.ai/en/stable/configuration/engine_args/#of... , sglang (haven't tried this one): https://docs.sglang.io/advanced_features/server_arguments.ht...). Even with an MoE model, which only needs to move a relatively small portion of the weights per token, you do end up quite bandwidth-constrained though.
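For concreteness, a sketch of what the offload flags look like in practice (model names/paths are placeholders, and exact flag behavior may differ by version - check the docs linked above):

```shell
# llama.cpp: put all layers on GPU, then override the MoE expert
# tensors back to CPU - attention/shared weights stay on the GPU,
# only the (sparsely activated) experts stream over the bus.
llama-server -m model.gguf \
  -ngl 99 \
  --override-tensor "exps=CPU"

# vllm: simpler knob - spill a fixed amount of weights to CPU RAM.
vllm serve some-org/some-moe-model --cpu-offload-gb 16
```

The llama.cpp `--override-tensor` form (a regex matched against tensor names, mapped to a device) is what gives the fine-grained "exactly which weights end up where" control; vllm's `--cpu-offload-gb` is coarser but easier to reason about.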