llmtosser 7 days ago
Distractions like this are probably the reason they still, over a year later, do not support sharded GGUF. https://github.com/ollama/ollama/issues/5245 If any of the major inference engines - vLLM, SGLang, llama.cpp - incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, it would remove the need for ollama.
jychang 7 days ago | parent
That’s just llama-swap and llama.cpp | ||||||||
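To illustrate: llama-swap sits in front of llama.cpp's llama-server and covers most of that wishlist. A minimal config sketch (model names and file paths are placeholders, not from the thread):

    # config.yaml for llama-swap (sketch)
    models:
      "qwen-72b":
        # llama-swap launches this command on demand when a request names the model
        cmd: llama-server --port ${PORT} -m /models/qwen-72b.gguf -ngl 40
        ttl: 300   # unload after 300 seconds idle
      "llama-8b":
        cmd: llama-server --port ${PORT} -m /models/llama-8b.gguf -ngl 99
        ttl: 300

Clients hit llama-swap's OpenAI-compatible endpoint, and the "model" field in the request triggers the swap, which gives you API-driven switching; ttl handles idle unload. The CPU offload part is only partly covered, though: llama.cpp's -ngl flag is a manual cap on GPU layers, not automatic OOM avoidance.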