llmtosser 7 days ago

Distractions like this are probably the reason they still, over a year later, do not support sharded GGUF.

https://github.com/ollama/ollama/issues/5245

If any of the major inference engines - vLLM, SGLang, llama.cpp - incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, there would be no need for Ollama.
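
To illustrate what I mean by API-driven switching: from the client side it would just be the model field of an ordinary OpenAI-style request, with the server loading the matching model on demand and unloading it again after some idle timeout. A rough Python sketch - the localhost URL and model names are placeholders, not anything these engines actually ship:

    import requests

    # Hypothetical local OpenAI-compatible endpoint (e.g. a proxy in front
    # of an inference engine); URL and model names are made up for the example.
    BASE_URL = "http://localhost:8080/v1"

    def chat(model: str, prompt: str) -> str:
        # The "model" field is the whole switching mechanism: the server is
        # expected to load (or swap in) the requested model on demand and
        # unload it again after sitting idle.
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    print(chat("qwen2.5-7b", "Summarize sharded GGUF in one sentence."))
    print(chat("llama-3.1-8b", "Same question, different model."))  # triggers a swap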

jychang 7 days ago | parent

That’s just llama-swap and llama.cpp

llmtosser 7 days ago | parent

Interesting - it does indeed seem like llama-server has the endpoints needed for model swapping, and llama.cpp also recently added a flag for dynamic CPU offload.

However, the approach to model swapping is not 'Ollama compatible', which means the OSS tools that support 'Ollama' (e.g. Open WebUI, OpenHands, Bolt.diy, n8n, Flowise, browser-use, etc.) can't take advantage of this particularly useful capability, as best I can tell.
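
Roughly, the mismatch: those tools speak Ollama's native REST API (/api/chat, /api/tags, /api/generate on port 11434), while llama-server and a llama-swap style proxy expose the OpenAI-compatible API instead. A quick Python sketch of the two request shapes - ports are the usual defaults, the model name is just an example:

    import requests

    # What most "Ollama-compatible" tools actually call: Ollama's native API
    # (default port 11434). These endpoint paths are Ollama's documented ones.
    ollama = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.1",
            "messages": [{"role": "user", "content": "hi"}],
            "stream": False,
        },
    )
    print(ollama.json()["message"]["content"])

    # What llama-server (and a llama-swap style proxy) expose instead: the
    # OpenAI-compatible API. Port and model name here are only illustrative.
    openai_style = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "llama3.1",
            "messages": [{"role": "user", "content": "hi"}],
        },
    )
    print(openai_style.json()["choices"][0]["message"]["content"])

    # Tools hard-wired to /api/chat, /api/tags, /api/generate won't find those
    # routes on a server that only serves the OpenAI-style API, which is the
    # compatibility gap described above.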