monksy 6 hours ago

There needs to be something like this from Ollama. At the moment Ollama has a lot of flaws that prevent it from getting great performance (my understanding is that better GPU/CPU splits would help, among other things). But Ollama is the only way I know of to host an LLM and have it swap models out on demand. Sigh.

zozbot234 6 hours ago | parent | next [-]

Ollama has very substandard support for mmap at present, which hurts inference with larger models. There are some recent pull requests in flight that should address this to at least some extent (https://github.com/ollama/ollama/pull/14525, https://github.com/ollama/ollama/pull/14134, https://github.com/ollama/ollama/pull/14864), but progress seems to have stalled. Their support for recent Qwen models also has some bespoke incompatibilities with llama.cpp, which doesn't help matters: it's difficult to test the same model with both.
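One way to at least run the same weights under both stacks is to point an Ollama Modelfile at the GGUF file that llama.cpp serves directly. A minimal sketch (the filename and context size here are placeholders, not from the thread):

```
# Modelfile -- FROM can reference a local GGUF file directly
FROM ./qwen3-8b-q4_k_m.gguf
PARAMETER num_ctx 8192
```

Then `ollama create qwen3-local -f Modelfile` on one side and `llama-server -m ./qwen3-8b-q4_k_m.gguf` on the other, so both runtimes load the identical quantized file.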

rubiquity 6 hours ago | parent | prev [-]

llama.cpp and llama-swap do this better than Ollama and with far more control.
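For reference, llama-swap is driven by a small YAML config that maps model names to the llama-server command that serves them; a sketch, with model names, paths, and the `ttl` value as placeholders:

```yaml
# llama-swap config.yaml sketch -- names and paths are illustrative
models:
  "qwen3-8b":
    cmd: llama-server --port ${PORT} -m /models/qwen3-8b-q4_k_m.gguf
  "llama3-70b":
    cmd: llama-server --port ${PORT} -m /models/llama3-70b-q4_k_m.gguf
    ttl: 300   # unload after 5 minutes idle
```

An OpenAI-style request to the proxy naming `"model": "llama3-70b"` causes llama-swap to stop the currently running server and start that one, which is the on-demand swapping the parent comment is asking for.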

circularfoyers 4 hours ago | parent [-]

You don't even need llama-swap anymore now that llama-server supports the same functionality.
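In either setup the client side is the same: the `model` field of an OpenAI-compatible request selects which model gets loaded and served. A minimal sketch (the endpoint and model name are hypothetical placeholders):

```python
import json
from urllib.request import Request, urlopen

# Hypothetical local endpoint: llama-swap (and llama-server's multi-model
# mode) route on the OpenAI-compatible "model" field, loading the matching
# backend on demand.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_request(model: str, prompt: str) -> Request:
    """Build an OpenAI-style chat completion request; the server swaps
    to `model` if it is not already loaded."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("qwen3-8b", "Hello")
# urlopen(req) would trigger the swap on the server; not executed here.
```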