lolinder 7 months ago

Llama.cpp forms the base for both Ollama and Kobold.cpp and probably a bunch of others I'm not familiar with. It's less a question of whether you want to use llama.cpp or one of the others and more of a question of whether you benefit from using one of the wrappers.

I can imagine some use cases where you'd really want to use llama.cpp directly, and there are of course always people who will argue that all wrappers are bad wrappers, but for myself I like the combination of ease of use and flexibility offered by Ollama. I wrap it in Open WebUI for a GUI, but I also have some apps that reach out to Ollama directly.

freehorse 7 months ago

What is the advantage of Ollama/Open WebUI, let's say, vs llama-server? I have been using llama.cpp since it came out, I am used to the syntax etc., and I do not have problems building it (probably because I use macOS, which it supports better), so am I missing something by not using Ollama?

lolinder 7 months ago

Llama.cpp may have gotten better, but when I first started using them:

* Ollama provides a very convenient way to download and manage models, which llama.cpp didn't at the time (maybe it does now?).

* Last I checked, with llama.cpp server you pick a model on server startup. Ollama instead allows specifying the model in the API request, and it handles switching out which one is loaded into VRAM automatically.

* The Modelfile abstraction is a more helpful way to keep track of different settings. When I used llama.cpp directly I had to figure out a way to track a bunch of model parameters as bash flags. Modelfiles + being able to specify the model in the request is a great synergy, allowing clients to not have to think about parameters at all, just which Modelfile to use.
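To make the "model in the request" point concrete, here is a minimal sketch of a client that picks the model per call. It assumes Ollama is listening on its default local port (11434) and uses the `/api/generate` endpoint with `stream` turned off. The model names in the usage note are hypothetical placeholders for whatever you have pulled or defined via a Modelfile.

```python
import json
import urllib.request

# Assumption: a local Ollama instance on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request that names the model in the body.

    The client sends no sampling parameters here; those are assumed to
    live in the Modelfile behind the model name.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("my-small-model", "hi")` and then `generate("my-large-model", "hi")` would hit the same server both times; only the `model` field changes, and which one is loaded is left to Ollama.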

I'm leaving off some other reasons why I switched which I know have gotten better (like Ollama having great Docker support, which wasn't always true for llama.cpp), but some of these may also have improved with time. A glance over the docs suggests that you still can't specify a model at runtime in the request, though, which if true is probably the single biggest improvement that Ollama offers. I'm constantly switching between tiny models that fit on my GPU and large models that use RAM, and being able to download a new model with ollama pull and immediately start using it in Open WebUI is a huge plus.