smcleod 10 months ago

Somewhat related - on several occasions I've come across the claim that _"Ollama is just a llama.cpp wrapper"_, which is inaccurate and completely misses the point. I'm sharing my response here to avoid repeating myself.

With llama.cpp running on a machine, how do you connect your LLM clients to it and request that a model be loaded with a given set of parameters and templates?

... you can't, because llama.cpp is the inference engine - and its bundled llama-server binary only provides relatively basic server functionality - it's really more of a demo/example or MVP.

llama.cpp is configured entirely at launch: you run the binary and manually provide command-line args for the one specific model and configuration you start it with.
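
For illustration, a minimal sketch (in Python, just for concreteness) of what that launch-time configuration looks like - the model path and values are placeholders, and it assumes a recent build where the bundled server binary is llama-server with the usual -m / -c / -ngl / --port flags:

    # Rough sketch: llama.cpp's server is parameterised once, at launch.
    # The model path and numbers below are placeholders.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "/models/some-model.gguf",  # one fixed model per process
        "-c", "8192",                     # context size chosen up front
        "-ngl", "99",                     # GPU layer offload chosen up front
        "--port", "8080",
    ])
    # Want a different model or context size? Stop it and start another process.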

Ollama provides a server and client for interfacing with and packaging models, with features such as:

  - Hot loading models (e.g. when you request a model from your client, Ollama will load it on demand).
  - Automatic model parallelisation.
  - Automatic model concurrency.
  - Automatic memory calculations for layer and GPU/CPU placement.
  - Layered model configuration (basically docker images for models).
  - Templating and distribution of model parameters and templates within a container-image-like package.
  - A near feature-complete OpenAI-compatible API, as well as its own native API that supports more advanced features such as model hot loading, context management, etc. (see the API sketch below).
  - Native libraries for common languages.
  - Official container images for hosting.
  - A client/server model for running remote or local inference servers with either Ollama or OpenAI-compatible clients.
  - Support for both official and self-hosted model and template repositories.
  - Support for multi-modal / vision LLMs - something llama.cpp is not currently focusing on.
  - Support for serving safetensors models, as well as running and creating models directly from their Hugging Face model ID.

In addition to the llama.cpp engine, Ollama are working on adding additional model backends (e.g. exl2, AWQ, etc.).
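
For contrast with the llama-server sketch above, a rough sketch of the client side against a locally running Ollama (default port 11434; the model name is just an example) - the point is that the model is loaded on demand rather than fixed at launch:

    # Rough sketch against a local Ollama instance (default port 11434).
    # The model name is an example; Ollama loads it on demand when asked.
    import requests

    # Native API: the model is loaded (and later unloaded) for you as requests arrive.
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1:8b",
        "prompt": "Why is the sky blue?",
        "stream": False,
    })
    print(r.json()["response"])

    # Same server, OpenAI-compatible endpoint, so existing OpenAI clients work too.
    r = requests.post("http://localhost:11434/v1/chat/completions", json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    })
    print(r.json()["choices"][0]["message"]["content"])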

Ollama is not "better" or "worse" than llama.cpp because it's an entirely different tool.

spencerchubb 10 months ago | parent | next [-]

I think what you just said actually reinforces the point that ollama is a llama.cpp wrapper. I don't say that to disparage ollama; in fact, I love ollama. It is an impressive piece of software. If x uses y under the hood, then we say "x is a y wrapper".

smcleod 10 months ago | parent [-]

I mean.... is Debian just a libc6 wrapper? Is Firefox just a JavaScript wrapper?

Given Ollama currently has llama.cpp, mllama, and safetensors backends, there's far more Ollama code and functionality than code that calls llama.cpp.

spencerchubb 10 months ago | parent [-]

why do you say "just" a wrapper? it is not a bad thing to be a wrapper, it is just a descriptive term

the amount of code does not dictate whether or not it is a wrapper.

int_19h 10 months ago | parent | prev [-]

The biggest frustration with Ollama is that it's very opinionated about how it stores models on disk. If all you use is Ollama, that doesn't matter much, but it's frustrating when the underlying GGUF needs to be shared with other tools.
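
For anyone hitting the same thing, a rough sketch of digging the GGUF back out, assuming Ollama's current on-disk layout (OCI-style manifests plus content-addressed blobs under ~/.ollama/models) - none of this is a stable, documented interface:

    # Rough sketch, not a stable interface: locate the GGUF blob for a pulled model.
    # Assumes the default layout: manifests under ~/.ollama/models/manifests and
    # sha256-named blobs under ~/.ollama/models/blobs.
    import json
    from pathlib import Path

    models_dir = Path.home() / ".ollama" / "models"
    # Example manifest path for "llama3.1:8b" pulled from the official registry.
    manifest_path = (models_dir / "manifests" / "registry.ollama.ai"
                     / "library" / "llama3.1" / "8b")

    manifest = json.loads(manifest_path.read_text())
    for layer in manifest["layers"]:
        # The GGUF weights layer appears to use this media type; other layers
        # hold the template, params, etc.
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            blob = models_dir / "blobs" / layer["digest"].replace(":", "-")
            print("GGUF blob:", blob)  # point other tools at this file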