| ▲ | cientifico 5 hours ago |
| For most users who wanted to run LLMs locally, ollama solved the UX problem. One command, and you are running models, even with the ROCm drivers, without knowing it. If llama.cpp provides such a UX, they failed terribly at communicating it. Starting with the name. Llama.cpp: that's a C++ library! Ollama is the wrapper. That's the mental model. I don't want to build my own program! I just want to have fun :-P |
|
| ▲ | JKCalhoun 6 minutes ago | parent | next [-] |
| "LM Studio… Jan… Msty… koboldcpp…" Plenty of alternatives listed. Can anyone with experience suggest the likely successor to Ollama? I have a Mac Mini but don't mind a C/L tool. I think, as was pointed out, Ollama won because of how easy it is to set up, pull down new models. I would expect similar for a replacement. |
|
| ▲ | anakaine 5 hours ago | parent | prev | next [-] |
| Llama.cpp now has a GUI installed by default. It previously lacked this. Times have changed. |
| |
| ▲ | nikodunk 5 hours ago | parent | next [-] | | Having read the above article, I just gave llama.cpp a shot. It is as easy as the author says now, though definitely not documented quite as well. My quickstart:

    brew install llama.cpp
    llama-server -hf ggml-org/gemma-4-E4B-it-GGUF --port 8000

Go to localhost:8000 for the Web UI. On Linux it accelerates correctly on my AMD GPU, which Ollama failed to do, though of course everyone's mileage seems to vary on this. | | |
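For what it's worth, llama-server also exposes an OpenAI-compatible API under /v1 on the same port, so the quickstart above can be scripted against. A minimal client sketch in Python (the port and endpoint follow the command above; llama-server largely ignores the `model` field by default, and the request is only built here, not sent):

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8000/v1"):
    """Build an OpenAI-style chat completion request for a local llama-server."""
    payload = {
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return req, payload

req, payload = build_chat_request("Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

With the server from the quickstart running, `urllib.request.urlopen(req)` would return the completion as JSON.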
| ▲ | teekert 4 hours ago | parent [-] | | Was hoping it was so easy :) But I probably need to look into it some more.

    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
    llama_model_load_from_file_impl: failed to load model

Edit: @below, I used `nix-shell -p llama-cpp`, so it's not brew related. Could indeed be an older version! I'll check. | | |
| ▲ | adrian_b an hour ago | parent | next [-] | | As has been discussed in a few recent threads on HN, whenever a new model is released, running it successfully may require changes in the inference backends, such as llama.cpp. There are two main reasons. One is the tokenizer, where new tokenizer definitions may be mishandled by older tokenizer parsers. The second is that each model may implement tool invocations differently, e.g. by using different delimiter tokens and different text layouts for describing the parameters of a tool invocation. Therefore running the Gemma-4 models encountered various problems during the first days after their release, especially for the dense 31B model. Solving these problems required both a new version of llama.cpp (and of the other inference backends) and updates to the model's chat template and tokenizer configuration files. So anyone who wants to use Gemma-4 should update to the latest version of llama.cpp and to the latest models from Huggingface, because the latest updates were only a couple of days ago. | |
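The delimiter-token point is easy to see with a toy sketch: the same one-turn conversation rendered under two made-up template styles (illustrative only, not the real Gemma or ChatML templates). A parser hard-coded for one set of delimiters has no way to split turns written with the other:

```python
# Toy illustration of why new models break old backends: each model family
# wraps the same conversation in its own delimiter tokens.

def render(messages, bos, start, end):
    """Render a chat as a single prompt string using the given delimiters."""
    out = bos
    for m in messages:
        out += f"{start}{m['role']}\n{m['content']}{end}\n"
    return out

msgs = [{"role": "user", "content": "hi"}]

# Style A: Gemma-like turn markers (hypothetical)
a = render(msgs, "<bos>", "<start_of_turn>", "<end_of_turn>")
# Style B: ChatML-like turn markers (hypothetical)
b = render(msgs, "", "<|im_start|>", "<|im_end|>")

print(repr(a))
print(repr(b))
```

An inference backend that only knows style A will treat style B's markers as ordinary text (or unknown tokens), which is roughly what the `unknown model architecture` class of errors upstream of this looks like to the user.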
| ▲ | roosgit 4 hours ago | parent | prev [-] | | I just hit that error a few minutes ago. I build llama.cpp from source because I use CUDA on Linux. So I made the mistake of trying to run Gemma4 on an older version I had, and I got the same error. It's possible brew installs an older version which doesn't support Gemma4 yet. | | |
| ▲ | teekert 3 hours ago | parent | next [-] | | Ah it was indeed just that! I'm now on: $ llama --version
version: 8770 (82764d8)
built with GNU 15.2.0 for Linux x86_64 (From Nix unstable) And this works as advertised: nice chat interface, but no OpenAI API I guess, so no opencode... | | | |
| ▲ | zozbot234 3 hours ago | parent | prev [-] | | And that's exactly why llama.cpp is not usable by casual users. They follow the "move fast and break things" model. With ollama, you just have to make sure you're getting/building the latest version. | | |
| ▲ | Eisenstein 2 hours ago | parent [-] | | It's not possible to run the latest model architectures without 'moving fast'. The only thing broken here is that they are trying to use an old version with a new model. | | |
|
|
|
| |
| ▲ | OtherShrezzing 5 hours ago | parent | prev | next [-] | | While that might be true, for as long as its name is “.cpp”, people are going to think it’s a C++ library and avoid it. | | |
| ▲ | eterm 5 hours ago | parent | next [-] | | This is the first I'm learning that it isn't just a C++ library. In fact the first line of the wikipedia article is: > llama.cpp is an open source software library | | | |
| ▲ | RobotToaster 5 hours ago | parent | prev | next [-] | | It would make sense to just make the GUI a separate project, they could call it llama.gui. | | |
| ▲ | homarp 3 hours ago | parent [-] | | It is called llama-barn:
https://github.com/ggml-org/LlamaBarn | | |
| ▲ | adrian_b an hour ago | parent [-] | | LlamaBarn is the macOS app, not the HTTP API server, which is "llama-server". On non-Apple PCs, "llama-server" is what you use, and you can connect to it either with a browser or with an application compatible with the OpenAI API. Perhaps using "llama-server" as the name of the project would have been less confusing for newbies than "llama.cpp". I confess that when I first heard about "llama.cpp" I also thought that it was just a library and that I would have to write my own program to get a complete LLM inference backend. |
|
| |
| ▲ | figassis 5 hours ago | parent | prev [-] | | This is correct, and I avoided it for this reason: I did not have the bandwidth to get into any C++ rabbit hole, so I just used whatever seemed to abstract it away. |
| |
| ▲ | mijoharas 5 hours ago | parent | prev [-] | | Frankly, I think the CLI UX and documentation are still much better for ollama. It makes a bunch of decisions for you, so you don't have to think much to get a model up and running. |
|
|
| ▲ | omgitspavel 3 hours ago | parent | prev | next [-] |
| Agreed. We can easily compare it with docker. Of course people can use runc directly, but most people choose not to and use `docker run` instead. And you can blame docker in a similar manner: LXC existed for at least 5 years before docker, but docker was just much more convenient for the average user. UX is a huge factor in the adoption of technology. If a project fails at creating the right interface, there is nothing wrong with creating a wrapper. |
|
| ▲ | samus 4 hours ago | parent | prev | next [-] |
| How about kobold.cpp then? Or LMStudio (I know it's not open source, but at least they give proper credit to llama.cpp)? Re curation: they should strive not to integrate broken support for models, and avoid uploading broken GGUFs. |
|
| ▲ | ekianjo 4 hours ago | parent | prev | next [-] |
| > For most users that wanted to run LLM locally, ollama solved the UX problem

This does not absolve them of the license violation. |
|
| ▲ | well_ackshually 4 hours ago | parent | prev | next [-] |
| > solved the UX problem.
> One command

Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`, and that running things in the terminal is already a gigantic UX blocker (Ollama's popularity comes from the fact that it has a GUI), why are you putting the blame back on an open source project that owes you approximately zero communication? |
| |
| ▲ | zozbot234 4 hours ago | parent | next [-] | | > Ollama's popularity comes from the fact that it has a GUI It's not the GUI, it's the curated model hosting platform. Way easier to use than HF for casual users. | | |
| ▲ | kgwgk 3 hours ago | parent [-] | | It also made it easy for casual users to think that they were running DeepSeek. |
| |
| ▲ | Eisenstein 2 hours ago | parent | prev [-] | | > Notwithstanding the fact that there's about zero difference between `ollama run model-name` and `llama-cpp -hf model-name`

There is a TON of difference. Ollama downloads the model from its own model-library server, sticks it somewhere in your home folder under a hashed name with a proprietary configuration that ignores the built-in metadata specified by the model creator. So you can't share it with any other tool, you can't change parameters like temperature on the fly, and you are stuck with whatever quants they offer. |
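The built-in metadata mentioned here lives in the GGUF file itself, which opens with a small fixed header (magic, version, tensor count, key/value count) followed by the key/value metadata such as the chat template and tokenizer settings. A toy sketch of that fixed prefix, not a full parser:

```python
import struct

GGUF_MAGIC = b"GGUF"

def write_minimal_header(version=3, n_tensors=0, n_kv=0):
    """Pack the fixed-size start of a GGUF file: 4-byte magic, then
    little-endian uint32 version, uint64 tensor count, uint64 kv count."""
    return GGUF_MAGIC + struct.pack("<IQQ", version, n_tensors, n_kv)

def read_header(blob):
    """Unpack the same 24-byte fixed prefix."""
    magic = blob[:4]
    version, n_tensors, n_kv = struct.unpack("<IQQ", blob[4:24])
    return magic, version, n_tensors, n_kv

hdr = write_minimal_header(n_kv=2)
print(read_header(hdr))  # (b'GGUF', 3, 0, 2)
```

Because this header and its key/value block travel inside the single `.gguf` file, any GGUF-aware tool can read the same model file; a sidecar configuration in a private store loses that portability.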
|
|
| ▲ | amelius 4 hours ago | parent | prev | next [-] |
| Whip that llama! Oh wait, that's a different program. |
| |
|
| ▲ | croes 3 hours ago | parent | prev | next [-] |
| But if you're just a GUI wrapper, then at least attribute the library you created the GUI for |
|
| ▲ | FrozenSynapse 5 hours ago | parent | prev [-] |
| but if ollama is much slower, that's cutting into your fun, and you'd have more fun with a faster GUI |
| |