bastawhiz 5 days ago
There's more to it, though. The inference code you linked to is Python. Unless my software is Python, I have to ship a CPython runtime to run the inference code, then wire it up (or port it, if you're feeling spicy). Ollama brings value by exposing an API (literally over sockets) with many client SDKs. You don't even need the SDKs to use it effectively. If you're writing Node or PHP or Elixir or Clojurescript or whatever else you enjoy, you're probably covered. It also means that you can swap models trivially, since you're essentially using the same API for each one. You never need to worry about dependency hell or the issues involved in hosting more than one model at a time. As far as I know, Ollama is really the only solution that does this. Or at the very least, it's the most mature.
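
To make the "you don't even need the SDKs" point concrete, here is a minimal sketch in TypeScript/Node of hitting Ollama's HTTP API directly with plain fetch. It assumes Ollama is running locally on its default port (11434) and that a model called "llama3" has already been pulled; the model name and prompt are placeholders.

    // Minimal sketch: calling Ollama's HTTP API from Node without any SDK.
    // Assumes Ollama is listening on its default port 11434 and that the
    // model "llama3" has already been pulled (swap in whatever model you use).
    async function generate(prompt: string): Promise<string> {
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "llama3", prompt, stream: false }),
      });
      if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
      const data = (await res.json()) as { response: string };
      return data.response;
    }

    generate("Why is the sky blue?").then(console.log).catch(console.error);

Swapping models is just a matter of changing the "model" string in the request body, which is what makes the "same API for each one" point so convenient.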
refulgentis 5 days ago
The relationship between Ollama and llama.cpp is much closer than it might seem. Ollama is essentially llama.cpp with a nice little installer GUI and a nice little server binary. llama.cpp has a server binary as well, just without the nice installer GUI. The only recent case where Ollama had a feature llama.cpp didn't was when they patched in SWA (sliding-window attention) with Google; llama.cpp had it a couple of weeks later. Meanwhile Ollama is significantly behind llama.cpp in important areas: in the Gemma blog post, for example, they note they'll get to tool calls and multimodal support real soon now.
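
For comparison with the snippet above, a minimal sketch of talking to llama.cpp's own server binary (llama-server), which exposes an OpenAI-compatible endpoint. It assumes you have already started the server locally on port 8080 with some GGUF model loaded; the port and message are placeholders.

    // Minimal sketch: llama.cpp's llama-server exposes an OpenAI-compatible
    // HTTP API, so no extra layer is strictly required on top of it.
    // Assumes llama-server is already running locally on port 8080 with a
    // model loaded; no "model" field is needed since the server hosts one model.
    async function chat(userMessage: string): Promise<string> {
      const res = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          messages: [{ role: "user", content: userMessage }],
        }),
      });
      if (!res.ok) throw new Error(`llama-server returned ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    }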