steren 7 days ago:

> I would never want to use something like ollama in a production setting.

We benchmarked vLLM and Ollama on both startup time and tokens per second, and Ollama came out on top. We hope to be able to publish these results soon.
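For anyone who wants to reproduce a tokens-per-second number themselves, here is a minimal single-stream sketch against Ollama's native /api/generate endpoint (the localhost URL is Ollama's default; the model name is a placeholder), using the eval_count and eval_duration fields Ollama returns:

    import time
    import requests

    # Single-stream tokens/sec probe against Ollama's native API.
    # Assumes Ollama is running locally and the model is already pulled.
    URL = "http://localhost:11434/api/generate"  # Ollama's default port
    MODEL = "llama3"  # placeholder; substitute whatever model you serve

    t0 = time.monotonic()
    resp = requests.post(URL, json={
        "model": MODEL,
        "prompt": "Explain paged attention in one paragraph.",
        "stream": False,
    }, timeout=300)
    wall = time.monotonic() - t0
    body = resp.json()

    # Ollama reports generation stats; eval_duration is in nanoseconds.
    tokens = body["eval_count"]
    gen_s = body["eval_duration"] / 1e9
    print(f"{tokens} tokens, {tokens / gen_s:.1f} tok/s generation, "
          f"{wall:.2f}s wall clock (includes model load on a cold start)")

The same timing loop can be pointed at a vLLM or llama.cpp server via their OpenAI-compatible /v1/completions endpoints.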
ekianjo 7 days ago:

You need to benchmark against llama.cpp as well.
apitman 7 days ago:

Did you test multi-user cases?
sbinnee 6 days ago:

vLLM and Ollama assume different settings and hardware. vLLM, backed by PagedAttention, expects many requests from multiple users, whereas Ollama is usually for a single user on a local machine.
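To make the multi-user distinction measurable, a rough sketch (the port, model name, and request count are assumptions) that fires N concurrent requests at an OpenAI-compatible /v1/completions endpoint, which vLLM serves by default, and reports aggregate throughput:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    # Aggregate-throughput probe: N concurrent "users" against an
    # OpenAI-compatible completions endpoint.
    URL = "http://localhost:8000/v1/completions"  # vLLM's default port
    MODEL = "placeholder-model"                   # substitute the served model
    N = 16                                        # simulated concurrent users

    def one_request(i: int) -> int:
        resp = requests.post(URL, json={
            "model": MODEL,
            "prompt": f"User {i}: summarize paged attention briefly.",
            "max_tokens": 128,
        }, timeout=300)
        # OpenAI-compatible servers report token counts under "usage".
        return resp.json()["usage"]["completion_tokens"]

    t0 = time.monotonic()
    with ThreadPoolExecutor(max_workers=N) as pool:
        total = sum(pool.map(one_request, range(N)))
    wall = time.monotonic() - t0
    print(f"{N} concurrent requests: {total} tokens in {wall:.1f}s "
          f"({total / wall:.1f} aggregate tok/s)")

A continuous-batching server like vLLM should hold aggregate tok/s roughly steady as N grows, while a single-stream server's latency degrades closer to linearly.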