barrkel 3 hours ago
I found it interesting that vLLM was dismissed as slower than llama.cpp. IME vLLM is quite a bit faster than llama.cpp, but where it really wipes the floor with it is batching concurrent load. The downside is that it is dramatically less flexible to tweak: it gives you very few options for running quantized weights, and it takes a lot longer to start up because it optimizes the compute graph. So for single-user experimentation on a model that's a bit too big for your box, vLLM is just going to be frustrating.
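To make the batching point concrete, here is a back-of-envelope sketch, not a benchmark. It assumes each decode step has a fixed cost (roughly, reading the weights once) plus a small per-sequence cost, and that one step yields one token per sequence in the batch. The function name `tokens_per_second` and the numbers `fixed_ms` and `per_seq_ms` are made up for illustration; they are not measurements of vLLM or llama.cpp.

```python
def tokens_per_second(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Toy model of aggregate decode throughput.

    Assumption (hypothetical numbers): one decode step costs
    fixed_ms + per_seq_ms * batch_size milliseconds and produces
    one token for each of the batch_size sequences.
    """
    step_ms = fixed_ms + per_seq_ms * batch_size
    return batch_size * 1000.0 / step_ms


if __name__ == "__main__":
    # Compare a single user against increasingly large concurrent batches.
    for n in (1, 8, 32):
        print(f"batch={n:>2}  ~{tokens_per_second(n):.0f} tok/s aggregate")
```

Under these assumptions, aggregate throughput grows almost linearly with batch size while the fixed cost dominates, which is where a server that batches concurrent requests would pull ahead. A single user sits at batch size 1 and sees none of that gain.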
chartered_stack 2 hours ago
One could say: vLLM isn't a worse llama.cpp, it's a different tool.
krzyk an hour ago
AFAIR the general consensus is (was?):
- llama.cpp for single user
- vLLM for multi-user (e.g. enterprises)
They are similar, but for different use cases.