xscott 16 hours ago:
Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that). However, there isn't much of a memory increase from running multiple sessions in parallel against one model. It's an HTTP server, and other than some caching, it's basically stateless.
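To see why extra sessions are comparatively cheap, note that the model weights are loaded once and shared; each session only adds its own KV cache. A rough back-of-the-envelope sketch (the formula is the standard per-token KV accounting; the example parameter values below are illustrative, not tied to any specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    """Approximate KV-cache size for one session.

    Each layer stores a K tensor and a V tensor (hence the factor of 2),
    each of shape (n_kv_heads, ctx_len, head_dim), at bytes_per_elt per
    element (2 for fp16/bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical mid-size model: 32 layers, 8 KV heads (GQA), head dim 128,
# a 4096-token context, fp16 cache.
per_session = kv_cache_bytes(32, 8, 128, 4096)
print(per_session / 2**20, "MiB per session")  # 512.0 MiB per session
```

So with weights of several GiB shared across all sessions, each additional concurrent session costs only its KV cache (here ~512 MiB at fp16; quantized caches shrink that further).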
iib 15 hours ago (in reply):
Doesn't llama.cpp (or similar) have to evict the KV cache for this, so that performance degrades when running multiple sessions? Or how do you load a model into memory and then use it across multiple sessions? I am still learning this stuff.