raron 5 hours ago:
How big is this cached data? Wouldn't it be possible to download it after a few minutes of idling ("suspending" the session), then upload and restore it when the user starts their next interaction?
throwdbaaway 4 hours ago:
Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does, using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer a much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
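For a rough sense of where the 10~20 GiB figure could come from, here is a minimal back-of-the-envelope sketch of KV-cache size for a generic transformer with grouped-query attention. The model shape below (60 layers, 8 KV heads, head dim 128, fp16) is purely illustrative, not the actual DeepSeek architecture (which uses MLA compression):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, dtype_bytes=2):
    # 2x for keys and values; dtype_bytes=2 assumes fp16/bf16 elements.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

# Hypothetical large-model shape with a 64k-token session:
size = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, n_tokens=64_000)
print(f"{size / 2**30:.1f} GiB")  # ~14.7 GiB, in the claimed range
```

A long agentic session easily accumulates tens of thousands of tokens, so per-session caches in the tens of GiB are plausible under these assumptions.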
cyanydeez 4 hours ago:
I often see a local model (QWEN3.5-Coder-Next) grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse.

Even if you wanted to download it, offload it, or have that offered as a service, to start back up again you'd _still_ be paying the token cost, because all of that context _is_ the tokens you've just processed. The cache is what makes the journey from a 1k-token prompt to a 1-million-token solution speedy within one 'vibe' session. Recomputing it would cost the entire journey again.
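One nuance to the "you'd still be paying the token cost" point: if the provider reloads the saved KV cache from disk rather than re-running prefill, the restore can be far cheaper than recomputation. A hedged comparison with made-up but plausible numbers (5k tok/s prefill, 3 GB/s NVMe read):

```python
def prefill_seconds(n_tokens, prefill_tok_per_s):
    # Time to rebuild the cache by re-running prefill over the whole context.
    return n_tokens / prefill_tok_per_s

def reload_seconds(cache_bytes, disk_bytes_per_s):
    # Time to stream a saved KV cache back in from disk.
    return cache_bytes / disk_bytes_per_s

recompute = prefill_seconds(1_000_000, 5_000)      # 1M-token context -> 200 s
reload = reload_seconds(15 * 2**30, 3e9)           # 15 GiB cache -> ~5.4 s
print(f"recompute: {recompute:.0f} s, reload: {reload:.1f} s")
```

Whether the *billing* reflects that saving is a separate question, but the compute cost of a disk restore is not the same as replaying the whole journey.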