adrian_b 2 hours ago
For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched so that they can progress simultaneously through a single pass over the SSD-resident data.
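A back-of-envelope sketch of why batching helps here. The numbers (weight size, NVMe bandwidth) are illustrative assumptions, not from the thread; the point is that one sequential read of the SSD-resident weights can serve every task in the batch, so throughput scales roughly with batch size while per-task latency stays the same.

```python
# Sketch: amortizing one pass over SSD-resident weights across a batch.
# All figures below are hypothetical, for illustration only.

def tokens_per_second(weight_bytes, ssd_bw_bytes_s, batch_size):
    """One full streaming pass over the weights produces one token
    for every sequence in the batch."""
    pass_time = weight_bytes / ssd_bw_bytes_s  # one sequential read of all weights
    return batch_size / pass_time

# e.g. ~200 GB of weights over a ~7 GB/s NVMe drive:
solo = tokens_per_second(200e9, 7e9, batch_size=1)     # ~0.035 tok/s total
batched = tokens_per_second(200e9, 7e9, batch_size=64) # ~2.2 tok/s total
```

Each individual task still crawls along at SSD speed, but aggregate throughput is 64x higher, which is why batched coding-agent workloads are a better fit than interactive chat.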
QuantumNomad_ 2 hours ago
Three-hour coffee break while the LLM prepares scaffolding for the project.
zozbot234 2 hours ago
Batching many disparate tasks together is good for compute efficiency, but it makes it harder to keep the full KV-cache for each task in RAM. In a pinch you could dump some of that KV-cache to storage (this is how prompt caching works too, AIUI) and offload reads of it as well, but that adds far more overhead than offloading sparsely-used experts, since the KV-cache is accessed much more heavily.
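To see why the KV-cache dominates RAM in a batched setting, here is a rough sizing sketch. The model shape (layers, KV heads, head dim, context length) is a hypothetical example, not a specific model from the thread; the formula is the standard one for per-token K and V tensors.

```python
# Rough KV-cache footprint per sequence, for a hypothetical model shape.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V tensors per layer: 2 * kv_heads * head_dim entries per token,
    # stored for every token in the context, in fp16/bf16 (2 bytes).
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

per_seq = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=32_000)
batch_64 = 64 * per_seq  # total cache for a 64-task batch
```

With these assumed numbers a single 32k-token sequence needs on the order of 8 GB of KV-cache, so a 64-task batch needs hundreds of gigabytes, and unlike a cold expert, every one of those bytes is read at every decoding step.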