zozbot234 4 hours ago

SSD offload is always a possibility with good software support. Of course you might easily object that the model would not be "running" then, more like crawling. Still, you'd be able to execute it locally and get it to respond after some time.
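A minimal sketch of what SSD offload looks like in practice, assuming weights are stored one float32 matrix per layer in `.npy` files (the file layout and toy ReLU layers here are illustrative, not any real runtime's format). Memory-mapping means only the pages actually touched are streamed from SSD, so RAM never holds the full model:

```python
# Layer-by-layer SSD offload sketch: weights live on disk and are
# memory-mapped per layer, so the OS page cache streams them in on
# demand instead of loading the whole model into RAM.
import numpy as np

def forward_offloaded(x, layer_files):
    for path in layer_files:
        w = np.load(path, mmap_mode="r")  # mapped from SSD, not copied to RAM
        x = np.maximum(x @ w, 0.0)        # toy ReLU layer for illustration
    return x
```

Throughput is then bounded by SSD read bandwidth per pass, which is exactly why it "crawls" rather than runs.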

Meanwhile, we're even seeing emerging techniques, such as 'engram' and 'inner-layer embedding parameters', where SSD offload is planned for in advance as part of the architecture design.

adrian_b 2 hours ago | parent [-]

For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched so that they can all progress simultaneously through a single pass over the SSD data.
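A back-of-envelope model of why batching helps here (the bandwidth figure is a hypothetical placeholder, not a benchmark): one pass over SSD-resident weights is dominated by read time, and batching N tasks lets them all share that single read.

```python
# Toy cost model: time for one full pass over the weights is read-bound,
# and a batch of N tasks amortizes it N ways. Numbers are illustrative.
def pass_time_s(model_bytes, ssd_gbps=3.0):
    """Seconds to stream the whole model once from SSD."""
    return model_bytes / (ssd_gbps * 1e9)

def time_per_task_s(model_bytes, batch=1, ssd_gbps=3.0):
    """Effective per-task cost of one forward pass when batched."""
    return pass_time_s(model_bytes, ssd_gbps) / batch
```

With, say, 120 GB of weights at 3 GB/s, a pass takes ~40 s; batching 8 coding tasks brings the effective per-task cost down to ~5 s per step.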

QuantumNomad_ 2 hours ago | parent | next [-]

Three hour coffee break while the LLM prepares scaffolding for the project.

cyanydeez an hour ago | parent [-]

Or ship your plan to your Indian counterpart and wake up in the morning with about the same number of causal errors.

dcreater 3 minutes ago | parent [-]

@dang

zozbot234 an hour ago | parent | prev [-]

Batching many disparate tasks together is good for compute efficiency, but it makes it harder to keep the full KV-cache for each task in RAM. In a pinch you could dump some of that KV-cache to storage (this is how prompt caching works too, AIUI) and offload it as well, but that adds a lot more overhead than just offloading sparsely-used experts, since the KV-cache is far more heavily accessed.