cl0ckt0wer 2 hours ago
LLM intelligence seems to be roughly proportional to the RAM used. Techniques like this one will end up being used by everyone.
zozbot234 35 minutes ago | parent
You can almost always use less RAM by making inference slower. Streaming MoE active weights from SSD is an especially effective variety of this, but even with a large dense model you could run inference layer by layer (perhaps coalescing a few layers at a time) if the model on its own is too large for your RAM. You still need to store the KV cache, but that takes only modest space and, at least for ordinary transformers (no linear-attention tricks), is append-only, which fits well with writing it to SSD (AIUI, this is also how "cached" prompts/conversations work under the hood).
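A minimal sketch of the layer-wise idea (illustrative only, not how any particular runtime does it): keep per-layer weight files on disk and memory-map one layer at a time during the forward pass, so peak weight residency is about one layer rather than the whole model. All names and shapes here are made up for the demo.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 64, 8

# Pretend this is a model too big for RAM: one weight file per layer on "SSD".
tmpdir = tempfile.mkdtemp()
for i in range(n_layers):
    w = rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)
    np.save(os.path.join(tmpdir, f"layer_{i}.npy"), w)

def forward_streaming(x: np.ndarray) -> np.ndarray:
    """Run the forward pass with only one layer's weights resident at a time."""
    for i in range(n_layers):
        # mmap the layer's weights instead of reading them fully into RAM;
        # pages are faulted in as the matmul touches them.
        w = np.load(os.path.join(tmpdir, f"layer_{i}.npy"), mmap_mode="r")
        x = np.tanh(x @ w)  # stand-in for a full transformer block
        del w               # drop the mapping so the OS can evict those pages
    return x

x = rng.standard_normal((1, d_model)).astype(np.float32)
y = forward_streaming(x)
print(y.shape)
```

The trade-off is exactly what the comment describes: every token now costs one sweep of the model's weights over the SSD, so throughput is bounded by storage bandwidth rather than memory bandwidth. Coalescing a few layers per load amortizes that cost a bit.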