deepsquirrelnet 5 hours ago
For mixture-of-experts models, it primarily helps with time-to-first-token latency, generation throughput, and the memory consumed by input context. You still need enough RAM/VRAM to hold the full parameter set, but the memory used per token of context scales much better than in a dense model of comparable total size.
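A back-of-envelope sketch of the context-memory point: KV-cache size is set by the attention stack, and an MoE with a given total parameter count typically has an attention stack sized like a much smaller dense model. All architecture numbers below (layer counts, head counts, context length) are illustrative assumptions, not any real model's config.

```python
# Rough KV-cache memory comparison: dense model vs. an MoE of similar
# total parameter count. Numbers are hypothetical, for illustration only.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, dtype_bytes=2):
    # Each token stores one key and one value vector per layer per KV head,
    # at dtype_bytes per element (2 for fp16/bf16).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical dense ~50B model: deeper, wider attention.
dense = kv_cache_bytes(n_layers=60, n_kv_heads=32, head_dim=128,
                       context_len=32_768)

# Hypothetical ~50B-total MoE (e.g. 8 experts, 2 active per token): its
# attention stack resembles a small dense model, so its KV cache does too.
moe = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     context_len=32_768)

print(f"dense KV cache at 32k context: {dense / 2**30:.1f} GiB")  # ~30.0 GiB
print(f"MoE KV cache at 32k context:   {moe / 2**30:.1f} GiB")    # ~4.0 GiB
```

Under these assumed shapes the MoE's context memory is several times smaller at the same context length, even though both models need their full weights resident.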