deepsquirrelnet 5 hours ago
For mixture-of-experts models, it primarily helps with time-to-first-token latency, generation throughput, and the memory consumed by input context. You still need enough RAM/VRAM to hold the full parameter set, but the memory used per token of context scales much better than in a dense model of comparable total size.
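A back-of-envelope sketch of the context-memory point: KV-cache size is set by the attention stack, and an MoE with a given total parameter count typically has an attention stack sized like a much smaller dense model. All architecture numbers below (layer counts, head counts, context length) are illustrative assumptions, not any real model's config.

```python
# Rough KV-cache memory comparison: dense model vs. an MoE of similar
# total parameter count. Numbers are hypothetical, for illustration only.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, dtype_bytes=2):
    # Each token stores one key and one value vector per layer per KV head,
    # at dtype_bytes per element (2 for fp16/bf16).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical dense ~50B model: deeper, wider attention.
dense = kv_cache_bytes(n_layers=60, n_kv_heads=32, head_dim=128,
                       context_len=32_768)

# Hypothetical ~50B-total MoE (e.g. 8 experts, 2 active per token): its
# attention stack resembles a small dense model, so its KV cache does too.
moe = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     context_len=32_768)

print(f"dense KV cache at 32k context: {dense / 2**30:.1f} GiB")  # ~30.0 GiB
print(f"MoE KV cache at 32k context:   {moe / 2**30:.1f} GiB")    # ~4.0 GiB
```

Under these assumed shapes the MoE's context memory is several times smaller at the same context length, even though both models need their full weights resident.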