mft_ 5 hours ago
I’m never clear, for these models with only a proportion of parameters active (32B here), to what extent this reduces the RAM a system needs, if at all?
l9o 5 hours ago
RAM requirements stay the same. You need all 358B parameters loaded in memory, since which experts activate is decided dynamically for each token. The benefit is compute: only ~32B parameters participate in each forward pass, so you get much higher tok/s than a dense 358B model would give you.
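To make that concrete, here is a minimal sketch of top-k MoE routing in PyTorch. The layer, dimensions, and expert counts are illustrative assumptions, not this model's actual architecture; the point is that every expert's weights must be resident because any token may route to any of them, while only the k experts the gate selects per token do any compute.

    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        def __init__(self, d_model=1024, n_experts=64, k=2, d_ff=4096):
            super().__init__()
            # All experts must be resident in memory: any token may route to any of them.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(d_model, n_experts)
            self.k = k

        def forward(self, x):                            # x: (n_tokens, d_model)
            scores = self.gate(x)                        # (n_tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)   # k experts chosen per token
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):    # only selected experts run compute
                hit = (idx == e).any(dim=-1)             # tokens routed to expert e
                if hit.any():
                    w = weights[hit][idx[hit] == e].unsqueeze(-1)
                    out[hit] += w * expert(x[hit])
            return out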
deepsquirrelnet 5 hours ago
For mixture-of-experts models, it primarily helps with time-to-first-token latency, generation throughput, and context-length memory usage. You still need enough RAM/VRAM to load the full parameters, but memory consumed by input context scales much better than with a dense model of comparable size. A rough way to separate the two memory terms is sketched below.
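A back-of-the-envelope sketch of the two terms: weight memory is fixed by total parameter count, while context memory (the KV cache) grows with sequence length. All numbers here are assumptions for illustration, not this model's published config.

    def weight_gb(total_params_billions, bytes_per_param=2):       # bf16/fp16 weights
        return total_params_billions * 1e9 * bytes_per_param / 1e9

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        # 2x for keys and values, stored per token, per layer
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    print(weight_gb(358))                                          # 716 GB of weights at bf16
    print(kv_cache_gb(n_layers=92, n_kv_heads=8, head_dim=128,
                      context_len=128_000))                        # ~48 GB for a 128k context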
aurohacker 4 hours ago
Great answers here: for MoE there's a compute saving but no memory saving, even though the network is super-sparse. It turns out there is a paper on predicting in advance the experts to be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...
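For intuition only, here is a loose sketch of the general "lookahead" idea: a small gate guesses which experts an upcoming layer will route to, so their weights can be prefetched to the GPU while the current layer computes. This is an illustrative guess at the mechanism, not a reproduction of that paper's method.

    import torch
    import torch.nn as nn

    class LookaheadGate(nn.Module):
        """Tiny predictor of a later layer's expert choices from the current hidden state."""
        def __init__(self, d_model=1024, n_experts=64, k=2):
            super().__init__()
            self.predictor = nn.Linear(d_model, n_experts)
            self.k = k

        def predict_next_experts(self, hidden):          # hidden: (n_tokens, d_model)
            logits = self.predictor(hidden)
            return logits.topk(self.k, dim=-1).indices   # expert ids worth prefetching

    def prefetch(cpu_expert_weights, expert_ids, device="cuda"):
        # Copy only the predicted experts' weights; in a real system this would run on a
        # separate CUDA stream so the transfer overlaps with the current layer's compute.
        return {int(e): cpu_expert_weights[int(e)].to(device, non_blocking=True)
                for e in torch.unique(expert_ids)}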
noahbp 5 hours ago
It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however: having just the parameters/experts used in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass. Technically you don't even need enough RAM to hold the entire model, as some inference engines let you offload some layers to disk. Though even with top-of-the-line SSDs, this won't be ideal unless you can accept very low, single-digit token generation rates.
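As one hedged example of that kind of offloading, Hugging Face transformers with accelerate can spill layers that don't fit in VRAM to CPU RAM and then to disk. The model id below is a placeholder and the memory limits are made-up examples to tune for your hardware.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/moe-358b-a32b"         # placeholder, not a real repo id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",                      # fill GPU first, then CPU RAM, then disk
        max_memory={0: "24GiB", "cpu": "256GiB"},
        offload_folder="./offload",             # layers that don't fit above spill here
    )

    inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("cuda")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))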