> Though iPhone Pro has very limited RAM (12GB total) which you still need for the active part of the model.

This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.

▲

zozbot234 4 hours ago | parent [-]

Yes but most people are still running MoE models with all experts loaded in RAM! This experiment shows quite clearly that some experts are only rarely needed, so you do benefit from not caching every single expert-layer in RAM at all times.

▲

Aurornis 3 hours ago | parent | next [-]

That's not what this test shows. It's just loading the parts of the model that are used in an on-demand fashion from flash.

The iPhone 17 Pro only has 12GB of RAM. This is a -17B MoE model. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.

If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's ony a small number. Their output is not good. You really need all of the experts to get the model's quality.

▲

QuantumNomad_ an hour ago | parent | next [-]

If I only use an LLM to ask questions about programming in one specific programming language, can I distill away other experts and get all the answers I need from a single expert? Or is it still different experts that end up handling the question depending on what else is in the question? For example, if I say “plan a static web server in Rust” it might use expert A for that, but if I say “implement a guessing game in Rust” it might use expert B, and so on?

▲

zozbot234 3 hours ago | parent | prev [-]

The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are only used rarely. So you can save some RAM footprint even while swapping quite rarely.

▲

Aurornis 3 hours ago | parent [-]

I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.

When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.

	▲	zozbot234 3 hours ago \| parent [-]
		"Individual experts" is a bit of a red-herring, what matters is expert-layers (this is the granularity of routing decisions), and these are small as mentioned by the original writeup. The filesystem cache does a tolerable job of keeping the "often used" ones around while evicting those that aren't needed (this is what their "Trust the OS" point is about). Of course they're also reducing the amount of active experts and quantizing a lot, AIUI this iPhone experiment uses Q1 and the MacBook was Q2.

▲

MillionOClock 2 hours ago | parent | prev | next [-]

I hope some company trains their models so that expert switches are less often necessary just for these use cases.

	▲	zozbot234 2 hours ago \| parent [-]
		A model "where expert switches are less necessary" is hard to tell apart from a model that just has fewer total experts. I'm not sure whether that will be a good approach. "How often to switch" also depends on how much excess RAM has been available in the system to keep layers opportunistically cached from the previous token(s). There's no one-size fits all decision.

▲

jnovek 3 hours ago | parent | prev [-]

I’m so confused in these comments right now — I thought you had to load an entire MoE model and sparseness just made it so you can traverse the model more quickly.