Aurornis 3 hours ago

I understand, but this isn't just a matter of not caching some experts. This is a 397B model on a device with 12GB of RAM. It's basically swapping experts out all the time, even if the distribution isn't uniform.

When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.

zozbot234 3 hours ago

"Individual experts" is a bit of a red herring. What matters is expert-layers (that's the granularity of routing decisions), and these are small, as mentioned in the original writeup. The filesystem cache does a tolerable job of keeping the often-used ones around while evicting those that aren't needed (this is what their "Trust the OS" point is about). Of course they're also reducing the number of active experts and quantizing heavily; AIUI this iPhone experiment uses Q1 and the MacBook one used Q2.
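
To make the caching point concrete, here is a toy simulation of why a skewed routing distribution helps even when the cache holds only a fraction of the expert-layer blocks. Everything in it is an assumption for illustration: the block count, the cache size, the Zipf-like skew, and the use of strict LRU (real OS page caches are not plain LRU, and the actual routing statistics aren't given in this thread).

```python
import random
from collections import OrderedDict


def simulate_hit_rate(weights, cache_slots, steps=20000, seed=0):
    """Replay random routing picks through an LRU cache; return the hit fraction."""
    rng = random.Random(seed)
    block_ids = list(range(len(weights)))
    cache = OrderedDict()  # keys are block ids, ordered oldest -> newest
    hits = 0
    for block in rng.choices(block_ids, weights=weights, k=steps):
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = True  # "read from flash" and keep it resident
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / steps


# Hypothetical numbers: 512 expert-layer blocks, room for 64 of them in RAM.
NUM_BLOCKS = 512
CACHE_SLOTS = 64

uniform = [1.0] * NUM_BLOCKS
skewed = [1.0 / (rank + 1) for rank in range(NUM_BLOCKS)]  # Zipf-like popularity

uniform_rate = simulate_hit_rate(uniform, CACHE_SLOTS)
skewed_rate = simulate_hit_rate(skewed, CACHE_SLOTS)

print(f"uniform routing hit rate: {uniform_rate:.2f}")
print(f"skewed routing hit rate:  {skewed_rate:.2f}")
```

With uniform routing the hit rate stays close to the cached fraction (64/512), so almost every pick goes to storage. With skewed routing the same cache serves a much larger share of picks. It doesn't show that the real workload is that skewed, only that the size of the cache relative to the model isn't the whole story.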