zozbot234 3 hours ago
The writeup from the earlier experiment (running on a MacBook Pro) shows quite clearly that expert routing choices are far from uniform, and that some layer-experts are used only rarely. So you can cut the RAM footprint while still swapping only rarely.
Aurornis 3 hours ago | parent
I understand, but this isn't just a matter of skipping the cache for a few cold experts. This is a 397B model on a device with 12GB of RAM: it's swapping experts in and out constantly, even if the routing distribution isn't uniform. When individual expert sizes are comparable to the device's entire RAM, that's your only option.
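The disagreement above comes down to how much a skewed routing distribution helps an expert cache that is far smaller than the model. A minimal sketch of that trade-off, with made-up numbers (256 experts, 16 cache slots, a Zipf-like skew) that are assumptions for illustration and not the real model's configuration:

```python
import random
from collections import OrderedDict

# Hypothetical sizes for illustration only; NOT the real model's config.
NUM_EXPERTS = 256   # experts per MoE layer (assumed)
CACHE_SLOTS = 16    # how many experts fit in device RAM (assumed)
NUM_TOKENS = 100_000

random.seed(0)

# Zipf-like skew: a few "hot" experts receive most of the traffic,
# mirroring the observation that routing is far from uniform.
weights = [1.0 / (rank + 1) for rank in range(NUM_EXPERTS)]

def simulate(skewed: bool) -> float:
    """Return the hit rate of an LRU cache of expert weights."""
    cache = OrderedDict()
    hits = 0
    for _ in range(NUM_TOKENS):
        if skewed:
            expert = random.choices(range(NUM_EXPERTS), weights=weights)[0]
        else:
            expert = random.randrange(NUM_EXPERTS)  # uniform routing
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            if len(cache) >= CACHE_SLOTS:
                cache.popitem(last=False)  # evict least recently used
            cache[expert] = True           # "load" the expert from disk
    return hits / NUM_TOKENS

print(f"uniform routing hit rate: {simulate(skewed=False):.2%}")
print(f"skewed routing hit rate:  {simulate(skewed=True):.2%}")
```

Under uniform routing the hit rate stays near CACHE_SLOTS / NUM_EXPERTS, so almost every token forces a swap; skew raises it substantially, which supports the first point. But if each expert were comparable in size to total RAM, CACHE_SLOTS would effectively be 1 and even a high skew could not prevent constant swapping, which is the second point.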