daemonologist 8 hours ago
No - this model has the weight memory footprint of a 35B model (you do save a little on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the expert weights onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but that would still be painfully slow.) All that said, you could probably squeeze it onto a 36GB Mac. A lot of people run models this size on 24GB GPUs, at 4-5 bits per weight quantization and maybe with a reduced context size.
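For a rough sense of the numbers, here is a back-of-the-envelope sketch of the weight footprint at 4 and 5 bits per weight. It assumes exactly 35B parameters, counts 1 GB as 10^9 bytes, and ignores the KV cache, activations, and any per-block quantization overhead, so real usage will be somewhat higher. The function name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    extra space for quantization scales, KV cache, or activations.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for bits in (4, 5):
        size = weight_footprint_gb(35, bits)
        print(f"35B params at {bits} bits/weight: ~{size:.1f} GB of weights")
```

Under those assumptions the weights come to about 17.5 GB at 4 bits and about 21.9 GB at 5 bits, which is why a 24GB GPU is workable only with limited room left over for context.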