zihotki 17 hours ago
From personal experience: it works, but you won't get a comfortable time to first token (latency is high). The reason is that prefill on Macs is slow. You need a lot more cores to do it quickly. For small models on NVIDIA GPUs it's close to instant, but on Macs it takes a few seconds to get an answer even to a simple prompt. And that time grows proportionally with your context size.
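To put rough numbers on the "grows proportionally" part, here is a back-of-the-envelope sketch. It assumes prefill runs at a constant tokens-per-second rate, which is a simplification. The function name and both throughput figures are made up for illustration and are not measurements of any real machine.

```python
def time_to_first_token(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Rough estimate: the whole prompt must be prefilled before the first output token."""
    if prefill_tokens_per_sec <= 0:
        raise ValueError("prefill_tokens_per_sec must be positive")
    return prompt_tokens / prefill_tokens_per_sec


# Hypothetical prefill throughputs (tokens/s), chosen only to show the scaling.
SLOW_PREFILL = 200.0
FAST_PREFILL = 5000.0

for n in (500, 4000, 32000):
    slow = time_to_first_token(n, SLOW_PREFILL)
    fast = time_to_first_token(n, FAST_PREFILL)
    print(f"{n:>6} prompt tokens: {slow:7.1f} s (slow) vs {fast:6.1f} s (fast)")
```

Under this assumption, doubling the context doubles the wait, so a gap that is barely noticeable on a short prompt becomes very noticeable on a long one.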