	▲	zihotki 17 hours ago
		From personal experience: it works, but you won't get a comfortable time to first token (latency is high). The reason is that prefill on Macs is slow; you need a lot more cores to do it quickly. For small models on NVIDIA GPUs it's close to instant, but on Macs it takes a few seconds to get an answer to even a simple prompt. And that time grows proportionally with your context size.
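
		To put rough numbers on the "grows proportionally" part, here is a back-of-the-envelope sketch. It assumes time to first token is dominated by prefill and scales linearly with prompt length, which is the claim above taken at face value. The function name and the throughput figure are hypothetical placeholders, not measurements from any machine.

```python
def time_to_first_token(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Rough time-to-first-token estimate in seconds.

    Assumes prefill dominates latency and scales linearly with prompt length.
    """
    if prefill_tokens_per_sec <= 0:
        raise ValueError("prefill_tokens_per_sec must be positive")
    return prompt_tokens / prefill_tokens_per_sec


# Hypothetical prefill throughput, for illustration only (not a benchmark).
ASSUMED_PREFILL_TOKENS_PER_SEC = 200.0

if __name__ == "__main__":
    for prompt_tokens in (500, 4_000, 32_000):
        seconds = time_to_first_token(prompt_tokens, ASSUMED_PREFILL_TOKENS_PER_SEC)
        print(f"{prompt_tokens} prompt tokens: about {seconds:.1f} s before the first token")
```

		Swap in a prefill rate you measured yourself; the point is only that under this model, doubling the context doubles the wait.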