Remix clone Hacker News

new | show | ask | jobs Github

	▲	booty an hour ago
		`I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.` The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance. (not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot) Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.
	▲	mark_l_watson 37 minutes ago \| parent [-]
		You are using Q6 6 bit quantization; on my 32G MacMini I use Q4 and it is faster but when I use it with OpenCode, I set up a task and go outside to walk for ten minutes. Smart, capable, and slow. Still, I love using local models. EDIT: I run with context wired at 64K