fy20 3 hours ago

Running it on a Macbook Pro M5 48GB:

        llama-server \
        -hf unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL \
        -c 128000 \
        --parallel 1 \
        --flash-attn on \
        --no-context-shift \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.0 \
        --presence-penalty 0.0 \
        --reasoning on \
        --jinja \
        --chat-template-kwargs "{\"preserve_thinking\": true}" \
        --spec-type ngram-simple \
        --draft-max 64 \
        --timeout 1800
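
One knob I still want to try is the prompt-processing batch size: as far as I understand, prefill runs in --ubatch-size chunks, and larger values can raise throughput at the cost of memory. A sketch of the extra flags (untuned guesses on my part, assuming a recent llama.cpp build):

        --batch-size 4096 \
        --ubatch-size 2048 \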
Does anyone have tips for optimising prompt processing, as that's the slowest part? It takes a few minutes before OpenCode, with ~20k tokens of initial context, gets its first response, but subsequent responses are pretty fast thanks to prompt caching.
sleepyeldrazi 39 minutes ago | parent | next [-]

I honestly haven't dug around to figure out if there's a hardware reason for it, but prompt processing has always been a lot slower for me on Macs in general. I mostly use MLX on my 24GB M4 Pro, though, so I'll pull llama.cpp onto it as well to see what the prefill looks like.

I've gotten around 16 t/s generation with 4-bit and mxfp4 quants of that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on context size).
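
A rough sanity check on those numbers (back-of-envelope only; the ~15 GB figure is just my guess for how much weight data a ~27B 4-bit quant has to read per generated token):

        # decode ceiling ~= memory bandwidth (GB/s) / bytes read per token (GB)
        echo "scale=1; 273/15" | bc   # M4 Pro:   ~18 t/s ceiling vs ~16 observed
        echo "scale=1; 936/15" | bc   # RTX 3090: ~62 t/s ceiling vs 30-35 observed

That lines up with the Mac sitting much closer to its theoretical ceiling while still losing on absolute speed.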

Also, if you want to tinker with it a bit more, do run a quick experiment removing the cache quants; IIRC KV cache quantization adds a small overhead during prefill.
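
To be concrete, I just mean setting the two cache-type flags back to the default f16 (or dropping them entirely):

        --cache-type-k f16 \
        --cache-type-v f16 \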

I would be very interested to know your prefill and generation numbers.

jonaustin 2 hours ago | parent | prev [-]

https://github.com/jundot/omlx

Note: the 27B is going to be slow; use the 35B MoE if you want decent tokens/sec.