benob 3 hours ago

I get ~5 tokens/s on an M4 with 32G of RAM, using:

  llama-server \
   -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
   --no-mmproj \
   --fit on \
   -np 1 \
   -c 65536 \
   --cache-ram 4096 -ctxcp 2 \
   --jinja \
   --temp 0.6 \
   --top-p 0.95 \
   --top-k 20 \
   --min-p 0.0 \
   --presence-penalty 0.0 \
   --repeat-penalty 1.0 \
   --reasoning on \
   --chat-template-kwargs '{"preserve_thinking": true}'

The 35B-A3B model is at ~25 t/s. For comparison, on an A100 (roughly an RTX 3090 with more memory) they run at 41 t/s and 97 t/s respectively.

I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models.

danielhanchen 2 hours ago | parent | next [-]

We also made some dynamic MLX ones if they help - they might be faster on Macs, but llama-server is definitely improving at a fast pace.

https://huggingface.co/unsloth/Qwen3.6-27B-UD-MLX-4bit
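
For anyone wanting to try the MLX quant, the mlx-lm package's CLI can pull models straight from the Hub. A minimal sketch (assuming mlx-lm is installed and the repo name above; exact flags depend on your mlx-lm version):

```shell
pip install mlx-lm

# One-off generation; downloads the quantized weights on first run
mlx_lm.generate \
  --model unsloth/Qwen3.6-27B-UD-MLX-4bit \
  --prompt "Explain KV caching in one paragraph." \
  --max-tokens 200
```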

DarmokJalad1701 17 minutes ago | parent [-]

What exactly does the .sh file install? How does it compare to running the same model in, say, omlx?

dunb 2 hours ago | parent | prev | next [-]

Why use --fit on on an M4? My understanding was that, given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
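
Put together, a variant of the parent's command with those flags might look like this (a sketch only - flag availability and syntax depend on your llama.cpp build, so check llama-server --help first):

```shell
llama-server \
  -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
  --n-gpu-layers all \
  --flash-attn on \
  --no-mmap \
  -c 65536 \
  --jinja
```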

wuschel 38 minutes ago | parent | prev | next [-]

How is the quality of model answers to your queries? Are they stable over time?

I am wondering how to measure that anyway.

kpw94 an hour ago | parent | prev [-]

When you say tok/s here are you describing the prefill (prompt eval) token/s or the output generation tok/s?

(Btw I believe the "--jinja" flag has defaulted to true since sometime in late 2025, so it's not needed anymore.)

zargon 44 minutes ago | parent [-]

If someone doesn't specifically say prefill, they always mean decode speed. I've never seen an exception. Most people just ignore prefill.

kpw94 29 minutes ago | parent [-]

But isn't prefill speed the bottleneck in some systems*?

Sure, it's an order of magnitude faster (10x on Apple Metal?), but there's also an order of magnitude more tokens to process, especially for tasks involving summarization of some sort.
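
To make that concrete, here's a back-of-the-envelope sketch. The 250 t/s prefill and 25 t/s decode rates are illustrative assumptions (not the parent's measurements): with prefill 10x faster but 10x the tokens, both phases take comparable wall-clock time.

```shell
# Back-of-the-envelope: time spent in prefill vs decode for a summarization task.
prompt_tokens=20000   # e.g. a long document to summarize
output_tokens=2000    # the generated summary
prefill_tps=250       # assumed prefill rate
decode_tps=25         # assumed decode rate

prefill_s=$(awk "BEGIN { printf \"%.0f\", $prompt_tokens / $prefill_tps }")
decode_s=$(awk "BEGIN { printf \"%.0f\", $output_tokens / $decode_tps }")
echo "prefill: ${prefill_s}s, decode: ${decode_s}s"
```

With these particular numbers, both phases come out to 80 s each, so prefill accounts for half the end-to-end latency.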

But point taken that the parent's numbers are probably decode.

* Specifically Mac Metal, which is what the parent's numbers are about.