▲ benob 3 hours ago
I get ~5 tokens/s on an M4 with 32G of RAM, using llama-server. The 35B-A3B model is at ~25 t/s. For comparison, on an A100 (roughly an RTX 3090 with more memory) they run at 41 t/s and 97 t/s respectively. I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models.
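The exact invocation isn't reproduced above; a rough sketch of the kind of llama-server command the replies below are reacting to (the GGUF path and quant are placeholders, and --fit/--jinja are included only because later comments mention them, so this is an assumption rather than the poster's actual command):

    # Hypothetical reconstruction, not the poster's exact command: the model path is a
    # placeholder and the flag set is inferred from the replies below; check
    # `llama-server --help` on your build for exact flag syntax.
    llama-server -m ./35B-A3B-Q4_K_M.gguf --fit --jinja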
▲ danielhanchen 2 hours ago
We also made some dynamic MLX ones if they help - they might be faster on Macs, but llama-server is definitely improving at a fast pace.
▲ dunb 2 hours ago
Why use --fit on an M4? My understanding was that, given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
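Putting those suggestions together, a sketch for unified-memory Apple silicon (the model path is a placeholder, and the flag spellings follow the comment above rather than being verified against a particular llama.cpp build):

    # Sketch of the suggested settings; the GGUF path is a placeholder.
    llama-server -m ./35B-A3B-Q4_K_M.gguf \
      --n-gpu-layers all \
      --flash-attn on \
      --no-mmap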
▲ wuschel 38 minutes ago
How is the quality of the model's answers to your queries? Are they stable over time? I'm wondering how one would measure that, anyway.
▲ kpw94 an hour ago
When you say tok/s here, are you describing the prefill (prompt eval) tok/s or the output generation tok/s? (Btw, I believe the "--jinja" flag has defaulted to on since sometime in late 2025, so it's not needed anymore.)