barumrho 6 hours ago

100 tok/s sounds pretty good. What do you get with 70B? With 128 GB, you need quantization to fit a 70B model, right?

I'm wondering whether a local LLM (for coding) is a realistic option; otherwise I wouldn't need to max out the RAM.
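Back-of-envelope arithmetic (my own numbers, ignoring quantization overhead): weight size scales linearly with bit-width, so a 4-bit 70B model is around 35 GB of weights and fits comfortably in 128 GB with room left for KV cache.

```python
def model_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (ignores format overhead)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Illustrative bit-widths; real GGUF quants (e.g. Q4_K_M) carry a bit of overhead.
for bits in (16, 8, 5, 4):
    print(f"70B @ {bits}-bit ~ {model_weight_gb(70, bits):.0f} GB")
```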

super_mario 6 hours ago | parent [-]

I run the gpt-oss 120b model on ollama (about 65 GB on disk) on an M4 Max Mac Studio with 128 GB RAM and get 65 tokens/s. With a 128k context size, the model is so well optimized that it only uses an additional 4.8 GB of RAM for the KV cache.
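A generic KV-cache estimate for the curious (the config numbers below are illustrative assumptions, not the published gpt-oss spec; gpt-oss also reportedly uses sliding-window attention on part of its layers, which is what keeps the cache that small):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache size: 2 (K and V) x layers x KV heads x head dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Assumed illustrative config: 36 layers, 8 KV heads (GQA), head dim 64, fp16 cache.
print(f"{kv_cache_gb(36, 8, 64, 131072):.1f} GB")  # if every layer used full attention
```

If roughly half the layers instead use a short sliding window, only the full-attention half needs a cache that grows with context, which lands in the ~4.8 GB ballpark.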

abhikul0 5 hours ago | parent [-]

Have you tried the dense (27B, 9B) Qwen3.5 models? Or any diffusion models (Flux Klein, Zimage)? I'm trying to gauge how much of a performance boost I'd get upgrading from an M3 Pro.

For reference:

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           pp512 |        288.90 ± 0.67 |
  | qwen35 ?B Q5_K - Medium        |   6.12 GiB |     8.95 B | MTL,BLAS   |       6 |           tg128 |         16.58 ± 0.05 |

  | model                          |       size |     params | backend    | threads |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           pp512 |        615.94 ± 2.23 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | MTL,BLAS   |       6 |           tg128 |         42.85 ± 0.61 |

  Klein 4B completes a 1024px generation in 72 seconds.
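For anyone wanting to reproduce tables in this shape: they look like llama-bench output from llama.cpp. A sketch of the invocation (the model path is a placeholder), where `-p 512`, `-n 128`, and `-t 6` correspond to the pp512 test, the tg128 test, and the threads column:

```shell
# llama-bench ships with llama.cpp and prints a markdown results table
./llama-bench -m models/qwen3.5-9b-q5_k_m.gguf -t 6 -p 512 -n 128
```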