randomgermanguy 4 days ago

Depends on how heavy one wants to go with the quants (for Q6-Q4 the AMD Ryzen AI MAX chips seem like a better/cheaper way to get started).

Also, the Mac Studio is a bit hampered by its low compute power: you really can't use a 100B+ dense model, only an MoE is feasible without running into multi-minute prompt-processing times (assuming 500+ token prompts, etc.)
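
Rough intuition for why: compute-bound prefill costs about 2 FLOPs per active parameter per prompt token, so a 120B dense model does roughly 25x the prefill work of an MoE with ~5B active parameters. A back-of-envelope sketch (the 5 effective-TFLOPS figure is an illustrative assumption, not a measured number):

  # Prefill time ~ 2 FLOPs * active params * prompt tokens / sustained compute.
  # effective_tflops is an assumed sustained rate, not a peak spec.
  def prefill_seconds(prompt_tokens, active_params_b, effective_tflops=5.0):
      flops = 2 * active_params_b * 1e9 * prompt_tokens
      return flops / (effective_tflops * 1e12)

  print(prefill_seconds(4096, 120))  # 120B dense, 4k prompt -> ~200 s
  print(prefill_seconds(4096, 5))    # ~5B-active MoE        -> ~8 s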

GeekyBear 3 days ago | parent | next [-]

Given the RAM limitations of the first gen Ryzen AI MAX, you have no choice but to go heavy on the quantization of the larger LLMs on that hardware.

mercutio2 3 days ago | parent | prev [-]

Huh? My maxed out Mac Studio gets 60-100 tokens per second on 120B models, with latency on the order of 2 seconds.

It was expensive, but slow it is not for small queries.

Now, if I want to bump the context window to something huge, it does take 10-20 seconds to respond for agent tasks, but it’s only 2-3x slower than paid cloud models, in my experience.

Still a little annoying, and the models aren’t as good, but the gap isn’t nearly as big as you imply, at least for me.

zargon 3 days ago | parent | next [-]

GPT-OSS 120B has only ~5B active parameters. GP specifically said dense models, not MoE.

EnPissant 3 days ago | parent | prev | next [-]

I think the Mac Studio is a poor fit for gpt-oss-120b.

On my 96 GB DDR5-6000 + RTX 5090 box, I see ~20s prefill latency for a 65k prompt and ~40 tok/s decode, even with most experts on the CPU.

A Mac Studio will decode faster than that, but prefill will be tens of times slower due to much lower raw compute than a high-end GPU. For long prompts that can make it effectively unusable; that's what the parent was getting at. You will hit this long before 65k context.
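
To make "tens of times slower" concrete: prefill latency is just prompt length divided by prefill throughput. The Mac-side rates below are hypothetical placeholders, not measurements; the 3300 tok/s figure is just 65536 tokens / ~20 s from the numbers above.

  prompt = 65536
  for name, pp_tok_s in [("5090 + CPU box (from ~20 s above)", 3300),
                         ("hypothetical Mac, 10x slower prefill", 330),
                         ("hypothetical Mac, 50x slower prefill", 66)]:
      print(f"{name}: {prompt / pp_tok_s / 60:.1f} min")
  # -> ~0.3 min, ~3.3 min, ~16.6 min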

If you have time, could you share numbers for something like:

  llama-bench -m <path-to-gpt-oss-120b.gguf> -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096

Edit: The only Mac Studio pp65536 datapoint I’ve found is this Reddit thread:

https://old.reddit.com/r/LocalLLaMA/comments/1jq13ik/mac_stu ...

They report ~43.2 minutes of prefill latency for a 65k prompt on a 2-bit DeepSeek quant. gpt-oss-120b should be faster than that, but still very slow.
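
That works out to roughly 25 tokens/s of implied prefill throughput:

  print(65536 / (43.2 * 60))  # ~25 tok/s prefill for that quant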

int_19h 3 days ago | parent [-]

This is a Mac Studio M1 Ultra with 128 GB of RAM.

  > llama-bench -m ./gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096       
                                                                                             
  | model                          |       size |     params | backend    | threads | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |         pp65536 |       392.37 ± 43.91 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |           tg128 |         65.47 ± 0.08 |
  
  build: a0e13dcb (6470)
EnPissant 2 days ago | parent [-]

Thanks. That's better than I expected. It's only ~8.3x worse than the 5090 + CPU box: ~167 s of prefill latency for the 65k prompt vs ~20 s.
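
For anyone checking the arithmetic, both numbers follow directly from the pp65536 row above and the ~20 s 5090 figure earlier in the thread:

  mac_prefill_s = 65536 / 392.37        # ~167 s from the pp65536 result above
  gpu_prefill_s = 20.0                  # ~20 s measured on the 5090 box
  print(mac_prefill_s / gpu_prefill_s)  # ~8.3x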
