cahaya 4 days ago

Nice. Seems like I can't run this on my Apple silicon M chips, right?

poorman 3 days ago | parent | next [-]

If you have 64 GB of RAM you should be able to run the 4-bit quantized MLX models, which are built specifically for Apple silicon M chips. https://huggingface.co/collections/mlx-community/qwen3-next-...
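For reference, a minimal sketch of what that looks like through mlx_lm's Python API (assuming mlx-lm is installed and you pull the 4-bit Instruct quant from that collection; adjust the repo name if you want a different variant):

  # Minimal sketch: load a 4-bit MLX quant and generate locally on Apple silicon.
  # Assumes `pip install mlx-lm` and enough unified memory for the chosen model.
  from mlx_lm import load, generate

  model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit")
  prompt = "Explain mixture-of-experts models in two sentences."
  text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
  print(text)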

cahaya 3 days ago | parent [-]

Got 32GB, so I was hoping I could use ollm to offload it to my SSD. Slower, but it makes it possible to run bigger models (in emergencies).

tripplyons 3 days ago | parent | prev | next [-]

I can host it on my M3 laptop at somewhere around 30-40 tokens per second using mlx_lm's server command:

mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444

I'm not sure whether any mlx_lm release supports Qwen3-Next yet; when I set up the Python environment I had to install mlx_lm from source.
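Once that's up, the server speaks an OpenAI-style chat completions API, so something like this should work against the port from the command above (model name and port copied from it; adjust to taste):

  # Rough sketch: query mlx_lm.server's OpenAI-compatible endpoint started above.
  import json
  import urllib.request

  payload = {
      "model": "mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit",
      "messages": [{"role": "user", "content": "Say hello in one sentence."}],
      "max_tokens": 128,
  }
  req = urllib.request.Request(
      "http://localhost:4444/v1/chat/completions",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      reply = json.loads(resp.read())
  print(reply["choices"][0]["message"]["content"])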

mhuffman 3 days ago | parent | prev | next [-]

This particular one may not work on M chips, but the model itself does. I just tested a different-sized version of the same model in LM Studio on a MacBook Pro, 64GB M2 Max with 12 cores, just to see.

Prompt: Create a solar system simulation in a single self-contained HTML file.

qwen3-next-80b-4bit (MLX format, 44.86 GB), 42.56 tok/sec, 2523 tokens, 12.79s to first token

- note: looked like ass, simulation broken, didn't work at all.

Then as a comparison for a model with a similar size, I tried GLM.

GLM-4-32B-0414-8bit (MLX format, 36.66 GB), 9.31 tok/sec, 2936 tokens, 4.77s to first token

- note: looked fantastic for a first try, everything worked as expected.

Not a fair comparison (4-bit vs 8-bit), but it's some data. The tok/sec on Mac is pretty good depending on the model you use.

jasonjmcghee 4 days ago | parent | prev | next [-]

Depends on how much RAM yours has. Get a 4-bit quant and it'll fit in ~40-50GB depending on the context window.

And it'll run at around 40 t/s, depending on which chip you have.
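Rough back-of-envelope for that figure, if it helps (weights only; the KV cache grows with the context window on top of this, which is why the total varies):

  # Back-of-envelope: 80B parameters at roughly 4.5 bits/weight (4-bit quants carry
  # per-group scales, so they land a bit above 4 bits in practice), weights only.
  params = 80e9
  bits_per_weight = 4.5
  weight_gb = params * bits_per_weight / 8 / 1e9
  print(f"~{weight_gb:.0f} GB for weights alone")  # ~45 GB, consistent with the 44.86 GB 4-bit MLX download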

anuarsh 3 days ago | parent | prev [-]

I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work, I assume. The Llama3 versions use CUDA-specific loading logic for a speed boost, so those won't work for sure.