▲ | cahaya 4 days ago
Nice. Seems like I can't run this on my Apple Silicon M chips, right?
▲ | poorman 3 days ago | parent | next [-]
If you have 64 GB of RAM you should be able to run the 4-bit quantized MLX models, which are built specifically for Apple Silicon M chips. https://huggingface.co/collections/mlx-community/qwen3-next-...
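For reference, a minimal sketch of loading one of those 4-bit MLX weights with mlx_lm's Python API (assuming you have mlx-lm installed and a recent enough build for Qwen3-Next support; the prompt here is just a placeholder):

    # pip install mlx-lm  (may need a source build for Qwen3-Next)
    from mlx_lm import load, generate

    # Load the 4-bit community quantization mentioned above (~40+ GB of RAM needed)
    model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit")

    # Generate a short completion; verbose=True also prints tokens/sec stats
    text = generate(model, tokenizer,
                    prompt="Explain KV caching in two sentences.",
                    max_tokens=200, verbose=True)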
▲ | tripplyons 3 days ago | parent | prev | next [-]
I can host it on my M3 laptop at around 30-40 tokens per second using mlx_lm's server command:

    mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444

I'm not sure if there is support for Qwen3-Next in any mlx_lm releases yet; when I set up the Python environment I had to install mlx_lm from source.
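A quick way to hit that server, assuming mlx_lm.server exposes its usual OpenAI-compatible chat completions endpoint on the chosen port (4444 here):

    import requests

    # Send a chat request to the locally hosted model
    resp = requests.post(
        "http://localhost:4444/v1/chat/completions",
        json={
            "model": "mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit",
            "messages": [{"role": "user", "content": "Hello from an M3!"}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])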
▲ | mhuffman 3 days ago | parent | prev | next [-]
This particular one may not work on M chips, but the model itself does. I just tested a different sized version of the same model in LM Studio on a MacBook Pro (64GB M2 Max, 12 cores), just to see.

Prompt: Create a solar system simulation in a single self-contained HTML file.

qwen3-next-80b, 4-bit (MLX format, 44.86 GB): 42.56 tok/sec, 2523 tokens, 12.79s to first token. Note: looked like ass, simulation broken, didn't work at all.

Then, as a comparison with a model of similar size, I tried GLM.

GLM-4-32B-0414-8bit (MLX format, 36.66 GB): 9.31 tok/sec, 2936 tokens, 4.77s to first token. Note: looked fantastic for a first try, everything worked as expected.

Not a fair comparison (4-bit vs 8-bit), but it's some data. The tok/sec on Mac is pretty good depending on the models you use.
▲ | jasonjmcghee 4 days ago | parent | prev | next [-]
Depends on how much RAM yours has. Get a 4-bit quant and it'll fit in ~40-50GB depending on the context window. And it'll run at around 40 t/s depending on which one you have.
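Rough back-of-the-envelope for that figure, assuming ~80B total parameters at roughly half a byte per weight for a 4-bit quant:

    # Rough memory estimate, not an exact accounting (scales, embeddings and
    # any higher-precision layers add a bit on top)
    params = 80e9              # Qwen3-Next-80B total parameter count
    bytes_per_param = 0.5      # 4-bit quantization ~= 0.5 bytes per weight
    weights_gb = params * bytes_per_param / 1e9
    print(f"~{weights_gb:.0f} GB for weights alone")   # ~40 GB
    # KV cache and runtime overhead add several more GB depending on context
    # length, which is roughly where the ~40-50 GB range comes from.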
▲ | anuarsh 3 days ago | parent | prev [-]
I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work, I assume. The Llama3 versions use CUDA-specific loading logic for a speed boost, so those won't work for sure.
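Purely illustrative (not the actual loading code): the difference between hard-coding CUDA and a device-agnostic fallback that would also cover Apple Silicon via PyTorch's MPS backend looks roughly like this:

    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():           # NVIDIA GPUs
            return torch.device("cuda")
        if torch.backends.mps.is_available():   # Apple Silicon (Metal)
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()
    # A CUDA-only version would instead do something like
    #   state = torch.load("checkpoint.pt", map_location="cuda")
    # which fails outright on machines without CUDA.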