hasperdi 5 hours ago
I bought a second-hand M1 Ultra Mac Studio with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow. For instance, a 4-bit quantized GLM 4.6 runs very slowly on my Mac. It's not only the tokens-per-second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention the TPS numbers, but they neglect to mention the input loading times.
mechagodzilla 4 hours ago
I've been running the 'frontier' open-weight LLMs (mainly DeepSeek R1/V3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768 GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
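To put rough numbers on the wait described above, here is a minimal back-of-envelope sketch. It assumes total time is simply prompt processing plus generation, each at a constant rate. The function name, the prefill speed of 20 tokens/sec, and the prompt and response lengths are illustrative assumptions, not measurements from anyone in this thread; only the 2 tokens/sec decode rate comes from the comment.

```python
def total_minutes(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Estimate end-to-end response time in minutes.

    prefill_tps: prompt tokens processed per second (input side).
    decode_tps:  tokens generated per second (the usual "TPS" figure).
    Both rates are treated as constant, which is a simplification.
    """
    prefill_seconds = prompt_tokens / prefill_tps
    decode_seconds = output_tokens / decode_tps
    return (prefill_seconds + decode_seconds) / 60


# Generation only: a hypothetical 3600-token answer at 2 tokens/sec.
print(total_minutes(0, 3600, prefill_tps=20, decode_tps=2))     # 30.0

# Same answer, plus a hypothetical 6000-token prompt at 20 tokens/sec.
print(total_minutes(6000, 3600, prefill_tps=20, decode_tps=2))  # 35.0
```

The second call shows why a quoted TPS figure alone understates the wait: a long prompt adds time before the first output token appears.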
jwitthuhn 2 hours ago
At 4 bits that model won't fit into 128 GB, so you're spilling over into swap, which kills performance. I've gotten great results out of glm-4.5-air, which is 4.5 distilled down to 110B params. It fits nicely at 8 bits, or maybe 6 if you want a little more RAM left over.
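The fit-in-RAM arithmetic here can be sketched roughly as parameters times bits per weight, divided by 8 to get bytes. This counts the weights only; KV cache, runtime buffers, and whatever the OS needs are not modeled, and the `reserve_gb` parameter is a hypothetical stand-in for them. The 110B figure is the one quoted in the comment above. Real quant formats often use slightly more bits per weight than their nominal name, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes each, divided by 1e9
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus a chosen reserve fit in ram_gb."""
    return weight_memory_gb(params_billion, bits_per_weight) + reserve_gb <= ram_gb


# 110B parameters (figure from the comment) on a 128 GB machine.
print(weight_memory_gb(110, 8))  # 110.0
print(weight_memory_gb(110, 6))  # 82.5

# With a hypothetical 24 GB held back for cache and the OS,
# 8 bits no longer fits but 6 bits still does.
print(fits(110, 8, ram_gb=128, reserve_gb=24))  # False
print(fits(110, 6, ram_gb=128, reserve_gb=24))  # True
```

Plugging in a larger parameter count shows quickly why a bigger model at 4 bits can overflow the same 128 GB and end up in swap.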
hedgehog 3 hours ago
Have you tried Qwen3 Next 80B? It may run a lot faster, though I don't know how well it does on coding tasks.
Reubend 4 hours ago
Anything except a 3-bit quant of GLM 4.6 will exceed the 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll need to at least fit the entire thing in memory.