barumrho 6 hours ago
100 tok/s sounds pretty good. What do you get with 70B? With 128 GB, you need quantization to fit a 70B model, right? Wondering if a local LLM (for coding) is a realistic option; otherwise I wouldn't have to max out the RAM.
super_mario 6 hours ago | parent
I run the gpt-oss 120B model on ollama (about 65 GB on disk) with a 128k context window on an M4 Max Mac Studio with 128 GB RAM, and I get 65 tokens/s. The model is heavily optimized: the KV cache only uses about 4.8 GB of additional RAM at that context size.
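For anyone who wants to try a similar setup, here's a rough sketch using ollama's Modelfile mechanism to pin the context window. The model tag and custom model name are assumptions; check what `ollama pull` actually offers on your version.

```shell
# Pull the model (large download, roughly 65 GB; tag is an assumption)
ollama pull gpt-oss:120b

# Bake a 128k context window into a derived model via a Modelfile.
# num_ctx is ollama's context-size parameter.
cat > Modelfile <<'EOF'
FROM gpt-oss:120b
PARAMETER num_ctx 131072
EOF
ollama create gpt-oss-128k -f Modelfile

# --verbose prints timing stats after the response,
# including the eval rate in tokens/s
ollama run gpt-oss-128k --verbose "Write a quicksort in Python."
```

You can also set `num_ctx` per-request through the API's `options` field instead of creating a derived model, which avoids keeping an extra entry in `ollama list`.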