| ▲ | boutell 6 hours ago | |||||||
That 3090 is going to burn 750W and it will still cap you at a 4-bit quant and ~48K context. Here's someone who worked through it: https://github.com/noonghunna/qwen36-27b-single-3090 It flies, though: 50-70 tps is impressive for a model this smart. I went through roughly the same process to get it working on my M2 MacBook Pro, at awful speeds of course, since models like this one are mostly bound by memory bandwidth. | ||||||||
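A rough way to see why memory bandwidth sets the ceiling: at batch size 1, each generated token has to stream roughly the whole set of weights from VRAM once. The sketch below is a back-of-envelope estimate, not a benchmark. It assumes a 27B-parameter model (inferred from the repo name), a 4-bit quant, and the RTX 3090's published 936 GB/s memory bandwidth; it ignores KV-cache reads, kernel overhead, and speculative decoding, and `max_tokens_per_second` is just an illustrative helper name.

```python
def max_tokens_per_second(params_billion: float,
                          bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    # Weight footprint in GB: parameters (in billions) * bytes per parameter.
    weight_gb = params_billion * bits_per_weight / 8
    # Tokens per second cannot exceed bandwidth divided by bytes read per token.
    return bandwidth_gb_s / weight_gb


# Assumed figures: 27B parameters, 4-bit weights, 936 GB/s (RTX 3090 spec).
rtx3090_bound = max_tokens_per_second(27, 4, 936)
print(f"RTX 3090 ceiling: ~{rtx3090_bound:.1f} tokens/s")
```

Under those assumptions the weights come to about 13.5 GB and the ceiling lands near the top of the 50-70 tps range quoted above. Plugging in a laptop's much lower memory bandwidth gives a proportionally lower ceiling.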
| ▲ | stymaar 6 hours ago | |||||||
> That 3090 is going to burn 750W

The 3090's TDP is 350W, but given that LLM token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat: these figures are without speculative decoding and at batch size 1. | ||||||||
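To put the power numbers in per-token terms, here is a small sketch of the arithmetic. It assumes throughput stays the same when the card is limited (per the "without any degradation" recollection above) and uses 60 tokens/s purely as an illustrative figure; `joules_per_token` is a made-up helper, not part of any tool.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second."""
    return power_watts / tokens_per_second


ASSUMED_TPS = 60.0  # illustrative throughput, assumed unchanged by the power limit

stock = joules_per_token(350, ASSUMED_TPS)    # card at its 350W TDP
limited = joules_per_token(225, ASSUMED_TPS)  # midpoint of the 200-250W range
saving = 1 - limited / stock

print(f"stock: {stock:.2f} J/token, limited: {limited:.2f} J/token, "
      f"saving: {saving:.0%}")
```

If throughput really is unaffected, the saving depends only on the ratio of the two power levels, not on the assumed tokens/s.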
| ▲ | hughw 4 hours ago | |||||||
My eyes glaze over reading all the AI-produced verbiage. I did find a few useful parameter settings I've already discovered using my single 3090 and ollama. I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask them to restate concisely, and that helps. [edited to mention ollama as a nice alt] | ||||||||