johnisgood 6 days ago

Hardware requirements?

"What you need" only includes software requirements.

DrAwdeOccarim 6 days ago | parent | next [-]

The author says 36 GB unified RAM in the article. I have the same amount of memory on an M3 Pro and run LM Studio daily with various models up to the 30B-parameter one listed, and it flies. I can't tell my OpenAI chats apart from the local ones, aside from more current knowledge, though I have a Puppeteer MCP server which works well for web search and site reading.
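
LM Studio also exposes an OpenAI-compatible server on localhost:1234, which is why local chats can slot in wherever OpenAI ones did. A minimal sketch, assuming the server is running; the model identifier below is a placeholder (use whatever name LM Studio shows for the loaded model):

    # Minimal sketch: drive LM Studio's local OpenAI-compatible server.
    # Assumes the server is running on its default port; the model name
    # below is a placeholder, not necessarily what LM Studio reports.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    resp = client.chat.completions.create(
        model="qwen3-coder-30b-a3b-instruct",  # placeholder identifier
        messages=[{"role": "user", "content": "Explain GQA in two sentences."}],
    )
    print(resp.choices[0].message.content)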

jszymborski 6 days ago | parent | prev | next [-]

The 30B runs at a reasonable speed on my desktop, which has an RTX 2080 (8 GB VRAM) and 32 GB of RAM.
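
That is plausible because the 30B model in question is a mixture-of-experts with only ~3B active parameters per token, so llama.cpp can keep part of the weights on the 8 GB card and the rest in system RAM. A rough sketch of partial offload via llama-cpp-python; the filename and layer count are illustrative assumptions:

    # Sketch of partial GPU offload: some layers on the 8 GB card, the
    # rest in system RAM. Filename and n_gpu_layers are assumptions;
    # in practice you raise n_gpu_layers until VRAM is nearly full.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",  # assumed file
        n_gpu_layers=20,  # partial offload; -1 would push everything to GPU
        n_ctx=16384,      # modest context to leave room in VRAM
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a haiku about VRAM."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])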

Havoc 6 days ago | parent | prev | next [-]

A 30B-class model should run on a consumer 24 GB card when quantised, though it would need a pretty aggressive quant to make room for context. I don't think you'll get the full 256k context, though.

So about 700 bucks for a 3090 on eBay
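
The context ceiling is mostly KV-cache arithmetic. A back-of-the-envelope check, assuming Qwen3-30B-A3B's published shape (48 layers, 4 KV heads via GQA, head dim 128) and an unquantised fp16 cache:

    # KV-cache sizing sketch. Model shape assumes Qwen3-30B-A3B's
    # published config (48 layers, 4 KV heads, head dim 128); fp16 cache.
    n_layers, n_kv_heads, head_dim, bytes_per_elem = 48, 4, 128, 2
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K + V
    print(f"{kv_per_token / 1024:.0f} KiB per token")  # ~96 KiB
    for ctx in (65_536, 262_144):
        print(f"{ctx:>7} tokens -> {kv_per_token * ctx / 2**30:.1f} GiB")
    # ~6 GiB at 64k but ~24 GiB at 256k: the full window alone would
    # fill a 24 GB card before any weights are loaded.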

magicalhippo 6 days ago | parent [-]

I have a 5070 Ti and a 2080 Ti, but I'm running Windows, so I have roughly 25-26 GB available. With Flash Attention enabled, I can just about squeeze in Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL from Unsloth with 64k context entirely on the GPUs.

With a 3090 I guess you'd have to reduce context or go for a slightly more aggressive quantization level.

Summarizing llama-arch.cpp, which is roughly 40k tokens, I get ~50 tok/sec generation speed and ~14 seconds to first token.

For short prompts I get more like ~90 tok/sec and <1 sec to first token.
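
For reference, a sketch of roughly that setup in llama-cpp-python: flash attention on, 64k context, weights split across two cards. The split ratio is an assumption, not the actual configuration above:

    # Two-GPU sketch resembling the setup above: flash attention, 64k
    # context, weights split across cards. Ratio and path are assumptions.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf",
        n_gpu_layers=-1,          # offload every layer
        n_ctx=65536,              # the 64k context mentioned above
        flash_attn=True,          # enables the flash-attention kernels
        tensor_split=[0.6, 0.4],  # assumed 5070 Ti / 2080 Ti proportions
    )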

thecolorblue 6 days ago | parent | prev [-]

I am running it on an M1 Max.