Kostic 5 hours ago
For personal use I connected VSCode to llama.cpp running Qwen 3.6 27B or Gemma 4 31B, and it's good enough to cancel my cloud subscription. Qwen runs on my first GPU at q4 with 176k context, at 70 down to 50 tok/s with MTP, which is pretty good for coding. Gemma, on the other hand, uses both GPUs and runs at q8 with 64k context, doing document sentiment analysis, summarization, proofreading and translation at a consistent 25 tok/s. Somewhat slow, but usable for batched workflows. I might get some more speed once llama.cpp starts supporting MTP with tensor split mode.

I'm still using frontier LLMs at my day job, since I'm not paying for them and they're obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.

EDIT: Prompt processing starts at 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k tokens and take 60 to 90 seconds to process. Not great but acceptable.
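For anyone who wants to sanity-check the prompt-processing figures, here is a rough back-of-the-envelope sketch. It assumes "16k" and "24k" mean 16,000 and 24,000 tokens, and that the 60 s and 90 s timings pair with the smaller and larger prompt respectively; neither pairing is stated in the comment, and the function names are made up for illustration.

```python
def effective_rate(prompt_tokens: int, seconds: float) -> float:
    """Average prompt-processing throughput in tokens per second."""
    return prompt_tokens / seconds


def processing_time(prompt_tokens: int, rate: float) -> float:
    """Seconds needed to process a prompt at a constant rate (tokens/s)."""
    return prompt_tokens / rate


# Figures quoted in the comment; the (tokens, seconds) pairing is an assumption.
reported = [(16_000, 60.0), (24_000, 90.0)]

for tokens, seconds in reported:
    rate = effective_rate(tokens, seconds)
    print(f"{tokens} tokens in {seconds:.0f} s -> about {rate:.0f} t/s average")

# For comparison: time at the quoted peak and floor rates, held constant.
for tokens, _ in reported:
    fast = processing_time(tokens, 800.0)
    slow = processing_time(tokens, 400.0)
    print(f"{tokens} tokens: {fast:.0f} s at 800 t/s, {slow:.0f} s at 400 t/s")
```

Under these assumptions the reported timings imply an average somewhat below the quoted 400 t/s floor. That could mean the rate keeps falling on longer prompts or that the timings include other overhead, but that is a guess and not something the comment states.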