> Is not far at all from proprietary models if you give it tools, skills and agents etc,

I use Qwen 3.6 27B, the dense version of this model which is slightly better.

I don't agree that it's close at all. Maybe for some small, easy tasks, but not for working on real codebases. It's amazing for something I can run at home, but the difference between it and Opus or GPT-5.5 is huge.

▲ trilogic 3 hours ago | parent | next [-]

Really, how so? We work with codebases daily, so can you give us a concrete example? In our case we run on consumer(ish) hardware with 10 million ctx (1 million output and 1 million input proven; it sometimes loops or breaks past 500k ctx, but it runs at ~17 tps, linear). It can read the full codebase, launch agents, and write to disk, editing and patching files to create a full app in 3-4 minutes. It can do web search and RAG pretty fast, and it understands the user query and system prompts and adapts or fixes them on the fly if needed. I am wondering what more you do?

▲ trilogic 3 hours ago | parent [-]

Edit: Forgot to mention that it can process images, PDFs, and hundreds of other file types; it can even create presentations in code, Mermaid, SVG, Chart.js, etc. Here is a basic version of it: https://hugston.com/chat

▲ rspoerri 3 hours ago | parent [-]

How do you do 1M context with Qwen3.6 27B, which only supports 256k? And what hardware would you run that on? AFAIK 2x 3090 currently maxes out at 256k context.

▲ nyrikki 2 hours ago | parent | next [-]

You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0]

Personally I am using `--no-context-shift` and feeding context back in on failure at the harness level.
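One way a harness might "feed context back in on failure" is sketched below. This is only an illustration, not the setup described in the comment: `ContextOverflow`, `generate_with_refeed`, and the injected `generate` callable are all hypothetical names, and dropping the older half of the conversation is an assumed trimming policy.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ContextOverflow(Exception):
    """Hypothetical error the harness raises when a request no longer fits the context."""


def generate_with_refeed(
    generate: Callable[[List[Message]], str],
    messages: List[Message],
    max_retries: int = 3,
    keep_last: int = 4,
) -> str:
    """Call `generate`; on overflow, trim old turns and resend the rest."""
    current = list(messages)
    for _ in range(max_retries + 1):
        try:
            return generate(current)
        except ContextOverflow:
            system = [m for m in current if m["role"] == "system"]
            rest = [m for m in current if m["role"] != "system"]
            if len(rest) <= keep_last:
                raise  # nothing left that is safe to drop
            # Assumed policy: keep system prompts, drop the older half of the turns.
            current = system + rest[len(rest) // 2:]
    raise ContextOverflow("request still too long after retries")
```

With context shifting disabled, the server should not silently discard old tokens, so the harness stays in control of what gets dropped and when.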

I have 2x1080ti + 1xTitanV that run the full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad.

I also have a 1x3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, though with only 130k context. If you have a modular programming style, both work pretty well.

But play with YaRN if you really need it.

[0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
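As a rough sketch of what "play with YaRN" means in practice, the snippet below works out a scale factor and prints the corresponding llama.cpp options. The native context of 262,144 comes from the thread; rounding the factor up to a whole number is my assumption, and the flag names (`--rope-scaling`, `--rope-scale`, `--yarn-orig-ctx`) are llama.cpp's RoPE options as I understand them, so check `llama-server --help` for your build. `yarn_args` is a hypothetical helper.

```python
import math
from typing import List


def yarn_args(native_ctx: int, target_ctx: int) -> List[str]:
    """Build llama.cpp-style arguments for extending context with YaRN."""
    if target_ctx <= native_ctx:
        # No scaling needed when the target fits in the trained context.
        return ["--ctx-size", str(target_ctx)]
    # Assumption: round the scale factor up to the next whole number.
    scale = math.ceil(target_ctx / native_ctx)
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(native_ctx),
    ]


print(" ".join(yarn_args(262_144, 1_000_000)))
# --ctx-size 1000000 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```

Scaling only changes the positional encoding; the KV cache for the longer context still has to fit in memory.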

▲ Vaskivo an hour ago | parent [-]

How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.

Here's my setup:

  llama-server \
  --port 9999 \
  --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --ctx-size 128000 \
  --threads 12 \
  --flash-attn on \
  --device CUDA0 \
  --jinja \
  --gpu-layers 52 \
  --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
  --spec-type draft-mtp --spec-draft-n-max 2

(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)

▲ nyrikki 32 minutes ago | parent [-]

(Note UPDATED config)

Yeah, if you are using the CPU it may slow down quickly.

This may be a bit huge and overcomplicated. On this host I am running it on an AMD Ryzen 7 5700G so that I can use the APU for display output and dedicate the 3090 to llama.cpp.

    podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 131072 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 6 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --prio 3 \
    --poll 100 \
    --port 8080 \
    --host 0.0.0.0

I am just building the container with:

     podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .

And here are the logs from a 'make me a flappy bird program in python' webui prompt:

     prompt eval time =     105.86 ms /    19 tokens (    5.57 ms per token,   179.47 tokens per second)
       eval time =  100549.41 ms /  4608 tokens (   21.82 ms per token,    45.83 tokens per second)
      total time =  100655.28 ms /  4627 tokens
     draft acceptance rate = 0.47215 ( 3408 accepted /  7218 generated)

I am down to ~25.54 t/s with a 95% full context.
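For anyone comparing their own numbers, the figures in that log can be reproduced from the raw counts. This is just arithmetic on the values quoted above; `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and elapsed milliseconds into tokens per second."""
    return tokens / (elapsed_ms / 1000.0)


# Values copied from the log lines above.
eval_tps = tokens_per_second(4608, 100549.41)
acceptance = 3408 / 7218  # accepted draft tokens / generated draft tokens

print(f"{eval_tps:.2f} t/s, draft acceptance {acceptance:.5f}")
# 45.83 t/s, draft acceptance 0.47215
```

An acceptance rate under 0.5 means more than half of the drafted tokens were thrown away in this run.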

▲ nyrikki 5 minutes ago | parent [-]

That config looked too complicated. Getting rid of `--prio 3` and `--poll 100`, setting `--spec-draft-n-max` to the now-recommended value, etc. kicked it up to 61 t/s.

I think those flags were all about some earlier crashes.

    podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 128000 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --port 8080 \
    --host 0.0.0.0

▲ omneity 3 hours ago | parent | prev | next [-]

You can increase the context window beyond its max trained context using RoPE scaling[0] which will require more VRAM.

But you can increase your context window for the same VRAM by quantizing the KV cache with FP8 (double the context) or TurboQuant (more than double)[1].

0: https://medium.com/@leannetan/extending-context-length-with-...

1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
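A back-of-the-envelope sketch of why halving the KV cache precision doubles the context that fits in the same VRAM. The layer and head counts are hypothetical placeholders, not Qwen3.6 27B's real architecture, and the formula assumes a plain transformer where every layer stores full K and V tensors; `kv_bytes_per_token` and `max_tokens` are illustrative names.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: K and V for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_tokens(budget_bytes: int, **shape: int) -> int:
    """How many tokens of KV cache fit in a fixed memory budget."""
    return budget_bytes // kv_bytes_per_token(**shape)


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # 8 GiB set aside for the KV cache

fp16_tokens = max_tokens(budget, bytes_per_elem=2, **shape)
fp8_tokens = max_tokens(budget, bytes_per_elem=1, **shape)
print(fp16_tokens, fp8_tokens)
# 32768 65536
```

Real quantized cache formats can carry extra per-block scale data, so the gain in practice may be slightly under 2x.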

▲ trilogic 2 hours ago | parent | prev [-]

We managed to increase the ctx for any LLM that has been converted to GGUF; here are the experimental tests: https://www.reddit.com/r/Hugston/

▲ 0xbadcafebee 20 minutes ago | parent | prev | next [-]

> not for working on real codebases

You don't pick just one model to "work on real codebases". You use a very advanced model to plan, and a not-very-advanced, cheaper, faster model to execute planned tasks. This saves money and speeds up work. This is the guidance from Anthropic & OpenAI.

▲ tedivm 3 hours ago | parent | prev [-]

I've had the opposite experience, and have built multiple fantastic applications with Qwen3.6 27b. What quantization have you tested with?

▲ trilogic 3 hours ago | parent | next [-]

As funny as it may sound, a q4_k_m that is well converted and properly quantized (and finetuned, which is imperative) would do the job. The 27b may be good, but it is heavy; it burns the hardware. I personally prefer the 397B if I am stuck and can't progress; it can still run at 7 tps. Now with MTP (multi-token prediction) it nearly doubles the speed (reached 82 tps today with the 35b at 100000 ctx). I recommend you give it a try.

▲ hedgehog 3 hours ago | parent | prev [-]

Similarly I haven't seen Qwen 27B as remotely competitive with Opus, at least Q4 hooked up to Claude Code. What harness are you using?