Remix.run Logo
giancarlostoro 4 hours ago

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.

StevenWaterman 3 hours ago | parent | next [-]

I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.

I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know

Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.

In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)

zozbot234 3 hours ago | parent [-]

DeepSeek V4 (both Flash and Pro) has very good scaling of context length wrt. RAM use, so this is not an inherent limit of LLMs in general.

0xc133 3 hours ago | parent | prev | next [-]

With yarn and rope scaling arguments for llama.cpp you could run qwen3.6-27B with 1M context… if you have enough memory to store it.

cyanydeez 3 hours ago | parent | prev [-]

I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.

I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.

Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.

Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.