tptacek 2 months ago

Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was that one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context --- 50-70kloc?

minimaxir 2 months ago | parent | next [-]

Context window is unchanged for Sonnet (200k in / 64k out): https://docs.anthropic.com/en/docs/about-claude/models/overv...

In practice, the 1M context of Gemini 2.5 isn't that much of a differentiator because larger context has diminishing returns on adherence to later tokens.
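For reference, here's a minimal sketch of how those documented limits surface when calling the API through the Anthropic Python SDK (the model id is an assumption; anything beyond the input window comes back as an API error rather than being silently truncated):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; check the models overview page
        max_tokens=64_000,                 # output cap; the 200k window bounds the input
        messages=[{"role": "user", "content": "Review this diff: ..."}],
    )
    print(response.content[0].text)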

Rudybega 2 months ago | parent | next [-]

I'm going to have to heavily disagree. Gemini 2.5 Pro has super impressive performance on large-context problems. I routinely drive it up to 400-500k tokens in my coding agent. It's the only model where that much context produces even remotely useful results.

I think it also crushes most of the long-context benchmarks. On MRCR (multi-round coreference resolution), I believe its performance at 1M tokens beats pretty much every other model's at 128k (o3 may have changed this).

alasano 2 months ago | parent | next [-]

I find that it consistently breaks around that exact range you specified, in the sense that reliability falls off a cliff, even though I've used it successfully close to the 1M token limit.

At 500k+ I will define a task and it will suddenly panic and go back to a previous task that we just fully completed.

vharish 2 months ago | parent | prev | next [-]

Totally agreed on this. The context size is what made me switch to Gemini. Compared to Gemini, Claude's context window length is a joke.

Particularly for indie projects, you can essentially dump the entire codebase into it, and with the Pro reasoning model it's all handled pretty well.

tsurba 2 months ago | parent [-]

Yet somehow chatting with Gemini in the web interface, it forgets everything after 3 messages, while GPT (almost) always feels natural in long back-and-forths. It’s been like this for at least a year.

wglb 2 months ago | parent [-]

My experience has been different. I worked with it to diagnose two different problems. On the last one I counted the questions and answers: there were 15.

egamirorrim 2 months ago | parent | prev | next [-]

OOI what coding agent are you managing to get to work nicely with G2.5 Pro?

Rudybega 2 months ago | parent [-]

I mostly use Roo Code inside VS Code. The modes are awesome for managing context length within a discrete unit of work.

l2silver 2 months ago | parent | prev [-]

Is that a codebase you're giving it?

zamadatix 2 months ago | parent | prev | next [-]

The amount of degradation at a given context length isn't constant, though, so a model with 5x the context can be either completely useless or still better, depending on the strength of the models you're comparing. Gemini actually does really well on both counts (context length and quality at length), but I'm not sure what a hard-numbers comparison to the latest Claude models would look like.

A good deep dive on the context scaling topic in general: https://youtu.be/NHMJ9mqKeMQ

Workaccount2 2 months ago | parent | prev | next [-]

Gemini's real strength is that it can stay on the ball even at 100k tokens in context.

michaelbrave 2 months ago | parent | prev | next [-]

I've had a lot of fun using Gemini's large context. I scrape a Reddit discussion with 7k responses, have Gemini synthesize and categorize it, and by the time it's done and I've had a few back-and-forths with it, I've gotten half of a book written.

That said, I have noticed that if I try to give it additional threads to compare and contrast, once it hits around 300-500k tokens it starts to hallucinate more and forget things more.
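One workaround once a thread outgrows the range where quality holds up is a map-reduce pass: summarize chunks independently, then synthesize the partial summaries. A rough sketch, with call_llm() as a hypothetical stand-in for whatever model/API is actually in use:

    # Map-reduce fallback: keep each model call well inside the usable range.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical stand-in for the real API call

    def chunk(comments: list[str], max_chars: int = 400_000) -> list[list[str]]:
        """Greedily pack comments into chunks of at most ~max_chars each."""
        chunks: list[list[str]] = []
        current: list[str] = []
        size = 0
        for c in comments:
            if current and size + len(c) > max_chars:
                chunks.append(current)
                current, size = [], 0
            current.append(c)
            size += len(c)
        if current:
            chunks.append(current)
        return chunks

    def synthesize(comments: list[str]) -> str:
        # Map: summarize each chunk independently.
        partials = [
            call_llm("Summarize and categorize these comments:\n\n" + "\n".join(part))
            for part in chunk(comments)
        ]
        # Reduce: merge the partial summaries into one synthesis.
        return call_llm("Merge these partial summaries into a single synthesis:\n\n"
                        + "\n\n".join(partials))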

ashirviskas 2 months ago | parent | prev | next [-]

For 3.5/3.7 it's closer to <30k tokens before performance degrades too much. The nominal 200k/64k is meaningless in this context.

jerjerjer 2 months ago | parent [-]

Is there a benchmark to measure real effective context length?

Sure, GPT-4o has a context window of 128k, but it loses a lot from the beginning/middle.

brookst 2 months ago | parent | next [-]

Here's an older study that includes Claude 3.5: https://www.databricks.com/blog/long-context-rag-capabilitie...?

evertedsphere 2 months ago | parent | prev | next [-]

RULER: https://arxiv.org/abs/2404.06654

NoLiMa: https://arxiv.org/abs/2502.05167

bigmadshoe 2 months ago | parent | prev [-]

They often publish "needle in a haystack" benchmarks that look very good, but my subjective experience with a large context is always bad. Maybe we need better benchmarks.
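For what it's worth, the basic needle-in-a-haystack setup is easy to reproduce yourself, which also shows why it can look deceptively good: retrieving one verbatim fact is a much easier task than actually using context that's spread throughout the prompt. A rough sketch, with call_llm() as a hypothetical stand-in for the API under test:

    # Hypothetical stand-in for whichever model API is being tested:
    # takes a prompt string, returns the model's text response.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    FILLER = "The grass is green. The sky is blue. The sun is bright. "
    NEEDLE = "The magic number for project Falcon is 7421."
    QUESTION = "What is the magic number for project Falcon? Reply with the number only."

    def niah_trial(total_chars: int, depth: float) -> bool:
        """Bury the needle at `depth` (0.0 = start, 1.0 = end) of a filler
        haystack of roughly `total_chars` characters and check recall."""
        haystack = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
        pos = int(len(haystack) * depth)
        prompt = haystack[:pos] + NEEDLE + haystack[pos:] + "\n\n" + QUESTION
        return "7421" in call_llm(prompt)

    # Sweep haystack sizes and needle depths to estimate effective context.
    for size in (50_000, 200_000, 800_000):
        hits = sum(niah_trial(size, d / 10) for d in range(11))
        print(f"{size:>8} chars: {hits}/11 depths recovered")

Benchmarks like NoLiMa (linked upthread) try to harden this by requiring associations that can't be found with literal string matching, which tends to separate models much faster.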

strangescript 2 months ago | parent | prev | next [-]

Yeah, but why aren't they attacking that problem? Is it just impossible? Because it would be a really simple win with regard to coding. I'm a huge enthusiast, but I'm starting to feel we're approaching a peak.

VeejayRampay 2 months ago | parent | prev [-]

That is just not correct; it's a big differentiator.

fblp 2 months ago | parent | prev | next [-]

I wish they would increase the context window, or handle it better when the prompt gets too long. Currently users suddenly get "prompt is too long" warnings, which makes it a frustrating model to work with for long conversations, writing, etc.

Other tools may drop some prior context or use RAG to help, but they don't force you to start a new chat without warning.
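Something like the drop-oldest approach is easy to bolt on client-side. A rough sketch, using tiktoken as an approximate counter (it's an OpenAI tokenizer, not Claude's, so the count is only a proxy) and an assumed budget that sits below the 200k window:

    import tiktoken

    # Approximate token counter; close enough to trim proactively
    # instead of waiting for a "prompt is too long" error.
    enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(messages: list[dict]) -> int:
        return sum(len(enc.encode(m["content"])) for m in messages)

    def trim_history(messages: list[dict], budget: int = 150_000) -> list[dict]:
        """Drop the oldest non-system messages until the conversation fits
        under `budget` tokens (an assumed margin below the 200k window)."""
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        while rest and count_tokens(system + rest) > budget:
            rest.pop(0)  # discard the oldest turn first
        return system + rest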

jbellis 2 months ago | parent | prev | next [-]

Not sure what you mean; it's in the headline of the article that Opus 4 has 200k context

(same as Sonnet 3.7 with the beta header)

esafak 2 months ago | parent | next [-]

There's the nominal context length, and the effective one. You need a benchmark like the needle-in-a-haystack or RULER to determine the latter.

https://github.com/NVIDIA/RULER

tptacek 2 months ago | parent | prev [-]

We might be looking at different articles? The string "200" appears nowhere in this one --- or I'm just wrong! But thanks!

jbellis 2 months ago | parent [-]

My mistake, I was in fact looking at one of the linked details pages

keeganpoppen 2 months ago | parent | prev [-]

Context window size is super fake. If you don't have the right context, you don't get good output.