esperent 2 days ago

I don't know about GPT-4, but the latest one (GPT 5.2) has a 200k context window while Gemini has 1M, five times larger. You'll want to stay within the first 100k tokens on all of them to avoid hitting quotas very quickly, though (either start a new task or compact when you reach that point), so in practice there's little difference.

I've been cycling between a couple of $20 accounts to avoid running out of quota, and the latest models from all of them are great. I'd give GPT 5.2 Codex a slight edge, but not by a lot.

The latest Claude is about the same too but the limits on the $20 plan are too low for me to bother with.

The last week has made me realize how close these are to being commodities already. Even the CLI agents are nearly the same, bar some minor quirks (although I've hit more bugs in Gemini CLI, each time I could just save a checkpoint and restart).

The real differentiating factor right now is quota and cost.

mcv 13 hours ago

> You'll be wanting to stay within the first 100k on all of them

I must admit I have no idea how to do that or what that even means. I get that a bigger context window is better, but what does it mean exactly? How do you stay within that first 100k? 100k of what, exactly?

hawtads 11 hours ago

Okay, here's the tl;dr:

Attention-based neural network architectures (on which the majority of LLMs are built) have a unit economic cost that scales roughly as n^2, i.e. quadratically, in both memory and compute. In other words, the longer the context window, the more expensive it is for the upstream provider. That's one cost.
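
To make that concrete, here's a rough sketch of how the raw attention compute grows with context length (the hidden dimension is an assumed placeholder, and layer counts and constants are ignored; only the shape of the curve matters):

    # Rough illustration of quadratic attention cost: every token attends to
    # every other token, so the score matrix alone is n x n.
    def attention_flops(n_tokens: int, d_model: int = 4096) -> int:
        # QK^T scores plus the weighted sum over V, per layer; constants omitted.
        return 2 * n_tokens * n_tokens * d_model

    for n in (10_000, 100_000, 200_000, 1_000_000):
        print(f"{n:>9} tokens -> ~{attention_flops(n):.2e} FLOPs per layer")
    # 10x more context (100k -> 1M) costs roughly 100x more compute per layer.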

The second cost is that you have to resend the entire context every time you send a new message. So the context is basically (where a, b, and c are messages): first request: a, second request: a->b, third request: a->b->c. From the developer's point of view it's a mostly stateless process (there are some short-term caching mechanisms, YMMV based on provider, which is why "cached" tokens, especially system prompts, are cheaper); the state, i.e. the context window string, is managed by the end-user application (in other words, the coding agent, the IDE, the ChatGPT UI client, etc.).
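
A minimal sketch of that pattern (chat_api here is a stand-in for whatever provider endpoint you actually call, not a real API):

    # The client owns the state and resends all of it on every turn.
    history = []

    def chat_api(messages):
        # Placeholder for a real provider call; it just echoes so the sketch runs.
        return f"echo: {messages[-1]['content']}"

    def send(user_message):
        history.append({"role": "user", "content": user_message})
        reply = chat_api(messages=history)   # the entire history goes over the wire
        history.append({"role": "assistant", "content": reply})
        return reply

    send("a")   # request contains: a
    send("b")   # request contains: a -> b
    send("c")   # request contains: a -> b -> c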

The per-token price is an amortized (averaged) cost of memory + compute; the actual cost is mostly quadratic with respect to each marginal token. The longer the context window, the more expensive things are. Because of this, AI agent providers (especially those that charge flat-fee subscription plans) are incentivized to keep costs low by limiting the maximum context window size.

(And if you think about it carefully, your AI API costs are a quadratic cost curve projected onto a linear price: a flat fee per token. So the model hosting provider may in some cases make more profit if users send shorter contexts than if they constantly saturate the window. YMMV of course, but it's a race to the bottom right now for LLM unit economics.)

They do this by interrupting a task partway through and generating a "summary" of the progress so far, then prompting the LLM again with a fresh prompt plus that summary, and the LLM restarts the task from where it left off. Of course text is a poor representation of the LLM's internal state, but it's the best option so far for AI applications to keep costs low.
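
Roughly, that compaction loop looks like this (the 100k threshold, the token estimate, and the helper names are all illustrative, not any particular agent's implementation):

    COMPACTION_THRESHOLD = 100_000   # tokens; illustrative, not a real provider limit

    def count_tokens(messages):
        # Crude stand-in for a real tokenizer: assume ~4 characters per token.
        return sum(len(m["content"]) for m in messages) // 4

    def maybe_compact(history, summarize):
        if count_tokens(history) < COMPACTION_THRESHOLD:
            return history
        # One extra LLM call produces a summary, then the task restarts from it.
        summary = summarize(history)
        return [{"role": "system", "content": "Summary of the task so far:\n" + summary}]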

Another thing to keep in mind is that LLMs perform worse as the input gets larger. This is due to a variety of factors (mostly, I think, because there isn't enough training data to saturate the massive context window sizes).

The general graph for LLM context performance looks something like this: https://cobusgreyling.medium.com/llm-context-rot-28a6d039965... https://research.trychroma.com/context-rot

There are a bunch of tests and benchmarks (commonly referred to as "needle in a haystack") that measure LLM performance at large context window sizes, but improving it is still an open area of research.
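
The basic idea of such a test is easy to sketch: bury one fact in a long filler document and check whether the model can retrieve it (ask_llm below is a placeholder for a real model call; the filler and the "needle" are made up):

    def build_haystack(needle, filler_sentences, position):
        filler = ["The sky was a uniform grey that afternoon."] * filler_sentences
        filler.insert(int(position * filler_sentences), needle)
        return " ".join(filler)

    def needle_test(ask_llm, filler_sentences=50_000, position=0.5):
        needle = "The secret launch code is 7421."
        context = build_haystack(needle, filler_sentences, position)
        answer = ask_llm(context + "\n\nWhat is the secret launch code?")
        return "7421" in answer   # repeat at different positions and context sizes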

https://cloud.google.com/blog/products/ai-machine-learning/t...

The thing is, generally speaking, you will get slightly better performance if you can squeeze your whole codebase and problem into the context window, because the LLM gets a "whole picture" view instead of a chain of broken-telephone summaries every few tens of thousands of tokens. Take this with a grain of salt, as the field is changing rapidly and it might not hold in a month or two.

Keep in mind that if the problem you're solving requires saturating the LLM's entire context window, a single request can cost you dollars. And if you're using a 1M+ context window model like Gemini, you can rack up costs fairly rapidly.
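
Back-of-the-envelope math, with a made-up per-token price (check your provider's actual rate card):

    input_price_per_million = 2.50   # USD per 1M input tokens (assumed, not a real quote)
    context_tokens = 1_000_000       # one fully saturated 1M-token request

    cost = context_tokens / 1_000_000 * input_price_per_million
    print(f"~${cost:.2f} per request, before output tokens")
    # An agent making dozens of such calls in one session adds up fast.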