Remix.run Logo
micah94 an hour ago

Just a personal anecdote but I have not hit any more thresholds or limits since switching to the MAX plan and so far, it's been worth it. But I do wonder how long even this will last...

ygjb an hour ago | parent [-]

I think subscription models are sustainable, but longer term, we should probably expect to see more prompt optimization happening in the providers inference pipeline. For example, unless you explicitly tell the agent or API to use a specific model, fronting the inference layer with a caching prompt classifier to determine which model to use, and automatically select the lowest cost model would probably already save alot of money (IDK if Claude/OpenAI do this on the backend, but several services I have worked on do some things like this to reduce costs of delivery customer facing inference at scale).

Majromax an hour ago | parent | next [-]

> fronting the inference layer with a caching prompt classifier to determine which model to use, and automatically select the lowest cost model would probably already save alot of money

Unfortunately, that doesn't work within a single session. The K-V cache of a model is intertwined with the model's configuration. Switching models invalidates the cache, meaning everything up to the point of the switchover is processed like a new, uncached input token.

Per Anthropic's pricing doc, an Opus 4.8 cache hit costs 50¢/MTok, while Haiku costs $1/MTok for uncached input.

Model selection works best if sessions are short and self-contained, particularly if the first few interactions can reliably classify the model need. That probably covers most 'support chatbot' use-cases, but it doesn't describe the kinds of heavy agentic automation that really chews through token budgets.

zozbot234 36 minutes ago | parent | next [-]

> The K-V cache of a model is intertwined with the model's configuration.

I don't think this is true if you simply quantize the model or run it with fewer active experts? The underlying weights would stay the same. You could also play further tricks with skipping some of the model's middle layers outright, which works surprisingly well due to how skip connections are used.

ygjb an hour ago | parent | prev [-]

There is a definite financial incentive for people smarter than me to solve the problem, and I don't generally bet against businesses finding ways to reduce costs :)

wahnfrieden an hour ago | parent | prev [-]

ChatGPT does this and codex will eventually. They’ve stated it’s the future.