ygjb 2 hours ago
I think subscription models are sustainable, but longer term we should probably expect to see more prompt optimization happening in the providers' inference pipelines. For example, unless you explicitly tell the agent or API to use a specific model, fronting the inference layer with a caching prompt classifier that determines which model to use and automatically selects the lowest-cost one would probably already save a lot of money. (IDK if Claude/OpenAI do this on the backend, but several services I have worked on do things like this to reduce the cost of delivering customer-facing inference at scale.)
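A minimal sketch of what such a front-end router could look like, assuming exact-match caching and a keyword heuristic in place of a real classifier. The model names, the `HARD_HINTS` list, and the `classify`/`route` helpers are all hypothetical and not any provider's actual API:

```python
from functools import lru_cache
from typing import Optional

# Hypothetical model names; the first is assumed to be the cheaper one.
CHEAP_MODEL = "small-model"
EXPENSIVE_MODEL = "large-model"

# Hypothetical keywords standing in for a real learned classifier.
HARD_HINTS = ("debug", "refactor", "prove", "architecture")


@lru_cache(maxsize=4096)
def classify(prompt: str) -> str:
    """Return the cheapest model judged adequate; results are cached per prompt."""
    text = prompt.lower()
    needs_large = len(text) > 2000 or any(hint in text for hint in HARD_HINTS)
    return EXPENSIVE_MODEL if needs_large else CHEAP_MODEL


def route(prompt: str, requested_model: Optional[str] = None) -> str:
    """Honor an explicit model choice; otherwise defer to the classifier."""
    if requested_model is not None:
        return requested_model
    return classify(prompt)


if __name__ == "__main__":
    print(route("What are your opening hours?"))
    print(route("Please debug this stack trace"))
    print(route("hi", requested_model=EXPENSIVE_MODEL))
```

Exact-match caching like this only pays off when identical prompts recur; a production service would more plausibly cache on a normalized prompt or an embedding, which is beyond this sketch.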
Majromax 2 hours ago
> fronting the inference layer with a caching prompt classifier to determine which model to use, and automatically select the lowest cost model would probably already save a lot of money

Unfortunately, that doesn't work within a single session. A model's K-V cache is specific to that model's configuration. Switching models invalidates the cache, meaning everything up to the point of the switchover is reprocessed as new, uncached input tokens. Per Anthropic's pricing doc, an Opus 4.8 cache hit costs 50¢/MTok, while Haiku costs $1/MTok for uncached input, so re-reading a long context on the cheaper model can cost more than continuing on the expensive one.

Model selection works best if sessions are short and self-contained, particularly if the first few interactions can reliably classify the model need. That probably covers most 'support chatbot' use cases, but it doesn't describe the kind of heavy agentic automation that really chews through token budgets.
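A rough worked example of the pricing point, using only the two per-MTok prices quoted in this comment (not independently verified) and ignoring output tokens, cache-write surcharges, and the new tokens in the turn itself. The function names are illustrative:

```python
# Prices in USD per million input tokens, as quoted in the comment.
CACHE_HIT_LARGE = 0.50   # cached input on the larger model
UNCACHED_SMALL = 1.00    # uncached input on the smaller model


def context_cost(context_tokens: int, price_per_mtok: float) -> float:
    """Cost of reading the existing context once at the given price."""
    return context_tokens / 1_000_000 * price_per_mtok


def stay_vs_switch(context_tokens: int) -> tuple:
    """Return (cost to stay on the large model, cost to switch to the small one)."""
    stay = context_cost(context_tokens, CACHE_HIT_LARGE)
    switch = context_cost(context_tokens, UNCACHED_SMALL)
    return stay, switch


if __name__ == "__main__":
    stay, switch = stay_vs_switch(200_000)
    print(f"stay: ${stay:.2f}, switch: ${switch:.2f}")
```

Under these assumptions, reading a 200,000-token context costs about $0.10 from the larger model's cache and about $0.20 uncached on the smaller one, so a mid-session switch doubles the input cost of that turn.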
wahnfrieden 2 hours ago
ChatGPT does this, and Codex will eventually. They've stated it's the future.