cubefox 7 hours ago

> I think there is a moderately large market for models like this that aren’t quite SOTA level but can be served up much cheaper.

There isn't; pretty much everyone wants the best of the best.

PhilippGille 7 hours ago | parent | next [-]

The OpenRouter usage stats indicate the opposite: https://openrouter.ai/rankings?view=month

jjice 6 hours ago | parent | next [-]

OpenRouter usage is likely skewed toward LLMs that are more niche and/or self-hostable on solid hardware that's available but that most consumers don't have on hand. I imagine Anthropic and OpenAI LLMs often get called directly through their own APIs instead.

At least in my experience and that of my friends, we use OpenRouter for cases where we want smaller LLMs like Qwen, but when I've used ChatGPT and Claude, I've used those APIs directly.

elbear 3 hours ago | parent | next [-]

I use ChatGPT and Claude on OpenRouter, because it's just easier than buying credits on each platform separately.

senordevnyc 5 hours ago | parent | prev [-]

Same, and my little SaaS is pushing more than 0.1% of the TOTAL volume of tokens on OpenRouter, so the reality is they’re TINY.

vorticalbox 6 hours ago | parent | prev [-]

What happened around Jan this year ('26) that caused such a climb in usage?

wcallahan 4 hours ago | parent [-]

Openclaw

thraxil 6 hours ago | parent | prev | next [-]

No. Right now I'm upset that Google has removed (or at least is in the process of removing) the Gemini 2.0 flash model. We use it for some pretty basic functionality because it's cheap and fast and honestly good enough for what we use it for in that part of our app. We're being forced to "upgrade" to models that are at least 2.5 times as expensive, are slower and, while I'm sure they're better for complex tasks, don't do measurably better than 2.0 flash for what we need. Yay. We've stuck with the GCP/Gemini ecosystem up until now, but this is kind of forcing us to consider other LLM providers.

toofy 4 hours ago | parent | next [-]

This is one of the reasons I'm hearing more and more that people are using open/locally hosted models: so we don't have to waste time redoing everything when a company inevitably decides to pull the rug out from under us and change or remove something integral to our flow. We've seen that countless times over the years, and it seems to be getting more and more common.

Products entirely disappearing or significantly changing will only become more common in the LLM arena as companies shut down, bubbles deflate, brand priorities drastically shift, etc.

I think we're at, or at least close to, a time to really put some thought into which pieces of our flow could be done entirely with an open/local model, and to be honest with ourselves about which pieces truly need SOTA or closed models that may entirely disappear or change. In the long run, putting a little thought into this now will save a lot of headache later.

thraxil 2 hours ago | parent | next [-]

Yeah. Back when Gemma2 came out we benchmarked it and were looking at open models. For our use case though, while the tasks are pretty simple, we do need a pretty large context window and Gemini had a big lead there over the open models for quite a while. I'll probably be evaluating the current batch of open models in the near future though.

jimbokun 4 hours ago | parent | prev [-]

What’s interesting about this is that for previous technologies you could define a standard and demonstrate compliance with interfaces and behavior.

But with LLMs, how do you know switching from one to another won’t change some behavior your system was implicitly relying on?

elbear 3 hours ago | parent | prev [-]

In case you don't know, Gemini 2.5 flash is hosted on DeepInfra. They also have 1.5 flash but not 2.0 flash.

I have no affiliation with DeepInfra. I use them, because they host open-source models that are good.

thraxil 2 hours ago | parent [-]

Thanks. Yeah, for now we're moving to 3.1 flash lite as that's the new cheapest at $.25/1M and is also still "good enough". 2.5 flash is more expensive at $.30/1M (looks like Deep Infra charges the same as GCP/VertexAI for it). I might check them out for Gemma though. We benchmarked Gemma2 when that came out and it wasn't remotely usable for us largely because the context window was way too small. It looks like 3 or 4 might be worth evaluating though.

Someone1234 7 hours ago | parent | prev | next [-]

> There isn't, pretty much everyone wants the best of the best.

For direct user interaction or coding problems, perhaps. But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against datasets, or as sub-agents called from expensive SOTA models.

For example, in Claude, using Opus as an orchestrator that calls Sonnet sub-agents is a popular usage "hack." That only gets more powerful as the Sonnet-equivalent model gets cheaper: you can spawn entire teams of small, specialized sub-agents, each with a small context window and limited scope.
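
A minimal sketch of what that delegation pattern can look like, with the routing reduced to a plain function. The model names and prices below are illustrative placeholders, not Anthropic's actual lineup or pricing:

```python
# Sketch of an orchestrator/sub-agent split: route cheap, well-scoped
# subtasks to a smaller model and reserve the frontier model for planning.
# Model names and prices here are made-up assumptions for illustration.

PRICE_PER_MTOK = {          # hypothetical input prices, $/1M tokens
    "frontier-opus": 15.00,
    "mid-sonnet": 3.00,
    "small-haiku": 0.25,
}

def pick_model(task: dict) -> str:
    """Choose the cheapest model whose capability covers the task."""
    if task.get("needs_planning"):        # cross-cutting reasoning
        return "frontier-opus"
    if task.get("tokens", 0) > 50_000:    # large but mechanical context
        return "mid-sonnet"
    return "small-haiku"                  # narrow, well-specified subtask

def estimated_cost(tasks: list[dict]) -> float:
    """Total input cost if every task goes through the router."""
    return sum(
        task.get("tokens", 0) / 1_000_000 * PRICE_PER_MTOK[pick_model(task)]
        for task in tasks
    )
```

The economics only work if the router sends the bulk of the token volume to the cheap tier, which is exactly why a cheaper Sonnet-class model matters.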

alexsmirnov 6 hours ago | parent | next [-]

Exactly.

I created my own MCP server with custom agents that combine several tools into a single one. For example, WebSearch, WebFetch, and Context7 all exposed as a single "web research" tool, backed by the cheapest model that passes evaluation. The same for codebase research.

Using it with both Claude and Opencode saves a lot of time and tokens.
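
A rough sketch of that consolidation idea, with the backends stubbed out. The tool names and the fan-out behavior are assumptions about the setup described above, not the actual MCP server:

```python
# Sketch of "combine several tools into one": a single web_research
# entry point that fans out to underlying tools and returns one merged
# result. The sub-tool names and stub backends are assumptions.

from typing import Callable

def make_combined_tool(subtools: dict[str, Callable[[str], str]]):
    """Wrap several single-purpose tools behind one entry point, so the
    calling model spends one tool call (and one tool schema's worth of
    context tokens) instead of several."""
    def web_research(query: str) -> dict[str, str]:
        return {name: tool(query) for name, tool in subtools.items()}
    return web_research

# Example wiring with stubbed-out backends:
research = make_combined_tool({
    "web_search": lambda q: f"search results for {q!r}",
    "web_fetch": lambda q: f"fetched pages for {q!r}",
    "context7": lambda q: f"library docs for {q!r}",
})
```

In a real MCP server the merged tool would also be where you drop in the cheap model that summarizes the combined results before they reach the orchestrator.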

hadlock 19 minutes ago | parent [-]

I'd be interested in seeing the source for this if you have a moment

thinkcontext 6 hours ago | parent | prev | next [-]

> But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against data-sets

Seems like a huge waste of money and electricity for processes that can be implemented as a traditional deterministic program. One would hope that tools would identify recurrent jobs that can be turned into simple scripts.

Someone1234 4 hours ago | parent [-]

It depends on the specific task.

For example: "Here is our dataset containing customer feedback comment fields; look through them, draw out themes and associations, and look for trends." Solving that with a deterministic program isn't trivial, and it is likely cheaper to solve via LLM.
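
For a job like that, most of the per-token cost hinges on how you pack rows into prompts; here's a sketch of just the batching side (the prompt wording and batch size are arbitrary assumptions):

```python
# Sketch of preparing a feedback column for a cheap model in batches.
# Prompt text and batch size are illustrative assumptions.

def build_batches(comments: list[str], batch_size: int = 20) -> list[str]:
    """Pack comments into numbered blocks so one API call covers many rows."""
    batches = []
    for i in range(0, len(comments), batch_size):
        block = "\n".join(
            f"{n + 1}. {c}" for n, c in enumerate(comments[i:i + batch_size])
        )
        batches.append(
            "Identify recurring themes in these customer comments:\n" + block
        )
    return batches
```

Each batch would then go to whatever cheap model clears your quality bar, which is where per-token pricing dominates the cost of the whole run.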

jimbokun 4 hours ago | parent | prev [-]

That is a very complex, high-level use case that takes time to configure and orchestrate.

There are many simpler tasks that would work fine with a simpler, local model.

joefourier 7 hours ago | parent | prev | next [-]

Ever hit your daily limit on Claude Code and saw how expensive it is to pay per token?

sidrag22 7 hours ago | parent | prev | next [-]

Maybe there isn't, but as understanding grows, people will see that having an orchestration agent delegate simple work to lesser agents is significant not only for cost savings but also for preserving context window space.

wongarsu 6 hours ago | parent | prev | next [-]

For coding I want the best. But both I and $work do lots of things besides coding where smaller models like qwen3.5-27b work great, at much lower cost.

scoopdewoop 7 hours ago | parent | prev | next [-]

That isn't true. In a Codex or Claude Code instance, sure... but those are not the main users of APIs. If you are using LLMs in a service for customers, costs matter.

Aurornis 7 hours ago | parent | prev | next [-]

The market for API tokens is bigger than people like you and me (who also want the best) using them for code.

There are a lot of data science problems that benefit from running the dataset through an LLM, which becomes bottlenecked on per-token costs. For these you take a sample subset and run it against multiple providers and then do a cost versus accuracy tradeoff.
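
That sample-then-compare step can be sketched in a few lines: score the labeled sample against each provider, then pick the cheapest one that clears an accuracy bar. The provider names, accuracies, and prices below are made-up placeholders:

```python
# Sketch of the cost-vs-accuracy tradeoff after benchmarking a sample.
# All provider names and numbers are hypothetical.

def pick_provider(results: dict[str, dict], min_accuracy: float) -> str:
    """results maps provider -> {'accuracy': float, 'usd_per_mtok': float}.
    Return the cheapest provider meeting the accuracy bar."""
    eligible = {
        name: stats for name, stats in results.items()
        if stats["accuracy"] >= min_accuracy
    }
    if not eligible:
        raise ValueError("no provider meets the accuracy bar")
    return min(eligible, key=lambda name: eligible[name]["usd_per_mtok"])

sample_results = {
    "big-frontier": {"accuracy": 0.97, "usd_per_mtok": 15.00},
    "mid-tier":     {"accuracy": 0.94, "usd_per_mtok": 3.00},
    "small-cheap":  {"accuracy": 0.81, "usd_per_mtok": 0.25},
}
```

Once you're running millions of rows, the gap between the mid-tier and frontier prices is exactly the "moderately large market" the parent comment is describing.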

The market for API tokens is not just people using OpenCode and similar tools.

wolttam 6 hours ago | parent | prev | next [-]

Nope. I get very good results from GLM 5 and 5.1. I’m not working on anything so complex and groundbreaking that I need the best.

Coding is one rung on the ladder of model capability. Frontier models will grow to take on more capabilities, while smaller, more focused models become the economical choice for coding.

regularfry 7 hours ago | parent | prev | next [-]

Everyone may want the best, but the amount of AI-addressable work outstrips the budget available for buying the best by quite a wide margin.

noman-land 7 hours ago | parent | prev | next [-]

OpenCode allows for free inference tho.

wolvoleo 5 hours ago | parent | prev | next [-]

Not really. It depends on the use case. For private stuff I'm very happy to take what was SOTA a year or two ago if I can have it all running in my home and don't have to share any of my data with some sleazy big tech cloud.

The price is a concern too of course. But privacy is a bigger one for me. I absolutely don't trust any of their promises not to use data for training purposes.

esafak 6 hours ago | parent | prev [-]

That's only because current models don't saturate people's needs. Once they are fast and smart enough, people will pick cheaper ones.