Remix clone Hacker News

Yes, I'll add that to the writeup! You're right, we initially excluded it because it depended heavily on the provider, so there was a lot of variance, especially with the Qwen models.

High level results were:

- Qwen 32b => $0.33/1000 pages => 53s/page

- Qwen 72b => $0.71/1000 pages => 51s/page

- Llama 90b => $8.50/1000 pages => 44s/page

- Llama 11b => $0.21/1000 pages => 08s/page

- Gemma 27b => $0.25/1000 pages => 22s/page

- Mistral => $1.00/1000 pages => 03s/page
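
To put those prices on a common footing, here is a rough sketch that expresses each one as a multiple of the cheapest entry. The prices are copied from the list above; the dictionary and function names are only illustrative.

```python
# Prices in USD per 1000 pages, copied from the list above.
PRICE_PER_1000_PAGES = {
    "Qwen 32b": 0.33,
    "Qwen 72b": 0.71,
    "Llama 90b": 8.50,
    "Llama 11b": 0.21,
    "Gemma 27b": 0.25,
    "Mistral": 1.00,
}


def relative_cost(prices):
    """Return each model's price as a multiple of the cheapest price."""
    cheapest = min(prices.values())
    return {model: price / cheapest for model, price in prices.items()}


if __name__ == "__main__":
    multiples = relative_cost(PRICE_PER_1000_PAGES)
    # Print from cheapest to most expensive.
    for model, multiple in sorted(multiples.items(), key=lambda kv: kv[1]):
        print(f"{model:10s} {multiple:5.1f}x the cheapest")
```

On these numbers, Llama 90b comes out at roughly 40 times the price of the cheapest entry (Llama 11b).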

dylan604 10 days ago

One of these things is not like the others. $8.50/1000?? Any chance that's a typo? Otherwise, for someone who has no experience with LLM pricing models, why is Llama 90b so expensive?

int_19h 10 days ago

It's not uncommon when using brokers to see outliers like this. Basically, some models are very popular and have many different providers, so they are priced "close to the metal", since the routing will normally pick the cheapest option that meets the specified requirements (like context size). But other models, typically more specialized ones, are only hosted by a single provider, and that provider can then price them much higher than raw compute cost.

E.g. if you look at https://openrouter.ai/models?order=pricing-high-to-low, you'll see that there are some 7B and 8B models that are more expensive than Claude 3.7 Sonnet.
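
A toy sketch of that routing behavior, for illustration only. The providers, model names, and prices are made up, and `Offer` and `pick_cheapest` are hypothetical names, not any broker's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str                    # hypothetical provider name
    model: str                       # hypothetical model name
    price_per_million_tokens: float  # made-up price in USD
    max_context: int                 # largest context the provider serves


def pick_cheapest(offers, model, min_context) -> Optional[Offer]:
    """Pick the cheapest offer for `model` that meets the context requirement."""
    eligible = [
        o for o in offers
        if o.model == model and o.max_context >= min_context
    ]
    # Returns None when no provider satisfies the requirements.
    return min(eligible, key=lambda o: o.price_per_million_tokens, default=None)


OFFERS = [
    Offer("provider-a", "popular-model", 0.30, 32_000),
    Offer("provider-b", "popular-model", 0.25, 8_000),
    Offer("provider-c", "popular-model", 0.28, 128_000),
    Offer("provider-d", "niche-model", 3.00, 32_000),
]

if __name__ == "__main__":
    print(pick_cheapest(OFFERS, "popular-model", 16_000))
    print(pick_cheapest(OFFERS, "niche-model", 16_000))
```

With several providers competing on the popular model, the router can always fall through to a cheaper one. The niche model has a single offer, so whatever that provider charges is what you pay.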

nickpsecurity 9 days ago

I'll add that some big-name suppliers with big models might be running at or near a loss on purpose to draw in customers. That behavior is often encouraged by funders who gave them over $100 million to capture the market. Their theory is that they can raise prices once their competitors go out of business. The companies open-sourcing pretrained models are countering that. So we see a mix of huge models underpriced by scheming companies and open-source models priced for inference on free-market principles.

themanmaran 10 days ago

That was the cost when we ran Llama 90b using TogetherAI. But it's quite hard to standardize, since it depends a lot on who is hosting the model (e.g. Together, OpenRouter, Groq).

I think in order to run a proper cost comparison, we would need to run each model on an AWS GPU instance and compare the runtime required.
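
A back-of-the-envelope version of that comparison might look like the sketch below. The hourly rate in the example is a placeholder, not a real AWS quote, and the function name and the `pages_in_parallel` parameter are assumptions made for illustration.

```python
def cost_per_1000_pages(hourly_rate_usd, seconds_per_page, pages_in_parallel=1):
    """Estimate the cost of 1000 pages on a self-hosted GPU instance.

    hourly_rate_usd:   instance price per hour (placeholder value, not a quote)
    seconds_per_page:  measured runtime for one page
    pages_in_parallel: pages the instance processes concurrently (assumed)
    """
    if pages_in_parallel < 1:
        raise ValueError("pages_in_parallel must be at least 1")
    # Total instance time needed for 1000 pages, in seconds.
    instance_seconds = seconds_per_page * 1000 / pages_in_parallel
    # Convert seconds to hours and multiply by the hourly rate.
    return hourly_rate_usd * instance_seconds / 3600


if __name__ == "__main__":
    # Hypothetical $3.60/hour instance, 10 seconds per page, one page at a time.
    print(f"${cost_per_1000_pages(3.60, 10):.2f} per 1000 pages")
```

The same hourly rate applied to each model's measured runtime would give costs that depend only on the model, not on how a particular host chooses to price it.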

esafak 10 days ago

A 2D plot would be great.