jurgenburgen 5 days ago

If models become more efficient we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLM.

rainsford 5 days ago | parent | next [-]

We moved from the mainframe era to desktops and smaller servers because computers got fast enough to do what we needed them to do locally. Centralized computing resources are still vastly more powerful than what's under your desk or in a laptop, but it doesn't matter because people generally don't need that much power for their daily tasks.

The problem with AI is that it's not obvious what the upper limit of capability demand might be. And until or if we get there, there will always be demand for the more capable models that run on centralized computing resources. Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

lelanthran 5 days ago | parent | next [-]

> Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

Only if it's competitively priced. You wouldn't want to use the SaaS if the breakeven on an investment in local instances is a matter of months.

Right now people are shelling out for Claude Code and similar because for $200/m they can consume $10k/m worth of tokens. If you were actually paying $10k/m, then it would make sense to splurge $20k-$30k on a local instance.
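The math behind that is a straightforward breakeven calculation; a quick sketch using the figures above (the $10k/m token spend and $20k-$30k hardware price are the parent's numbers):

```python
# Months until a local rig pays for itself, relative to what you would
# otherwise spend on API tokens each month.
def breakeven_months(hardware_cost: float, monthly_token_spend: float) -> float:
    return hardware_cost / monthly_token_spend

print(breakeven_months(20_000, 10_000))  # 2.0 months at actual token prices
print(breakeven_months(30_000, 10_000))  # 3.0 months at actual token prices
print(breakeven_months(30_000, 200))     # 150.0 months at the $200/m plan
```

Against the subsidized $200/m plan the same rig takes over twelve years to pay off, which is exactly why the subscription wins for as long as it's offered.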

zozbot234 5 days ago | parent | prev [-]

The underlying advantage of local inference is that you're repurposing your existing hardware for free. You don't need your token spend to pay a share of the capex cost for datacenters that are large enough to draw gigawatts in power, you can just pay for your own energy use. Even though the raw energy cost per operation will probably be higher for local inference, the overall savings in hardware costs can still be quite real.

throwatdem12311 5 days ago | parent | prev | next [-]

The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

Imustaskforhelp 5 days ago | parent | next [-]

> of circular fake money

Oh, it gets worse than that. The money behind all of this from OpenAI was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project), and the Japanese banks can lend it because of deposits from Japanese people and companies, while the collateral is stock whose value is inflated by people investing their hard-earned money in the markets.

So in a way they are using real, hard-earned money to fund all of this; they are using your money to attack you behind your back.

I once wrote a really long comment about the shaky finances of Stargate; I'll suggest it here: https://news.ycombinator.com/item?id=47297428

joquarky 5 days ago | parent [-]

What is the difference between "hard earned" and not?

Imustaskforhelp 5 days ago | parent [-]

Well, cartel money for example. It depends on the definition of hard-earned, but I don't imagine the Japanese Yakuza depositing into banks or stock markets; I'd guess something more like gold or cash gets used.

Maybe you can argue that the Yakuza earn their money the hard way, but IMO they run illegal activities from within the law, doing something much closer to extortion.

Ironically, what AI did is, in a sense, also extortion.

One is just legal (barely; I'm not even sure how or why), the other isn't. That's what I intended to highlight when I said hard-earned money.

topspin 5 days ago | parent | prev | next [-]

> they will spend infinite amounts of circular fake money

> forever

If that's the plan (there is no plan) then it expires at some point, because it's a spiral and such spirals always bottom out.

throwatdem12311 5 days ago | parent [-]

And when that happens people STILL won’t be able to afford the hardware.

lelanthran 5 days ago | parent [-]

> And when that happens people STILL won’t be able to afford the hardware.

Of course they will - if that happens all these AI token providers won't have a use for all that hardware they bought. You'll be buying used H100s and H200s off eBay for pennies on the dollar.

throwatdem12311 5 days ago | parent [-]

No they won’t they’re just going to get absorbed into Azure and AWS and used for generic GPU compute that you rent until they’re burned out trash.

Dylan16807 4 days ago | parent [-]

Then those datacenters will barely need any new GPUs, so the companies making them will be desperate to get gamers to buy cards and set very competitive prices.

naasking 5 days ago | parent | prev [-]

> and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.

lelanthran 5 days ago | parent | next [-]

> That's ridiculous, "infinite money" isn't a thing.

My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.

If A hands B a $100 investment, and B then hands A $100 for a hardware purchase, then A's equity in B is, on paper, $100, plus A has $100 of revenue (from B), giving A $200 of total reported assets, even though no net cash has changed hands.

Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
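The shuffle can be written out as a toy ledger (the two parties and the $100 figure are the hypothetical ones from the comment, not any real deal):

```python
# A invests $100 in B; B immediately spends that $100 buying A's hardware.
# Track A's books through the round trip.
a = {"cash": 0, "equity_in_B": 0, "revenue": 0}

a["cash"] -= 100          # A wires the investment to B
a["equity_in_B"] += 100   # ...and books a $100 stake in B

a["cash"] += 100          # B pays A for hardware
a["revenue"] += 100       # ...which A books as revenue

print(a["cash"])                        # 0: no net cash ever left the loop
print(a["equity_in_B"] + a["revenue"])  # 200: yet $200 appears on paper
```

Real deals route through more intermediaries and asset classes, but the mechanism is the same: headline assets and revenue grow while net cash flow stays near zero.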

throwatdem12311 5 days ago | parent | prev | next [-]

Cheaper for who? For them maybe but certainly not for you or me.

5 days ago | parent | prev [-]
[deleted]
DeathArrow 5 days ago | parent | prev | next [-]

I don't think we are there yet. Models running in data centers will still be noticeably better as efficiency will allow them to build and run better models.

Not many people today would be happy with models comparable to what was SOTA two years ago.

To run models locally with results as good as the models running in data centers, we need both that efficiency and for AI improvement to hit a wall.

Neither condition looks likely to hold in the near future.

delecti 5 days ago | parent | prev | next [-]

As I understand this advancement, this doesn't let you run bigger models, it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.

Though I'm not an expert, maybe my understanding of the memory allocation is wrong.

dd8601fn 5 days ago | parent [-]

Seems to me that if the model and the KV cache are competing for the same pool of memory, then massively compressing the cache necessarily means more RAM available for a larger model (if it fits), no?

delecti 5 days ago | parent [-]

Yes, but the context is a comparatively small part of the memory used when running locally for a single user, versus running on a server for public... serving.
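Some napkin numbers make the point concrete. For a hypothetical Llama-style model with grouped-query attention (every config value below is an illustrative assumption, not any specific model's), the KV cache stores one key and one value vector per layer per token:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values; fp16 (2 bytes) per element by default
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed config: 48 layers, 8 KV heads, head_dim 128, a full 32k context
cache = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=32_768)
print(cache / 2**30)  # 6.0 GiB for a single user's context
```

Set against, say, ~16 GiB of weights for a 32B model quantized to 4 bits, one user's cache is the smaller share; but a provider batching hundreds of concurrent requests pays that cache per request, which is why compressing it helps datacenters more than desktops.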

mustyoshi 5 days ago | parent | prev | next [-]

I don't see how we'll ever get to widespread local LLM.

The power efficiency alone is a strong enough pressure to use centralized model providers.

My 3090 running 24b or 32b models is fun, but I know I'm paying way more per token in electricity, on top of lower quality tokens.

It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.

leptons 5 days ago | parent | next [-]

AI is not cheap to run no matter where it is running. The price we get charged today for AI is a loss-leader. The actual cost is much higher, so much higher that the average paying user today would balk at what it actually costs to run. These AI companies are trying to get people hooked on their product, to get it integrated into every business and workflow that they can, then start raising prices.

singpolyma3 5 days ago | parent | prev [-]

Until you put up your solar and then power is almost free...

vidarh 5 days ago | parent [-]

The amortised cost including the panels and labour is nowhere near "almost free".

boredatoms 5 days ago | parent [-]

It is over a couple of years

vidarh 5 days ago | parent [-]

Even if you live somewhere where it does pay back in a couple of years, that is still not remotely "almost free", and in a lot of places the payback period is more like 10-15 years even with subsidies.
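The payback arithmetic is easy to check; in this sketch the system price, annual yield, and electricity price are all illustrative assumptions, not quotes:

```python
def payback_years(system_cost, annual_kwh, price_per_kwh):
    # Years for avoided grid purchases to repay the install cost
    return system_cost / (annual_kwh * price_per_kwh)

# e.g. an $18,000 install producing 9,000 kWh/yr against $0.15/kWh power
print(round(payback_years(18_000, 9_000, 0.15), 1))  # 13.3 years
```

Getting to "a couple of years" requires an unusually cheap install combined with unusually expensive grid power; with middling numbers like these you land in the 10-15 year range.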

Ray20 5 days ago | parent | prev | next [-]

> If models become more efficient

Then we can make them even bigger.

Imustaskforhelp 5 days ago | parent [-]

> Then we can make them even bigger.

But what if, for most intents and purposes, small models become "good enough"?

There are some people here and on r/LocalLLaMA who run small models, sometimes several of them at once, to solve and iterate quickly, with a larger model plugged in to fix whatever remains.

Larger/SOTA models would still see some demand, but I don't think nearly as much as people assume. We all still feel that different models are good for different tasks, and a good recommendation is to benchmark models against your own use cases; sometimes there's a small model that's good within your particular domain and worth keeping in your toolset.

Ray20 5 days ago | parent | next [-]

> But what if it becomes "good enough", that for most intents and purposes, small models can be "good enough"

It's simple: then we'll make our intents and purposes bigger.

Almondsetat 5 days ago | parent | prev [-]

Because the true goal is AGI, not just nice little tools that solve subsets of problems. The first company to achieve human-level intelligence will be able to self-improve at such a rate as to create a gigantic moat.

jurgenburgen 2 days ago | parent | next [-]

There’s no evidence that the current architectures will reach AGI levels.

Of course OpenAI wants you to think they will rule the world but if we’ve reached the plateau of LLM capabilities regardless of the amount of compute we throw at them then local models will soon be good enough.

9rx 5 days ago | parent | prev | next [-]

> The first company which can achieve human level intelligence will just be able to...

They say prostitution is the oldest industry of all. We know how to achieve human-level intelligence quite well. The outstanding challenge is figuring out how to produce an energy efficient human-level intelligence.

Dylan16807 4 days ago | parent | prev [-]

There's no particular reason to assume a human level AI would be able to improve itself any better than the thousands of human level humans that designed it.

Almondsetat 4 days ago | parent [-]

Sure, but: that single human-level AI with the intelligence of a top-tier engineer or scientist will have immediate access to all human knowledge. Plus, what do you think happens the moment it optimizes itself to run in 2, 4, 8, 16, etc. parallel instances?

Dylan16807 4 days ago | parent [-]

Well, A) "top tier engineer/scientist" is a significant step above generic human, B) the human engineers/scientists also have immediate access to the same database, C) The humans have been optimizing it for even longer, so what makes us think the AI can optimize itself even a couple percent?

For example, if the number of AIs you can run per petaflop started to scale with the cube root of researcher-years, then even if your researcher AIs are quite fast and you can double your density in a couple years, hitting 5x will take a decade and hitting 10x will approach half a century.
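One toy way to recover those figures (my assumptions, not the parent's derivation): if density per petaflop goes as the cube root of cumulative researcher-years R, and R accrues at a rate proportional to current density (the self-improvement feedback), then R grows like t^(3/2) and density like √t, so the time to reach an m× density scales with m²:

```python
def years_to_density(multiple, doubling_time=2.0):
    # density ~ sqrt(t)  =>  t(m) = doubling_time * (m / 2) ** 2,
    # calibrated so a 2x density improvement takes `doubling_time` years
    return doubling_time * (multiple / 2) ** 2

print(years_to_density(5))   # 12.5 years -> roughly "a decade"
print(years_to_density(10))  # 50.0 years -> approaching "half a century"
```

The curves are made up, of course; the point is only that cube-root returns swallow even a compounding feedback loop.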

acuozzo 5 days ago | parent | prev | next [-]

But what about The Jevons Paradox?

ssyhape 5 days ago | parent | prev [-]

[flagged]

lucasfin000 5 days ago | parent [-]

MoE feels a lot more like engineering to me. You're routing around the problem rather than actually solving it. The real math gains are things like quantization schemes that change how information is actually represented. Whether that distinction matters long term will probably depend on whether we hit a capability wall first or an efficiency ceiling first.