jauntywundrkind 10 days ago

I agree and disagree. Many of the best models are open source, just too big to run for most people.

And there are plenty of ways to fit these models! A Mac Studio M3 Ultra with 512GB of unified memory has huge capacity and a decent chunk of bandwidth (800GB/s; compare to a 5090's ~1800GB/s). $10k is a lot of money, but the ability to fit these very large models & get quality results is very impressive. Performance is even lower, but a single AMD Turin chip with its 12 channels of DDR5-6000 can get you to almost 600GB/s: a 12x 64GB (768GB) build is gonna be $4000+ in RAM costs, plus, for example, $4800 for a 48-core Turin to go with it. (But if you go to older generations, affordability goes way up! It's a special part, but the 48-core 7R13 is <$1000.)
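
Back-of-envelope for those bandwidth figures, in Python (these are peak numbers; sustained bandwidth will be lower, and the DDR5-6000 speed and channel counts are just the assumptions from the build sketch above):

    # Peak DRAM bandwidth = transfer rate (MT/s) x 8 bytes per channel x channels.
    def ddr5_peak_gb_s(mt_per_s, channels, bytes_per_channel=8):
        return mt_per_s * bytes_per_channel * channels / 1000  # GB/s

    print(ddr5_peak_gb_s(6000, 12))  # 12-channel Turin: 576 GB/s peak
    print(ddr5_peak_gb_s(6000, 2))   # ordinary dual-channel desktop: 96 GB/s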

Still, those costs come to $5000 at the low end, and come with much lower tokens/s. The "grid compute" / "utility compute" / "cloud compute" model of getting work done on someone else's hot GPU that already has the model loaded is very, very direct & clear. And these are very big investments. It's just not likely any of us will have anything but burst demands for GPUs, so structurally it makes sense. But it really feels like only a few small things are getting in the way of running big models at home!

Strix Halo is kind of close. 96GB of usable memory isn't quite enough to really do the thing though (and only 256GB/s). Even if/when they put the new 64GB DDR5 modules onto the platform (for 256GB, let's say 224 usable), one still has to sacrifice some quality to fit 400B+ models. Next-gen Medusa Halo is not coming for a while, but it goes from 4 -> 6 channels, so 384GB total: not bad.
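
Rough weights-only math for why those capacities matter, assuming a dense 400B-parameter model (KV cache and activations need room on top of this):

    # Memory needed just for the weights of a 400B-parameter model.
    params = 400e9
    for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
        gb = params * bits / 8 / 1e9
        print(f"{name}: {gb:.0f} GB")
    # fp16: 800 GB, 8-bit: 400 GB, 4-bit: 200 GB -- so 96GB (or even 224GB)
    # forces heavy quantization, while 384-512GB starts to fit comfortably.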

(It sucks that PCIe is so slow. PCIe 5.0 x16 is only 64GB/s in one direction. Compared to the need here, it's nowhere near enough to pair a big-memory host with a smaller-memory GPU.)
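
For scale, here's the approximate one-direction bandwidth of an x16 link by generation (ignoring encoding/protocol overhead, so real throughput is a bit lower):

    # Per-lane transfer rate (GT/s) x 16 lanes / 8 bits per byte ~= GB/s one way.
    gens = {"3.0": 8, "4.0": 16, "5.0": 32, "6.0": 64, "7.0": 128}
    for gen, gt_s in gens.items():
        print(f"PCIe {gen} x16: ~{gt_s * 16 / 8:.0f} GB/s per direction")
    # PCIe 5.0 x16 is ~64 GB/s -- tiny next to the hundreds of GB/s the model
    # weights demand per token, hence host-memory-plus-small-GPU doesn't work well.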

Aurornis 9 days ago | parent | next [-]

> Many of the best models are open source, just too big to run for most people.

You can find all of the open models hosted across different providers. You can pay per token to try them out.

I just don't see the open models as being at the same quality level as the best from Anthropic and OpenAI. They're good but in my experience they're not as good as the benchmarks would suggest.

> $10k is a lot of money, but that ability to fit these very large models & get quality results is very impressive.

This is why I only appreciate the local LLM scene from a distance.

It’s really cool that this can be done, but $10K to run lower quality models at slower speeds is a hard sell. I can rent a lot of hours on an on-demand cloud server for a lot less than that price or I can pay $20-$200/month and get great performance and good quality from Anthropic.

I think the local LLM scene is fun where it intersects with hardware I would buy anyway (MacBook Pro with a lot of RAM) but spending $10K to run open models locally is a very expensive hobby.

jstummbillig 10 days ago | parent | prev | next [-]

> Many of the best models are open source, just too big to run for most people

I don't think that's a likely future, when you consider all the big players doing enormous infrastructure projects and the money that this increasingly demands. Powerful LLMs are simply not a great open source candidate. The models are not a by-product of the bigger thing you do. They are the bigger thing. Open sourcing an LLM means you are essentially investing money just to give it away. That simply does not make a lot of sense from a business perspective. You can do it in a limited fashion for a limited time, for example while you are scaling, or when it's not really your core business and you just write it off as an expense while you try to figure yet another thing out (looking at you, Meta).

But with the current paradigm, one thing seems to be very clear: building and running ever bigger LLMs is a money-burning machine the likes of which we have rarely if ever seen, and operating that machine at a loss will make you run out of any amount of money really, really fast.

esseph 9 days ago | parent | prev | next [-]

https://pcisig.com/pci-sig-announces-pcie-80-specification-t...

From 2003 to 2016, 13 years, we had PCIe 1.0, 2.0, and 3.0.

2017 - PCIe 4.0

2019 - PCIe 5.0

2022 - PCIe 6.0

2025 - PCIe 7.0

2028 - PCIe 8.0

Manufacturers and vendors are having a hard time keeping up. And the PCIe 5.0 memory is... not always the most stable.

dcrazy 9 days ago | parent | next [-]

Are you conflating GDDR5x with PCIe 5.0?

esseph 9 days ago | parent [-]

No.

I'm saying we're due for faster memory, but we seem to be having trouble scaling bus speeds (in production) and keeping memory reliable. And the network is changing a lot, too.

It's a neverending cycle I guess.

dcrazy 9 days ago | parent [-]

One advantage of Apple Silicon is the unified memory architecture. You put memory on the fabric instead of on PCIe.

jauntywundrkind 9 days ago | parent | prev [-]

Thanks for the numbers. Valuable contribution for sure!!

There's been a huge lag in PCIe adoption, and imo so much of it has boiled down to "do people need it?"

In the past 10 years I feel like my eyes have been opened: every high-tech company's greatest, most compelling desire is to slow-walk releases. To move as slowly as the market will bear, to do as little as possible, to roll on and on with minor incremental changes.

There are cannonball moments where the market is disrupted. Thank the fucking stars Intel got sick of all this shit and worked hard (with many others) to standardize NVMe, to make a post-SATA world with higher speeds & a better protocol. The AMD64 architecture changed the game. Ryzen again. But so much of the industry is about retaining your cost advantage, about preserving strong market segmentation: by never shipping platforms with too many PCIe lanes, by limiting consumer vs workstation vs server video card RAM and vGPU (and MxGPU) and display-out capabilities, often entirely artificially.

But there is a fucking fire right now and everyone knows it. NVLink is massively more bandwidth and massively more efficient, and it's essential to system performance. The need to get better, fast, is so on. Seems like for now SSDs will keep slow-walking their 2x's. But PCIe is facing a real crisis of being replaced, and everyone wants better. And hates, hates, hates the insane cost. PCIe 8.0 is going to be an insane amount of data to push over a differential pair, insane speed. But we have to.

Alas, PCIe is also hampered by its relatively generous broader system design. The trace distances are going to shrink and signal requirements increase a lot. Needing an intercompatible compliance program for any peripheral to work is a significant disadvantage versus just making a point-to-point link work between two specific cards.

There are so many energies happening right now in interconnect. I hope we see some actual uptake, some day. We've waited so long through Gen-Z (Ethernet PHY, gone now), CXL (3.x being switched, still un-arrived), now Ultra Ethernet and UALink. Man, I hope we can see some step improvements. Everyone knows we are in deep shit if NV alone can connect systems. Ironically AMD's HyperTransport was open, was a path towards this, but now Infinity Fabric is an internal-only thing, and as a brand & an idea it's kind of vanishing from the world; it feels insufficient.

esseph 9 days ago | parent [-]

All of these extremely high end technologies are so far away from hitting the consumer market.

Is there any desire for most people? What's the TAM?

jauntywundrkind 9 days ago | parent | next [-]

Classic economics thinking: totally fucked "faster horses" thinking.

The addressable market depends on the advantage. Which right now: we don't know. It's all a guess that someone is going to find it valuable, and no one knows.

But if we find out we didn't actually need $700 NICs to get shitty bandwidth, if we could have just been running cables from PCIe-shaped slot to PCIe-shaped slot (or OCuLink port!) and getting >>10x the performance with >>10x less latency? Yeah bro, uhh, I think there might be a desire for using the same fucking chip we already use but getting 10x + 10x better out of it.

Faster lower latency cheaper storage? RAM expandability? Lower latency GPU access? There's so much that could make a huge difference for computing, broadly.

justincormack 9 days ago | parent | next [-]

Thunderbolt tunnels PCIe, and you can use it as a NIC in effect, with one cable between devices. It's slower than OCuLink but more convenient.

esseph 8 days ago | parent | prev [-]

I am very ready for optical bus lfg

nemomarx 9 days ago | parent | prev [-]

Probably a small consumer market of enthusiasts (notice Nvidia barely caters to gaming hardware lately), but if you can get better memory throughput on servers, isn't that a large industry market?

Rohansi 9 days ago | parent | prev | next [-]

You'll want to look at benchmarks rather than the theoretical maximum bandwidth available to the system. Apple has been using bandwidth as a marketing point, but you're not always able to use all of it, depending on your workload. For example, the M1 Max has 400GB/s of advertised bandwidth, but the CPU and GPU combined cannot utilize all of it [1]. This means Strix Halo could actually be better for LLM inference than Apple Silicon if it achieves better bandwidth utilization.

[1] https://web.archive.org/web/20250516041637/https://www.anand...
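
A crude way to turn effective bandwidth into decode speed: token generation is usually memory-bandwidth-bound, so each token has to stream the active weights once. The numbers below are hypothetical (120B active parameters at 4-bit is just an illustrative assumption, not a specific model), and real effective bandwidth is what matters, per the benchmark caveat above:

    # Upper bound: tokens/s ~= effective bandwidth / bytes of active weights.
    def max_tokens_per_s(effective_gb_s, active_params_billions, bits):
        weight_gb = active_params_billions * bits / 8
        return effective_gb_s / weight_gb

    print(max_tokens_per_s(250, 120, 4))  # ~4 tok/s at Strix-Halo-like bandwidth
    print(max_tokens_per_s(600, 120, 4))  # ~10 tok/s at 12-channel-Turin-like bandwidth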

vFunct 9 days ago | parent | prev [-]

The game-changer technology that'll enable full 1TB+ LLM models for cheap is Sandisk's High Bandwidth Flash. Expect devices with it in about 3-4 years, maybe even in cellphones.

jauntywundrkind 9 days ago | parent [-]

I'm crazy excited for High Bandwidth Flash, and I really hope they pull it off. There is a huge caveat: only having a couple hundred or thousand r/w cycles before your multi-$k accelerator stops working!! A pretty big constraint!

But as long as you're happy to keep running the same model, the wins here for large capacity & high bandwidth are sick! And the affordability could be exceptional! (If you can make flash with a hundred or so channels at a decent price!)