uf00lme 2 days ago

Woah, is this part of the future of models? Basically little models you can use as tools.

2ndorderthought 2 days ago | parent | next [-]

It's looking like running your own mini ecosystem is the way of the future to me. No data centers, just a decent GPU with 16-24 GB of VRAM, a CPU, and 32 GB of RAM.

Lalabadie 2 days ago | parent [-]

This is Apple's bet, among others.

Training purpose-specific miniature models lets you run a lot of tasks with high confidence on consumer hardware.

twoodfin 2 days ago | parent [-]

Or on a commodity EC2 instance with a relatively cheap inference sidecar.

tonyarkles 2 days ago | parent | prev | next [-]

https://www.docling.ai/

I don’t know how many different little models this uses under the hood, but I was shocked at how good it was at the couple of document extraction tasks I threw at it.
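
A minimal sketch of that kind of extraction with docling's Python API (the DocumentConverter entry point is from docling's docs; the input path is illustrative):

    # pip install docling
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert("report.pdf")  # illustrative input file

    # Export the parsed layout, tables, and reading order as Markdown
    print(result.document.export_to_markdown())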

SecretDreams 2 days ago | parent | prev | next [-]

Eventually we'll have models small enough to do a single thing really well and we'll call them functions.

hathym 2 days ago | parent [-]

True, if you can write a function that summarizes an article, for example.
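
A rough sketch of that idea, assuming llama-cpp-python and a hypothetical small GGUF model (the path and prompt are illustrative): the model call hides behind an ordinary function signature.

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Hypothetical purpose-specific summarization model
    llm = Llama(model_path="summarizer-1b.Q4_K_M.gguf", n_ctx=4096, verbose=False)

    def summarize(article: str, max_tokens: int = 128) -> str:
        """Call the local model like any other function."""
        out = llm(
            "Summarize the following article in two sentences:\n\n"
            f"{article}\n\nSummary:",
            max_tokens=max_tokens,
        )
        return out["choices"][0]["text"].strip()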

cyanydeez 2 days ago | parent | prev [-]

I'm pretty sure there's someone somewhere who'll create a proper harness that's equivalent to one giant model. The difficulty is mostly that local hardware has a lot of memory constraints. Targeting 128GB would seem to be the current sweet spot. If the corporate market movers weren't buying up all the memory, we could maybe have more.
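
One hedged sketch of what such a harness could look like: a tiny router picks a specialist model per request, so the ensemble presents as one big model. Everything here is hypothetical (run_model and classify are stubs standing in for a local inference engine and a small classifier model):

    from typing import Callable

    def run_model(name: str, prompt: str) -> str:
        # Stub: in practice, call a local inference engine (e.g. llama.cpp)
        return f"[{name}] response to: {prompt}"

    def classify(query: str) -> str:
        # Stub router: in practice a small classifier model; keywords here
        if "def " in query or "code" in query.lower():
            return "code"
        if any(c.isdigit() for c in query):
            return "math"
        return "general"

    # Hypothetical specialists, each small enough to fit in local VRAM
    SPECIALISTS: dict[str, Callable[[str], str]] = {
        "code": lambda q: run_model("coder-7b", q),
        "math": lambda q: run_model("math-7b", q),
        "general": lambda q: run_model("chat-8b", q),
    }

    def answer(query: str) -> str:
        handler = SPECIALISTS.get(classify(query), SPECIALISTS["general"])
        return handler(query)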

Regardless, the kind of pruning people did in the '80s to fit programs onto small devices is likely happening now. I'd bet most of the Chinese firms are doing it because of the US's silly GPU games, among other constraints.

nickpsecurity 2 days ago | parent [-]

What needs to happen is for companies (or individuals) tired of that to pool money together to build new memory products. Then sell them to consumers first, and for non-AI use. If not that, then round-robin scheduling of quantities so the units are spread around more.

If costs are high, they might reserve a certain percentage for big business at market prices (or just under) to cover the chip's mask costs.

After DDR5+ RAM, then GDDR5/GDDR6 RAM for use with AI accelerators. They might try to jump right in on an HBM alternative. That could be the percentage reserved for AI buyers I just mentioned, especially if they could put 40-80 GB on accelerators like Intel's Arc.

If successful enough, they could license MIPS' gaming GPUs to combine with this stuff, with a full open-source stack and RTOS support for military sales.

Tuna-Fish a day ago | parent [-]

Time for my daily "HBF is coming" comment.

The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512GB stack is expected to cost about $200, and provide 1.6TB/s of reads.

You still need some fast DRAM for the KV cache and for activations, but weights should be sitting on flash.
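
A back-of-envelope check on those numbers (my arithmetic from the figures above): if a dense model fills the stack, every generated token has to stream all the weights once, so read bandwidth bounds the token rate.

    # Figures from the comment above: 512 GB stack, 1.6 TB/s reads
    STACK_BYTES = 512e9
    READ_BW = 1.6e12  # bytes/sec

    # Dense model filling the stack: each token reads all weights once
    print(READ_BW / STACK_BYTES)         # ~3.1 tok/s

    # MoE with 1/16 of weights active per token (assumed sparsity)
    print(READ_BW / (STACK_BYTES / 16))  # ~50 tok/s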

zozbot234 a day ago | parent | next [-]

Reading from flash is too power-intensive compared to DRAM; this is why flash offload isn't used in the data center today. Flash is also prone to wearing out quickly, so ephemeral data like the KV cache can't really be stashed there. Unless your model has an unprecedented level of sparsity, I just don't see how HBF could ever be useful.

Tuna-Fish 20 hours ago | parent [-]

Currently available flash is obviously unusable. HBF is not that.

The reason HBF is (about to be) a thing is that flash manufacturers realized that if you heavily optimize flash for read throughput and energy, as opposed to density, you can match DRAM on throughput and get to within 2x on energy, at the cost of half your density. That would make the density still ~50 times better than DRAM, built on a cheap mass-produced process. All manufacturers are chasing this hard right now, with first samples to arrive later this year.

You are correct that it would absolutely not be used for any mutable data, only weights in inference. This is both because there is insufficient endurance (expected to be ~hundreds of drive writes total), but also because it will be very slow to write compared to the read speed. A single HBF stack is expected to provide 1.6TB/s reads, and single-digit GB/s writes. That's why I wrote the last sentence of my post that you replied to.
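
To put the write side in perspective (my arithmetic; 5 GB/s and 300 lifetime writes are assumed midpoints of "single-digit GB/s" and "hundreds of drive writes"):

    STACK_BYTES = 512e9
    WRITE_BW = 5e9          # assumed from "single-digit GB/s"
    LIFETIME_WRITES = 300   # assumed from "~hundreds of drive writes"

    # Loading a fresh set of weights onto the stack
    print(STACK_BYTES / WRITE_BW / 60)  # ~1.7 minutes per full load

    # Fine for occasional model swaps; hopeless for mutable KV-cache data
    print(LIFETIME_WRITES)              # total full rewrites before wear-out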

nickpsecurity a day ago | parent | prev [-]

You're thinking in a provably-useful direction:

https://arxiv.org/pdf/2312.11514

Tuna-Fish 20 hours ago | parent [-]

HBF is not that. The paper you linked is about how to use flash memory that exists to boost LLM performance, with all kinds of optimization tricks. HBF is about making flash memory that doesn't require any of those tricks, and just has the read throughput that's needed for inference.