One. Trillion. Even on native int4 that’s… half a terabyte of vram?!

Technical awe at this marvel aside that cracks the 50th percentile of HLE, the snarky part of me says there’s only half the danger in giving something away nobody can run at home anyway…

▲

johndough 4 hours ago | parent | next [-]

The model absolutely can be run at home. There even is a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/

The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).

The next step up is an old server with lots of RAM and many memory channels with maybe a GPU thrown in for faster prompt processing (low two digits tokens/second).

At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.

The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16GB at 4 bit per parameter. This only leaves the question of how to get those 16GB to the processor as fast as possible.

▲

WhitneyLand an hour ago | parent | next [-]

Its often pointed out in the first sentence of a comment how a model can be run at home, then (maybe) towards the end of the comment it’s mentioned how it’s quantized.

Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible.

The degree of quality loss is not often characterized. Which makes sense because it’s not easy to fully quantify quality loss with a few simple benchmarks.

By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?

	▲	FuckButtons 25 minutes ago \| parent \| next [-]
		From my own usage, the former is almost always better than the latter. Because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that seem still drunk. Any model that I can run in 128 gb in full precision is far inferior to the models that I can just barely get to run after reap + quantization for actually useful work. I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to try to force the model to reach a smoother loss landscape, it made me wonder if something similar might work for llms, which I think might be what the people over at minimax are doing with m2.1 since they released it in fp8. In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.
	▲	selfhoster11 43 minutes ago \| parent \| prev [-]
		Except the parent comment said you can stream the weights from an SSD. The full weights, uncompressed. It takes a little longer (a lot longer), but the model at least works without lossy pre-processing.

▲

1dom 2 hours ago | parent | prev | next [-]

> The model absolutely can be run at home. There even is a big community around running large models locally

IMO 1tln parameters and 32bln active seems like a different scale to what most are talking about when they say localLLMs IMO. Totally agree there will be people messing with this, but the real value in localLLMs is that you can actually use them and get value from them with standard consumer hardware. I don't think that's really possible with this model.

▲

zamadatix an hour ago | parent | next [-]

Local LLMs are just LLMs people run locally. It's not a definition of size, feature set, or what's most popular. What the "real" value is for local LLMs will depend on each person you ask. The person who runs small local LLMs will tell you the real value is in small models, the person who runs large local LLMs will tell you it's large ones, those who use cloud will say the value is in shared compute, and those who don't like AI will say there is no value in any.

LLMs which the weights aren't available are an example of when it's not local LLMs, not when the model happens to be large.

	▲	1dom 30 minutes ago \| parent [-]
		> LLMs which the weights aren't available are an example of when it's not local LLMs, not when the model happens to be large. I agree. My point was that most aren't thinking of models this large when they're talking about local LLMs. That's what I said, right? This is supported by the download counts on hf: the most downloaded local models are significantly smaller than 1tln, normally 1 - 12bln. I'm not sure I understand what point you're trying to make here?

▲

zozbot234 2 hours ago | parent | prev | next [-]

32B active is nothing special, there's local setups that will easily support that. 1T total parameters ultimately requires keeping the bulk of them on SSD. This need not be an issue if there's enough locality in expert choice for any given workload; the "hot" experts will simply be cached in available spare RAM.

▲

spmurrayzzz an hour ago | parent | next [-]

When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).

Over any task that has enough prefill input diversity and a decode phase thats more than a few tokens, its at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why when you do something more than bs=1, you see forward passes light up the whole network.

	▲	zozbot234 an hour ago \| parent [-]
		> hitting the SSD constantly to swap layers Thing is, people in the local llm community are already doing that to run the largest MoE models, using mmap such that spare-RAM-as-cache is managed automatically by the OS. It's a drag on performance to be sure but still somewhat usable, if you're willing to wait for results. And it unlocks these larger models on what's effectively semi-pro if not true consumer hardware. On the enterprise side, high bandwidth NAND Flash is just around the corner and perfectly suited for storing these large read-only model parameters (no wear and tear issues with the NAND storage) while preserving RAM-like throughput.

▲

1dom 2 hours ago | parent | prev [-]

I never said it was special.

I was trying to correct the record that a lot of people will be using models of this size locally because of the local LLM community.

The most commonly downloaded local LLMs are normally <30b (e.g. https://huggingface.co/unsloth/models?sort=downloads). The things you're saying, especially when combined together, make it not usable by a lot of people in the local LLM community at the moment.

▲

GeorgeOldfield an hour ago | parent | prev [-]

do you guys understand that different experts are loaded PER TOKEN?

▲

dev_l1x_be 2 hours ago | parent | prev | next [-]

How do you split the model between multiple GPUs?

	▲	evilduck 2 hours ago \| parent [-]
		With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000. But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo...

▲

PlatoIsADisease an hour ago | parent | prev [-]

>The model absolutely can be run at home.

There is a huge difference between "look I got it to answer the prompt: '1+1='"

and actually using it for anything of value.

I remember early on people bought Macs (or some marketing team was shoveling it), and proposing people could reasonably run the 70B+ models on it.

They were talking about 'look it gave an answer', not 'look this is useful'.

While it was a bit obvious that 'integrated GPU' is not Nvidia VRAM, we did have 1 mac laptop at work that validated this.

Its cool these models are out in the open, but its going to be a decade before people are running them at a useful level locally.

	▲	esafak 42 minutes ago \| parent [-]
		Hear, hear. Even if the model fits, a few tokens per second make no sense. Time is money too.

▲

wongarsu 4 hours ago | parent | prev | next [-]

Which conveniently fits on one 8xH100 machine. With 100-200 GB left over for overhead, kv-cache, etc.

▲

Davidzheng 5 hours ago | parent | prev [-]

that's what intelligence takes. Most of intelligence is just compute