What counts as a lot of memory? What could someone do with 16 GB of RAM?

throwawayffffas 2 hours ago

Not much. The capable models won't fit unless you go with very low-bit quantization, and that costs a lot of quality.

You generally want to run q8, or at least some kind of "6-bit" quantization.
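For a rough sense of why the bit width matters so much, here is a back-of-the-envelope sketch. It assumes the weights dominate and that every parameter is stored at the same bit width; `weight_memory_gb` is a made-up helper name, and real quant formats store extra scale data, so treat the result as a lower bound.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    Assumption: uniform bit width, no per-block scales or other overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for bits in (16, 8, 6, 4):
        size = weight_memory_gb(35, bits)
        print(f"35B parameters at {bits}-bit: about {size:.1f} GB of weights")
```

This counts weights only. The context (KV cache) and runtime buffers come on top.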

In my experience, 40 GB of VRAM is the entry point: you can run qwen 3.6 35b a3b with full context, or qwen 27b with about 92k of context.
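Context length eats memory too. A minimal sketch of the usual KV cache estimate, assuming standard attention where every layer caches keys and values for every token, stored in 16-bit. The layer count, KV head count, and head size below are hypothetical placeholders, not the real configuration of any of the models named here.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> float:
    """Approximate KV cache size in GB (10^9 bytes) for one session."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the arithmetic.
    size = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=92_000)
    print(f"KV cache at 92k tokens: about {size:.1f} GB")
```

Models that use sliding-window or other non-standard attention, or a quantized cache, will need less than this estimate.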

Before you get fully discouraged: you don't need one GPU with 40 GB. You can use multiple cards, with minimal impact on performance.
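One common way to use several cards is to give each one a contiguous slice of the model's layers, sized by how much VRAM it has. The sketch below only illustrates that idea; `split_layers` is a hypothetical helper, not the API of any particular inference engine, and it assumes all layers are roughly the same size.

```python
def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign a layer count to each card, proportional to its VRAM."""
    total = sum(vram_gb)
    # Ideal (fractional) number of layers per card.
    shares = [n_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]
    # Rounding down leaves a few layers unassigned; give them to the cards
    # whose ideal share had the largest fractional remainder.
    leftover = n_layers - sum(counts)
    by_remainder = sorted(
        range(len(vram_gb)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical setup: a 24 GB card plus a 16 GB card, 48-layer model.
    print(split_layers(48, [24, 16]))
```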

zozbot234 2 hours ago

Modern inference engines can stream weights in from SSD to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (The jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. That case is interesting, since the big third-party suppliers don't really offer a way of doing this at reasonable cost, but it's a bit of a niche.)
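A rough way to see why streaming is slow, and why batching might help: if the weights needed for one forward pass have to be read from SSD every time, read bandwidth caps the token rate. The sketch below assumes nothing is cached in RAM, that compute is never the bottleneck, and that a batch of sessions shares a single read pass. The function name and the numbers are illustrative assumptions, not measurements.

```python
def streamed_tokens_per_second(
    weights_gb_per_pass: float,
    ssd_gb_per_s: float,
    batch_size: int = 1,
) -> float:
    """Upper bound on total tokens/s when weights are re-read for every pass.

    Assumption: one pass over the weights yields one token per session in
    the batch, so the read cost is shared across the whole batch.
    """
    passes_per_second = ssd_gb_per_s / weights_gb_per_pass
    return batch_size * passes_per_second


if __name__ == "__main__":
    # Hypothetical numbers: 35 GB read per pass, 5 GB/s of SSD bandwidth.
    single = streamed_tokens_per_second(35, 5)
    batched = streamed_tokens_per_second(35, 5, batch_size=14)
    print(f"single session: at most {single:.2f} tokens/s")
    print(f"14 sessions batched: at most {batched:.2f} tokens/s in total")
```

Under these assumptions each individual session is still slow. Batching only raises the total throughput, which is why it suits the overnight case.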

abalashov 2 hours ago

Not a ton. I'd say 64 GB is the minimum to play with, and 96-128 GB is better.

	throwawayffffas 2 hours ago
	Nah, you can run the 24b-35b class with between 90k and 256k of context in about 40 GB, and they are pretty good. The MoE variants especially fit neatly in 40 GB.

ValdikSS 2 hours ago

Gemma e2b, Gemma e4b. They're basically made for smartphones. You can run e2b with 8 GB of RAM.

trouve_search 2 hours ago

gemma 12B at a 4-bit quant; try something with MTP and an AWQ quant

monegator 2 hours ago

gemma runs pretty well