vladgur 3 hours ago

This is getting very close to fitting on a single 3090 with 24GB of VRAM :)

originalvichy 3 hours ago | parent | next [-]

Yup! Smaller quants will fit within 24GB, but they might sacrifice context length.

I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.

donmcronald 2 hours ago | parent | next [-]

I have a Mac Mini M4 Pro with 64GB of RAM at 273GB/s memory bandwidth, and it's borderline with 3.5-27B. I assume this one is the same. I don't know a ton, but I think memory bandwidth is what limits it. It's similar on a DGX Spark I have access to (almost the same memory bandwidth).

It's been a while since I tried it, but I think I was getting around 12-15 tokens per second, and that feels slow when you're used to the big commercial models. Whenever I actually want to do anything with open-source models, I always find myself falling back to OpenRouter.
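A rough sanity check on those numbers, assuming decode is purely memory-bandwidth-bound (each generated token has to stream all active weights from memory once, so tok/s can't exceed bandwidth divided by weight size; the model sizes and bandwidth below are just illustrative):

```python
# Back-of-envelope upper bound on decode speed for a
# memory-bandwidth-bound model: tok/s ≈ bandwidth / active-weight bytes.

def est_tokens_per_sec(active_params_b: float, bytes_per_param: float,
                       bandwidth_gbs: float) -> float:
    weight_gb = active_params_b * bytes_per_param  # billions of params × bytes each
    return bandwidth_gbs / weight_gb

# A 27B dense model at 4-bit (~0.5 bytes/param) on 273 GB/s:
print(round(est_tokens_per_sec(27, 0.5, 273), 1))  # → 20.2 (tok/s, upper bound)
```

Real throughput lands below this bound (kernel overhead, KV-cache reads), which lines up with the 12-15 tok/s observed.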

I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple of days ago and it felt usable speed-wise. I don't know about quality, but that's like running a 3B-parameter model. 27B is a lot slower.

I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable?

ycui1986 2 hours ago | parent | prev [-]

32GB of RAM on a Mac also needs to host the OS, software, and everything else. There may not even be 24GB left for the model.

GaggiX 3 hours ago | parent | prev [-]

At 4-bit quantization it should already fit quite nicely.
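A quick back-of-envelope on weight footprint (a sketch using a generic ~32B parameter count; real quants add some overhead for scales and often keep a few tensors at higher precision):

```python
# Approximate weight footprint at different quantization levels.
def weights_gb(params_b: float, bits: int) -> float:
    # params (billions) × bits / 8 bits-per-byte, ignoring quant overhead
    return params_b * bits / 8

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weights_gb(32, bits):.1f} GB")
# 16-bit: 64.0 GB / 8-bit: 32.0 GB / 4-bit: 16.0 GB
```

At 4-bit the weights alone land around 16GB, leaving several GB of a 24GB card for the KV cache and activations.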

Aurornis 2 hours ago | parent [-]

Unfortunately not with a reasonable context length.

kkzz99 2 hours ago | parent | next [-]

It really depends on what you think a reasonable context length is, but I can get 50k-60k on a 4090.

GaggiX an hour ago | parent | prev [-]

The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
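A rough sketch of why that helps, with hypothetical layer counts and head sizes: only the full-attention layers need a KV cache that grows with context, while the Gated DeltaNet layers carry a constant-size recurrent state.

```python
# KV-cache size for a hybrid model where only some layers do full
# attention (hypothetical shapes; linear-attention layers keep O(1) state).
def kv_cache_gb(full_attn_layers: int, kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2× for keys and values; bytes_per_elem = 2 for BF16
    return (2 * full_attn_layers * kv_heads * head_dim
            * ctx_len * bytes_per_elem) / 1e9

# e.g. 12 full-attention layers, 8 KV heads, head_dim 128, 64k context:
print(f"{kv_cache_gb(12, 8, 128, 65536):.2f} GB")  # → 3.22 GB
```

With those assumed shapes, even a 64k context costs only a few GB at BF16, compared to a fully-attentive model of the same depth needing several times that.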