GaggiX 3 hours ago

At 4-bit quantization it should already fit quite nicely.

Aurornis 2 hours ago | parent [-]

Unfortunately not with a reasonable context length.

kkzz99 2 hours ago | parent | next [-]

It really depends on what you think a reasonable context length is, but I can get 50k-60k tokens of context on a 4090.

GaggiX an hour ago | parent | prev [-]

The model uses Gated DeltaNet and Gated Attention so the memory usage of the KV cache is very low, even at BF16 precision.
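A back-of-the-envelope sketch of the point above: in a standard transformer the KV cache grows linearly with context length in every layer, but in a hybrid model only the full-attention layers pay that cost, since linear-attention layers like Gated DeltaNet keep a fixed-size recurrent state. All parameters below (layer count, head counts, head dim, hybrid ratio) are illustrative assumptions, not the actual model's configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each layer caches one key and one value vector per token (factor 2),
    # each of size num_kv_heads * head_dim; bytes_per_elem=2 means BF16.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical full-attention model: 48 layers, 8 KV heads (GQA),
# head_dim 128, 60k-token context, BF16 cache.
full = kv_cache_bytes(48, 8, 128, 60_000)

# Hypothetical hybrid: only 1 in 4 layers is full attention; the
# remaining layers use a constant-size linear-attention state, so only
# the 12 full-attention layers scale with sequence length.
hybrid = kv_cache_bytes(48 // 4, 8, 128, 60_000)

print(f"full attention : {full / 2**30:.1f} GiB")   # ~11.0 GiB
print(f"hybrid (1 in 4): {hybrid / 2**30:.1f} GiB")  # ~2.7 GiB
```

With these made-up numbers the cache shrinks roughly in proportion to the fraction of full-attention layers, which is why long contexts become feasible even without quantizing the cache.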