Yeah, then there's prompt loading too.

But any hardware vendor that can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money.

That just sounds like a 3090.

	▲	cyanydeez 10 minutes ago \| parent [-]
		Not at the VRAM sizes that control how much context you can load; also, GPUs aren't as efficient as direct inference.
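
To put rough numbers on the VRAM point, here is a back-of-the-envelope sketch. It assumes a plain transformer with grouped-query attention and an fp16 KV cache. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real shape of any Qwen model, and the function names are made up for illustration. Models with hybrid or sliding-window attention would need a different formula.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence.

    Each layer stores two tensors (K and V), each of shape
    context_len x n_kv_heads x head_dim.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate size of the quantized weights, ignoring overhead."""
    return n_params * bits_per_weight // 8


def fits(vram_bytes: int, n_params: int, bits_per_weight: int,
         n_layers: int, n_kv_heads: int, head_dim: int,
         context_len: int, bytes_per_elem: int = 2) -> bool:
    """True if weights plus KV cache fit in the given VRAM budget."""
    total = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem)
    return total <= vram_bytes


# Hypothetical shape: 40 layers, 8 KV heads, head_dim 128 (assumed, not official).
SHAPE = dict(n_layers=40, n_kv_heads=8, head_dim=128)
GB = 10**9

if __name__ == "__main__":
    kv = kv_cache_bytes(context_len=100_000, **SHAPE)
    w = weight_bytes(35_000_000_000, 4)
    print(f"KV cache: {kv / GB:.1f} GB, weights: {w / GB:.1f} GB")
    print("fits in 24 GB:", fits(24 * GB, 35_000_000_000, 4,
                                 context_len=100_000, **SHAPE))
```

Under those assumed numbers, 4-bit weights plus a 100k-token fp16 cache come out well above the 24 GB on a 3090. Real figures depend on the actual architecture and on whether the cache itself is quantized.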