Remix clone Hacker News

	▲	0-_-0 2 hours ago
		This can't be used to save VRAM in practice. To generate a new token with the primary model, you first need to decompress the cache, which means regenerating the whole sequence from scratch. In other words, you would generate 1 million tokens with the small model just to generate one token with the large model.
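
A rough sketch of the arithmetic behind this objection, not a model of any real system. It assumes, as the comment does, that the entire sequence has to be regenerated by the small model before every large-model token, with nothing reused between steps. The function name `small_model_steps` and its parameters are hypothetical, chosen only for illustration.

```python
def small_model_steps(context_len: int, new_tokens: int) -> int:
    """Count small-model tokens regenerated while the large model emits `new_tokens`.

    Assumption (taken from the comment, not verified): the cache is
    decompressed by regenerating the whole sequence from scratch before
    each large-model token, and no intermediate state is kept.
    """
    # Before large-model token i, the sequence already holds context_len + i tokens.
    return sum(context_len + i for i in range(new_tokens))


# One new token on a 1M-token context costs a full 1M-token regeneration.
print(small_model_steps(1_000_000, 1))    # 1000000
# The cost grows with every further token, since the sequence keeps getting longer.
print(small_model_steps(1_000_000, 100))  # 100004950
```

Under that assumption the overhead per large-model token is proportional to the current sequence length, so the total cost of a long generation grows roughly quadratically. If a scheme could keep the decompressed cache around between steps the count would be much lower, but then the VRAM saving the comment is questioning would be lost as well.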