zackangelo | 7 hours ago
17B per token. So when you're generating a single stream of text ("decoding"), 17B parameters are active for each token. If you're decoding multiple streams concurrently, it's 17B per stream (some tokens will route to the same experts, so there is some overlap). When the model is ingesting the prompt ("prefilling"), it processes many tokens at once, so the number of active parameters is larger.
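A minimal sketch of why this happens, assuming a generic top-k MoE router (the expert count, top-k value, and random routing below are illustrative assumptions, not the model's actual architecture): each token activates only a few experts, but a long prompt's tokens collectively touch many more distinct experts than any single decode step does.

```python
import random

# Illustrative MoE layer config (assumed, not the real model's numbers)
NUM_EXPERTS = 16   # experts per MoE layer
TOP_K = 1          # experts routed per token

def route(token_id: int) -> set[int]:
    """Stand-in router: deterministically pick TOP_K experts for a token."""
    rng = random.Random(token_id)
    return set(rng.sample(range(NUM_EXPERTS), TOP_K))

# Decoding one stream: each step activates only TOP_K experts' weights.
decode_steps = [route(t) for t in range(8)]

# Prefilling a long prompt: the union of experts touched across all
# prompt tokens is far larger, so more total weights are active at once.
prompt_tokens = range(256)
touched = set().union(*(route(t) for t in prompt_tokens))

print(f"experts active per decode step: {TOP_K}")
print(f"distinct experts touched during prefill: {len(touched)} / {NUM_EXPERTS}")
```

With enough prompt tokens, the prefill pass ends up exercising essentially every expert in the layer, which is why batch prefill needs the full weight set resident even though any one token only uses a small slice of it.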