shikon7 | 9 days ago
Yes, the forward pass does next-token prediction on all input tokens (so we know exactly how many of the small model's tokens matched). The expensive part is not the computation but the memory bandwidth, since each pass needs to load the model weights from memory. If the small model predicts some tokens correctly, you save some passes, at the expense of some wasted computation on the tokens that were wrong. In any case, each forward pass yields at least one new token.
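
To make the accept/verify step concrete, here is a minimal Python sketch of the greedy variant of this idea. The names `draft_model` and `target_model` are placeholders (assumed to return next-token logits for every position); real speculative decoding implementations typically use probabilistic rejection sampling rather than exact token matching, but the greedy case shows why one big-model pass can confirm several drafted tokens and still always yields at least one new token.

    import torch

    def speculative_step(target_model, draft_model, tokens, k=4):
        """One round: draft k tokens cheaply, verify them with a single
        forward pass of the big model, keep the longest matching prefix."""
        # 1. Small model proposes k tokens, one at a time (cheap passes).
        draft = tokens.clone()
        for _ in range(k):
            logits = draft_model(draft)          # [seq_len, vocab]
            next_tok = logits[-1].argmax()       # greedy draft token
            draft = torch.cat([draft, next_tok.view(1)])

        # 2. One expensive pass of the big model over the whole draft.
        #    Because it predicts the next token at *every* position, we
        #    get its choice for each drafted position from a single pass.
        target_logits = target_model(draft)      # [seq_len + k, vocab]
        target_preds = target_logits.argmax(dim=-1)

        # 3. Accept drafted tokens as long as they agree with the big model.
        n_input = tokens.shape[0]
        accepted = 0
        for i in range(k):
            # target_preds[n_input - 1 + i] is the big model's prediction
            # for the position the i-th drafted token occupies.
            if draft[n_input + i] == target_preds[n_input - 1 + i]:
                accepted += 1
            else:
                break

        # 4. Even if nothing matched, the big model's own prediction at the
        #    first mismatching position gives one guaranteed new token.
        new_tokens = draft[n_input:n_input + accepted]
        bonus = target_preds[n_input - 1 + accepted].view(1)
        return torch.cat([tokens, new_tokens, bonus])

So each round costs k cheap draft passes plus one big-model pass, and returns between 1 and k+1 new tokens, which is where the memory-bandwidth savings come from.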