Remix.run Logo
aesthesia 11 hours ago

A top-k approximation still requires k forward passes; that's k times as expensive as just computing the exact value. Unless you're doing a prefix-unconditional prediction, in which case you still likely need quite a large token -> vector dictionary, and particularly for inner layers a significant amount of information left in the residual.

EGreg 10 hours ago | parent [-]

the k forward passes for different candidate tokens share all their prefix computation -- the KV cache up to position i-1 is identical for all candidates, so you run one pass through the shared layers and then k cheap single-token extensions. At long context lengths the shared prefix dominates the cost. This is also structurally what speculative decoding already does, so the infrastructure largely exists.