coder543 5 hours ago

MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.
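To give a sense of why that overhead is manageable, here's a back-of-the-envelope sketch. All dimensions below are made up for illustration (they aren't from any specific model); the point is just that a shallow draft head's KV cache is a small fraction of the target model's.

```python
# Rough estimate of the extra memory a *separate* draft/MTP KV cache costs,
# versus sharing the target model's cache. Dimensions are hypothetical.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each hold layers * kv_heads * head_dim * seq_len elements,
    # hence the factor of 2. bytes_per_elem=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768  # context length

# Hypothetical 60-layer target model with 8 KV heads of dim 128:
target = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=seq_len)
# Hypothetical 2-layer MTP draft head with the same KV geometry:
draft = kv_cache_bytes(layers=2, kv_heads=8, head_dim=128, seq_len=seq_len)

print(f"target KV cache:     {target / 2**30:.2f} GiB")
print(f"extra draft KV cache: {draft / 2**30:.2f} GiB")
```

With these numbers the separate draft cache adds about 0.25 GiB on top of a 7.5 GiB target cache — noticeable, but manageable, which is consistent with the parent's point. Sharing the target's cache, as the article describes, drops even that.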

a_e_k 5 hours ago | parent [-]

From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.

coder543 5 hours ago | parent [-]

That's great news. That hasn't been the case with other MTP implementations, like Qwen3.5's, but I see the section of the article where it says Google introduced some architectural optimizations to make this possible.