coder543 5 hours ago
MTP requires a separate KV cache, so there is more memory overhead than just the weights of the MTP model, but it's a manageable amount.
a_e_k 5 hours ago
From the linked post, it didn't read like a separate KV cache was needed:

> The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out.