▲ | ModelForge 5 days ago | |
No, the compiled version is actually faster. From that table, the A100 tok/sec numbers (larger is faster) are:

- Eager: 28
- Compiled: 128

And

- KV cache eager: 26
- KV cache compiled: 99

The KV cache is likely slower because it's not GPU-optimized code. On CPU the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`ting them on the fly.
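A rough sketch of the pre-allocation idea, not the code behind the table: the class name `PreallocatedKVCache`, the helper `append_with_cat`, and the `(batch, heads, seq, head_dim)` layout are all assumptions made up for illustration. The point is only that the buffers are allocated once and each step writes into a slice, whereas `torch.cat` allocates a new tensor and copies the whole cache every step. I haven't benchmarked this version.

```python
import torch


def append_with_cat(k_cache, v_cache, k_new, v_new):
    """Baseline: grow the cache with torch.cat (new allocation + full copy per step)."""
    if k_cache is None:
        return k_new, v_new
    return torch.cat([k_cache, k_new], dim=2), torch.cat([v_cache, v_new], dim=2)


class PreallocatedKVCache:
    """Hypothetical fixed-size KV cache; layout (batch, heads, seq, head_dim) is assumed."""

    def __init__(self, batch, heads, max_len, head_dim,
                 device="cpu", dtype=torch.float32):
        shape = (batch, heads, max_len, head_dim)
        # Allocate once, directly on the target device.
        self.k = torch.zeros(shape, device=device, dtype=dtype)
        self.v = torch.zeros(shape, device=device, dtype=dtype)
        self.pos = 0  # number of sequence positions filled so far

    def append(self, k_new, v_new):
        """Write new keys/values in place and return views of the filled part."""
        end = self.pos + k_new.shape[2]
        if end > self.k.shape[2]:
            raise ValueError("KV cache is full")
        self.k[:, :, self.pos:end] = k_new
        self.v[:, :, self.pos:end] = v_new
        self.pos = end
        return self.k[:, :, :end], self.v[:, :, :end]
```

Usage would be something like `cache = PreallocatedKVCache(1, 12, 1024, 64, device="cuda")` and then `k, v = cache.append(k_new, v_new)` inside the attention layer at each decoding step. The trade-off is that you have to pick `max_len` up front and pay for that memory even on short generations.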
▲ | ladberg 5 days ago | parent [-] | |
Ah yep, I read the labels backwards and meant that - ty for catching it and for the explanation