samber 5 hours ago
Comparing compute cost against FlashAttention-2 doesn't seem very honest to me. FlashAttention-2 hasn't been used for at least 2 years. This architecture would have been a massive improvement 3 years ago, but it is a ~solved~ problem IMO.