zozbot234 10 hours ago

Shouldn't FlashAttention address the quadratic growth in memory footprint (w.r.t. sequence length) during fine-tuning/training? I'm also fairly sure it doesn't apply to pure inference, because of how KV caching works.
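A rough back-of-the-envelope sketch of the distinction being made: materializing the full attention score matrix is quadratic in sequence length, while the KV cache grows only linearly. All shapes, dtype sizes, and function names below are my own illustrative assumptions, not from any particular model:

```python
# Illustrative memory estimates; fp16 (2 bytes/element) assumed throughout.

def naive_attn_matrix_bytes(seq_len: int, n_heads: int, bytes_per_el: int = 2) -> int:
    """Naive attention materializes a (seq_len x seq_len) score matrix per head:
    memory grows quadratically with sequence length."""
    return n_heads * seq_len * seq_len * bytes_per_el

def kv_cache_bytes(seq_len: int, n_heads: int, head_dim: int, bytes_per_el: int = 2) -> int:
    """The KV cache stores K and V vectors for every past token:
    memory grows linearly with sequence length."""
    return 2 * n_heads * seq_len * head_dim * bytes_per_el

# Hypothetical model config: 8192-token context, 32 heads, head_dim 128.
S, H, D = 8192, 32, 128
print(naive_attn_matrix_bytes(S, H) / 2**30)  # 4.0   GiB of score matrices
print(kv_cache_bytes(S, H, D) / 2**30)        # 0.125 GiB of KV cache
```

FlashAttention's contribution is computing attention in tiles so the quadratic score matrix is never written to GPU memory, which is why it matters most where that matrix would otherwise dominate (long-sequence training); in cached autoregressive decoding each new token attends via a 1-by-T row of scores, so there is no quadratic buffer to eliminate.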