in-silico 2 days ago

I wonder how different their method actually is from other sub-quadratic sparse attention methods like Reformer [1] and Routing Transformer [2].

[1]: https://arxiv.org/abs/2001.04451

[2]: https://arxiv.org/abs/2003.05997
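For context on the comparison: Reformer's sub-quadratic cost comes from locality-sensitive hashing, where nearby vectors are bucketed via random projections and attention is restricted to within-bucket tokens. A minimal NumPy sketch of that hashing step (an illustration of the idea, not the paper's implementation; function and variable names are my own):

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    # Random-rotation LSH as used in Reformer: project each vector
    # onto random directions R and hash to argmax over [xR; -xR].
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # 16 token vectors of dimension 8
buckets = lsh_buckets(x, 4, rng)   # one bucket id per token, in [0, 4)

# Attention is then computed only among tokens sharing a bucket,
# so cost scales with bucket size rather than sequence length squared.
```

Routing Transformer replaces the hashing with learned k-means cluster assignments, but the within-cluster attention pattern is analogous.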