I wonder how different their method actually is from other sub-quadratic sparse attention methods like Reformer (LSH attention) [1] and Routing Transformer (clustering-based routing attention) [2]; a rough sketch of the bucketing idea those two share is below.
[1]: https://arxiv.org/abs/2001.04451
[2]: https://arxiv.org/abs/2003.05997
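
For context, here's a minimal toy sketch (my own code, not either paper's implementation) of the idea both share: assign positions to buckets, then run ordinary softmax attention only within each bucket, so the cost depends on bucket size rather than the full sequence. The function name, `n_buckets`, and the random-hyperplane hash are illustrative assumptions; Reformer hashes tied query/key vectors with LSH, while Routing Transformer clusters them with learned centroids.

```python
import numpy as np

def bucketed_attention(q, k, v, n_buckets=8, seed=0):
    """Toy single-head attention restricted to hash buckets.

    q, k, v: (seq_len, d) arrays. Buckets come from a random-hyperplane hash
    (LSH-style, roughly as in Reformer, where queries and keys are tied);
    Routing Transformer instead assigns positions to learned cluster centroids,
    but the within-bucket attention step is analogous.
    """
    rng = np.random.default_rng(seed)
    seq_len, d = q.shape
    n_bits = int(np.ceil(np.log2(n_buckets)))
    planes = rng.normal(size=(d, n_bits))
    # Sign pattern of a few random projections -> bucket id per position.
    bits = (q @ planes > 0).astype(int)
    bucket_ids = bits @ (2 ** np.arange(n_bits))

    out = np.zeros_like(v)
    for b in np.unique(bucket_ids):
        idx = np.where(bucket_ids == b)[0]
        # Full softmax attention, but only among positions in this bucket.
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ v[idx]
    return out

# Quick usage check on random data (self-attention, so q = k = v).
x = np.random.default_rng(1).normal(size=(128, 16))
y = bucketed_attention(x, x, x)
print(y.shape)  # (128, 16)
```

In both papers the bucket size is kept roughly constant (so the number of buckets grows with sequence length), which is what makes the overall cost sub-quadratic rather than O(n^2).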