Remix.run Logo
tripplyons 2 days ago

Have you explored chunkwise parallel approaches? They use the O(n log n) parallel algorithm within a subsequence and update bewteen chunks recurrently like the O(n) recurrent algorithm. These are usually the fastest kernels for these kinds of RNNs. Here is the best explanation I have seen: https://sustcsonglin.github.io/blog/2024/deltanet-2/