vitus 3 days ago

> The GPU implementation's logarithmic scaling becomes evident at longer sequence lengths.

I don't see logarithmic scaling, actually. From the table of GRU performance, going from 16384 -> 65536 (i.e., a 4x longer input) gives roughly a 4x increase in time for both CPU-scan and GPU-scan. Okay, maybe the inputs need to be bigger. Looking at the next plot, which goes up to 524288, we see the same behavior: the delta between CPU-scan and GPU-scan doubles each time we double the input. That means GPU-scan is faster by a constant multiplicative factor, not by a factor that grows with n. The same holds for LSTM performance.
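Quick back-of-the-envelope check (my own arithmetic, not numbers from the article): if GPU-scan time grew like n log n, a 4x input starting from n = 16384 should cost about 4 * 16/14 ≈ 4.6x, and a purely span-bound log n implementation would cost only ~1.1x. The observed ~4x is what plain linear scaling predicts:

    import math

    def expected_ratio(n, k=4, model="linear"):
        """Predicted time ratio when input length grows by k, per scaling model."""
        if model == "log":      # ideal span-bound parallel scan: t ~ log n
            return math.log2(k * n) / math.log2(n)
        if model == "linear":   # bandwidth/throughput bound: t ~ n
            return float(k)
        if model == "nlogn":    # work-bound parallel scan: t ~ n log n
            return k * math.log2(k * n) / math.log2(n)
        raise ValueError(model)

    for n in (16384, 131072):
        print(n, {m: round(expected_ratio(n, 4, m), 2)
                  for m in ("log", "linear", "nlogn")})
    # 16384  -> log: 1.14, linear: 4.0, nlogn: 4.57
    # 131072 -> log: 1.12, linear: 4.0, nlogn: 4.47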

Is this an artifact of the benchmark setup? Are we actually measuring the amount of time needed to load the full context into RAM? Or perhaps we're bottlenecked on memory bandwidth?

> Success: The gate extraction kernel, which was a huge bottleneck, now only takes 8% of the total time and is memory-bandwidth bound, saturating L2 bandwidth at 1.9 TB/s. This is a good place to be.

Sounds like that might be the case.
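One way to test the bandwidth hypothesis (a rough roofline sketch; the hidden size and fp16 dtype here are my assumptions, not numbers from the post): compute the minimum time needed just to stream the states through memory at the quoted 1.9 TB/s, and check whether the measured times sit near that floor and scale linearly with sequence length.

    def bandwidth_floor_ms(T, H=1024, bytes_per_elem=2, bw_bytes_per_s=1.9e12):
        """Lower bound on scan time if every state is read and written once.

        T: sequence length; H: hidden size (assumed); fp16 elements (assumed);
        bandwidth taken from the post's quoted 1.9 TB/s L2 figure.
        """
        bytes_moved = 2 * T * H * bytes_per_elem  # one read + one write per element
        return bytes_moved / bw_bytes_per_s * 1e3

    for T in (16384, 65536, 524288):
        print(T, round(bandwidth_floor_ms(T), 3), "ms")
    # the floor doubles each time T doubles -- exactly the linear scaling above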

tripplyons 2 days ago

Typically the fastest approaches for associative RNNs combine the advantages of the parallel O(n log n) algorithm with the non-parallel O(n) recurrence: results within each subsequence chunk are computed in parallel, and the state is then carried to the next chunk recurrently. This blog post explains the method (the chunkwise parallel algorithm) well: https://sustcsonglin.github.io/blog/2024/deltanet-2/
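A minimal numpy sketch of that idea (mine, not code from the linked post; the elementwise recurrence h_t = a_t * h_{t-1} + b_t stands in for a gated RNN state update):

    import numpy as np

    def chunkwise_scan(a, b, chunk=256):
        """Chunkwise-parallel evaluation of h_t = a[t] * h_{t-1} + b[t], h_0 = 0.

        Within a chunk, the prefix quantities are scans a GPU would run in
        parallel (cumprod/cumsum stand in for them here); across chunks only
        the final state is handed off, so the outer loop is the cheap
        O(n/chunk) recurrent part.
        """
        out = np.empty_like(b)
        h = 0.0
        for s in range(0, len(a), chunk):
            ac, bc = a[s:s+chunk], b[s:s+chunk]
            A = np.cumprod(ac)      # prefix products of the decay gates
            P = np.cumsum(bc / A)   # assumes a[t] != 0; a real kernel scans
                                    # (a, b) pairs associatively instead
            hc = A * (h + P)        # closed form: h_t = A_t * (h_0 + sum_j b_j / A_j)
            out[s:s+chunk] = hc
            h = hc[-1]              # recurrent hand-off to the next chunk
        return out

    # sanity check against the plain sequential recurrence
    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, 4096)
    b = rng.normal(size=4096)
    ref, h = np.empty_like(b), 0.0
    for t in range(len(a)):
        h = a[t] * h + b[t]
        ref[t] = h
    assert np.allclose(chunkwise_scan(a, b), ref)

The appeal is that the chunk size lets you trade off between the two regimes: large chunks amortize the sequential hand-offs, while the per-chunk work stays parallel and bandwidth-friendly.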