They say they are using TileRT (https://github.com/tile-ai/TileRT):
- persistent CUDA kernel
- tiled processing with overlapping reads and writes
- model designed with specific constraints in mind
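
To make the first two bullets concrete, here is a rough CPU-side sketch of the general pattern. It is not TileRT's code and uses none of its API. Every name in it (`run_persistent`, `persistent_worker`, `load_tile`, `compute_tile`) is hypothetical. The assumptions: a "persistent kernel" means a fixed set of workers launched once that keep claiming tiles from a shared counter (the role `atomicAdd` would play on a GPU), and "overlapping" means the load of the next tile is issued before the current tile is computed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def load_tile(data, tile_idx, tile_size):
    """Hypothetical stand-in for an async global-to-shared memory copy."""
    start = tile_idx * tile_size
    return data[start:start + tile_size]


def compute_tile(tile):
    """Hypothetical stand-in for the per-tile math (squares each element)."""
    return [x * x for x in tile]


def persistent_worker(data, out, tile_size, num_tiles, counter, lock):
    """One long-lived worker: keeps claiming tiles until none remain."""

    def claim():
        # The lock plays the role of an atomic increment on a global counter.
        with lock:
            idx = counter[0]
            counter[0] += 1
        return idx if idx < num_tiles else None

    # A single-threaded loader models one in-flight prefetch per worker.
    with ThreadPoolExecutor(max_workers=1) as loader:
        idx = claim()
        pending = None
        if idx is not None:
            pending = loader.submit(load_tile, data, idx, tile_size)
        while pending is not None:
            tile = pending.result()  # wait for the current tile's load
            nxt = claim()
            pending = None
            if nxt is not None:
                # Start loading the next tile before computing this one.
                pending = loader.submit(load_tile, data, nxt, tile_size)
            result = compute_tile(tile)
            start = idx * tile_size
            for i, value in enumerate(result):
                out[start + i] = value  # tiles are disjoint, so no write races
            idx = nxt


def run_persistent(data, tile_size=4, num_workers=3):
    """Launch the workers once and let them drain all tiles."""
    num_tiles = (len(data) + tile_size - 1) // tile_size
    out = [None] * len(data)
    counter = [0]
    lock = threading.Lock()
    workers = [
        threading.Thread(
            target=persistent_worker,
            args=(data, out, tile_size, num_tiles, counter, lock),
        )
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out
```

Calling `run_persistent(list(range(10)))` squares every element tile by tile. On a real GPU the point of this shape is presumably to avoid per-operation kernel launch overhead and to hide memory latency behind compute, but how TileRT actually does it is not something this sketch claims to show.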