How are we supposed to think about SIMD in Big-O? Because this is still linear time if the kernel width is less than the max SIMD width (which is 16 I think on x64?)