I recently needed a decently performing FFT. Instead of implementing Cooley-Tukey, I realized that the brute-force DFT essentially computes two vector×matrix products, so I interleaved and reshaped the matrices for sequential full-vector loads and implemented the brute-force version with AVX1 and FMA3 intrinsics. That is good enough for my use case: moderately sized FFTs where the matrices fit in L2 cache.
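To make the "two vector×matrix products" idea concrete, here is a minimal sketch in portable C++. It is not the actual implementation described above: it assumes real-valued input, uses plain loops instead of AVX/FMA intrinsics, and stores each table row as all cosines followed by all negated sines rather than any particular register-width interleaving. The names `DftPlan`, `makePlan` and `dftReal` are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical plan type: one table row per input sample j, holding
// n cosines followed by n negated sines, so the hot loop reads memory sequentially.
struct DftPlan {
    std::size_t n;
    std::vector<float> table; // n rows of 2*n floats
};

DftPlan makePlan(std::size_t n) {
    DftPlan plan{n, std::vector<float>(2 * n * n)};
    const double step = 2.0 * 3.14159265358979323846 / static_cast<double>(n);
    for (std::size_t j = 0; j < n; ++j) {
        float* row = plan.table.data() + j * 2 * n;
        for (std::size_t k = 0; k < n; ++k) {
            // Reduce j*k modulo n so the angle stays within one period.
            const double angle = step * static_cast<double>((j * k) % n);
            row[k] = static_cast<float>(std::cos(angle));
            row[n + k] = static_cast<float>(-std::sin(angle));
        }
    }
    return plan;
}

// Brute-force O(n^2) DFT of a real signal:
//   re = x * C   and   im = x * (-S)
// For each sample, one scalar is broadcast and multiply-added across a
// contiguous row; this is the loop shape that maps onto FMA instructions.
std::vector<std::complex<float>> dftReal(const DftPlan& plan,
                                         const std::vector<float>& x) {
    const std::size_t n = plan.n;
    assert(x.size() == n);
    std::vector<float> re(n, 0.0f), im(n, 0.0f);
    for (std::size_t j = 0; j < n; ++j) {
        const float xj = x[j];
        const float* row = plan.table.data() + j * 2 * n;
        for (std::size_t k = 0; k < n; ++k) {
            re[k] += xj * row[k];
            im[k] += xj * row[n + k];
        }
    }
    std::vector<std::complex<float>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        out[k] = std::complex<float>(re[k], im[k]);
    }
    return out;
}
```

The plan is built once and reused for every transform of the same length. The table takes 2·n² floats, which is why this approach only makes sense while the table fits in cache.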

HarHarVeryFunny 4 days ago | parent [-]

I'm curious why you wouldn't just use a library like FFTW or Intel's IPP (or NVidia's cuFFT if applicable)?

Const-me 4 days ago | parent [-]

For FFTW, the showstopper was the GPL license. For IPP, it was the 200 MB of binary dependencies; I also remember when Intel was caught testing specifically for Intel CPUs in their runtime libraries instead of checking CPUID feature bits, deliberately crippling performance on AMD CPUs. I literally don’t have any Intel CPUs left in this house. For cuFFT, the issue is vendor lock-in to nVidia.

And the problem is IMO too small to justify large dependencies. I only needed something like a 200×400 FFT as a minor component of a larger piece of software.

tkuraku 3 days ago | parent [-]

It would be interesting to see how it compares to https://gitlab.mpcdf.mpg.de/mtr/pocketfft. The C++ branch is header-only. I believe this is what SciPy uses by default.