High-performance implementation of Cooley–Tukey algorithm designed for FFT & NTT with support for:
- [✅] radix-2
- [✅] NTT / finite field elements
- [✅] FFT / complex field elements
- [✅] SIMD vectorization for the butterfly operations
- [✅] parallelization of independent operations
- [✅] cache-friendly memory access patterns and alignment
- [✅] benchmarking support
- [❌] radix-4, mixed-radix