Derivation of DFT and FFT

The Discrete Fourier Transform (DFT) is the fundamental tool for analyzing the spectral content of finite-length discrete-time signals. To derive the DFT, we begin with the Discrete-Time Fourier Transform (DTFT).

Let x[n]x[n] be a discrete sequence of length NN, defined for 0nN10 \le n \le N-1. The DTFT of this sequence is a continuous function of frequency ω\omega, defined as:

X(ejω)=n=0N1x[n]ejωnX(e^{j\omega}) = \sum_{n=0}^{N-1} x[n] e^{-j\omega n}

Since X(ejω)X(e^{j\omega}) is continuous, it is not directly suitable for digital computation. To obtain a computable representation, we sample the frequency spectrum at NN equally spaced points in the interval [0,2π)[0, 2\pi). Let the sampling frequencies be:

ωk=2πNk,k=0,1,,N1\omega_k = \frac{2\pi}{N}k, \quad k = 0, 1, \dots, N-1

Substituting ωk\omega_k into the DTFT expression results in the definition of the DFT, denoted as X[k]X[k]:

X[k]=X(ejω)ω=2πkN=n=0N1x[n]ej2πNkn,k=0,1,,N1X[k] = X(e^{j\omega}) \Big|_{\omega = \frac{2\pi k}{N}} = \sum_{n=0}^{N-1} x[n] e^{-j\frac{2\pi}{N}kn}, \quad k = 0, 1, \dots, N-1

For notational conciseness, we introduce the twiddle factor (or phase factor) WNW_N:

WN=ej2πNW_N = e^{-j\frac{2\pi}{N}}

Thus, the standard synthesis and analysis equations for the DFT are formally defined as:

DFT:X[k]=n=0N1x[n]WNkn,0kN1\text{DFT:} \quad X[k] = \sum_{n=0}^{N-1} x[n] W_N^{kn}, \quad 0 \le k \le N-1 IDFT:x[n]=1Nk=0N1X[k]WNkn,0nN1\text{IDFT:} \quad x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] W_N^{-kn}, \quad 0 \le n \le N-1

Direct computation of the DFT based on the above equation requires NN complex multiplications and N1N-1 complex additions for each of the NN values of kk. Therefore, the total computational complexity is O(N2)O(N^2).

Derivation of the Fast Fourier Transform (FFT)

The Fast Fourier Transform (FFT) describes a family of algorithms designed to compute the DFT efficiently. Here, we derive the classic Radix-2 Decimation-in-Time (DIT) Cooley-Tukey algorithm.

Assumption: Let NN be a power of 2, i.e., N=2MN = 2^M.

The derivation relies on the “Divide and Conquer” strategy. We split the summation in the DFT formula into two parts: one for the even-indexed samples and one for the odd-indexed samples.

X[k]=n=0N1x[n]WNkn=n evenx[n]WNkn+n oddx[n]WNknX[k] = \sum_{n=0}^{N-1} x[n] W_N^{kn} = \sum_{n \text{ even}} x[n] W_N^{kn} + \sum_{n \text{ odd}} x[n] W_N^{kn}

Let n=2rn = 2r for the even part and n=2r+1n = 2r + 1 for the odd part, where r=0,1,,N21r = 0, 1, \dots, \frac{N}{2}-1.

X[k]=r=0N21x[2r]WNk(2r)+r=0N21x[2r+1]WNk(2r+1)X[k] = \sum_{r=0}^{\frac{N}{2}-1} x[2r] W_N^{k(2r)} + \sum_{r=0}^{\frac{N}{2}-1} x[2r+1] W_N^{k(2r+1)}

We can factor out the term WNkW_N^k from the second summation:

X[k]=r=0N21x[2r](WN2)kr+WNkr=0N21x[2r+1](WN2)krX[k] = \sum_{r=0}^{\frac{N}{2}-1} x[2r] (W_N^2)^{kr} + W_N^k \sum_{r=0}^{\frac{N}{2}-1} x[2r+1] (W_N^2)^{kr}

Using the property of the twiddle factor WN2=ej2πN2=ej2πN/2=WN/2W_N^2 = e^{-j\frac{2\pi}{N} \cdot 2} = e^{-j\frac{2\pi}{N/2}} = W_{N/2}, the equation becomes:

X[k]=r=0N21x[2r]WN/2krDFTN/2 of even terms+WNkr=0N21x[2r+1]WN/2krDFTN/2 of odd termsX[k] = \underbrace{\sum_{r=0}^{\frac{N}{2}-1} x[2r] W_{N/2}^{kr}}_{\text{DFT}_{N/2} \text{ of even terms}} + W_N^k \underbrace{\sum_{r=0}^{\frac{N}{2}-1} x[2r+1] W_{N/2}^{kr}}_{\text{DFT}_{N/2} \text{ of odd terms}}

Let G[k]G[k] be the N/2N/2-point DFT of the even sequence x[2r]x[2r], and H[k]H[k] be the N/2N/2-point DFT of the odd sequence x[2r+1]x[2r+1]. We can write:

X[k]=G[k]+WNkH[k]X[k] = G[k] + W_N^k H[k]

However, the indices of G[k]G[k] and H[k]H[k] are modulo N/2N/2. Due to the periodicity of the DFT (G[k+N/2]=G[k]G[k + N/2] = G[k]), we can compute X[k]X[k] for the second half of the spectrum (N/2k<NN/2 \le k < N) using the same G[k]G[k] and H[k]H[k] values.

Utilizing the symmetry property of the twiddle factor WNk+N/2=WNkW_N^{k + N/2} = -W_N^k, we derive the Butterfly Operations:

X[k]=G[k]+WNkH[k],0k<N2X[k] = G[k] + W_N^k H[k], \quad 0 \le k < \frac{N}{2} X[k+N2]=G[k]WNkH[k],0k<N2X[k + \frac{N}{2}] = G[k] - W_N^k H[k], \quad 0 \le k < \frac{N}{2}

By recursively applying this decomposition, an NN-point DFT is reduced to M=log2NM = \log_2 N stages. Each stage involves N/2N/2 butterfly operations. Consequently, the total complexity is reduced from O(N2)O(N^2) to O(Nlog2N)O(N \log_2 N).

Introduction to FFTW

FFTW (Fastest Fourier Transform in the West) is a comprehensive C subroutine library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data. Developed by Matteo Frigo and Steven G. Johnson at the Massachusetts Institute of Technology (MIT), FFTW has established itself as the de facto benchmark for performance in the scientific computing community.

Unlike traditional FFT implementations that rely on a fixed algorithm (e.g., a static radix-2 Cooley-Tukey implementation), FFTW employs a novel architecture designed to adapt to the underlying hardware. Its superior performance is primarily attributed to three key architectural features:

The Planner Mechanism

The core innovation of FFTW is the planner. DSP architectures vary significantly in terms of register file size, cache associativity, and instruction pipeline depth. An algorithm that is optimal on one machine may perform poorly on another.

Instead of imposing a single algorithmic strategy, FFTW’s planner decomposes the problem at runtime. It measures the actual execution time of different algorithmic compositions (called “plans”) on the target hardware and selects the fastest one. This empirical search allows FFTW to automatically exploit hardware-specific features without requiring manual tuning by the user.

Code Generation and Codelets

FFTW achieves extreme optimization through metaprogramming. The critical computational kernels, known as codelets, are not written by hand. Instead, they are generated by a special-purpose compiler written in OCaml, named genfft.

genfft produces highly optimized C code that includes:

  • Extensive loop unrolling to minimize branching overhead.
  • Algebraic simplification of constants and trigonometric values.
  • Optimal scheduling of floating-point instructions to hide latency.

This automated generation allows FFTW to contain thousands of optimized kernels for various small sizes (e.g., N=3,5,7,11,N=3, 5, 7, 11, \dots) and architectural constraints.