Dimensionality Reduction
Introduction
Modern datasets often contain hundreds or thousands of variables. A genomics study might measure expression levels of tens of thousands of genes across only a few hundred samples. At that scale, computation becomes expensive, statistical estimates become unstable, and the data resist visualization and interpretation.
Dimensionality reduction addresses these challenges by finding lower-dimensional representations that capture the essential structure of high-dimensional data. The goal is to reduce the number of variables while preserving as much relevant information as possible—retaining variance, preserving distances, or maintaining neighborhood relationships.
Beyond efficiency, dimensionality reduction supports visualization (projecting to two or three dimensions), noise reduction, feature extraction, and mitigation of the curse of dimensionality.
The Curse of Dimensionality
The curse of dimensionality refers to phenomena that arise in high-dimensional spaces.
One manifestation is sparsity. The volume of a space grows exponentially with dimension. In a unit hypercube in $d$ dimensions, a sub-cube spanning 10% of the range along each axis contains only $0.1^d$ of the total volume, so any fixed number of samples covers the space ever more sparsely as $d$ grows.
Another phenomenon is distance concentration. For independent random vectors in high dimensions, pairwise Euclidean distances concentrate around a common value: the relative gap between the largest and smallest distances shrinks toward zero as $d \to \infty$.
Distances become nearly indistinguishable, which weakens nearest-neighbor methods.
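A small numerical sketch of distance concentration, using uniform points and arbitrarily chosen dimensions (the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=200):
    """Relative spread (max - min) / min of pairwise distances in d dimensions."""
    x = rng.uniform(size=(n, d))
    # All pairwise differences via broadcasting; keep only the upper triangle.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    pairs = dist[np.triu_indices(n, k=1)]
    return (pairs.max() - pairs.min()) / pairs.min()

# The spread collapses as dimension grows.
for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(d), 3))
```

In two dimensions some pairs are far closer than others; by $d = 1000$ every pair is roughly equidistant.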
Dimensionality reduction alleviates these effects by projecting data into lower-dimensional subspaces where structure is more meaningful.
Linear Dimensionality Reduction
Given data points $x_1, \dots, x_n \in \mathbb{R}^d$, stacked as the rows of a matrix $X \in \mathbb{R}^{n \times d}$, a linear method seeks a projection $z_i = W^\top x_i$ with $W \in \mathbb{R}^{d \times k}$ and $k \ll d$.
Principal Component Analysis
PCA finds orthogonal directions of maximal variance.
Assume centered data with sample covariance $C = \frac{1}{n} X^\top X$. The first principal direction solves $\max_{\|w\|=1} w^\top C w$.
Using Lagrange multipliers, stationarity of $w^\top C w - \lambda (w^\top w - 1)$ gives $C w = \lambda w$, so the maximizer is an eigenvector and the attained variance is its eigenvalue $\lambda$.
Thus principal components are eigenvectors of $C$, ordered by decreasing eigenvalue.
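A minimal PCA sketch via eigendecomposition of the sample covariance (the data and variable names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic data

Xc = X - X.mean(axis=0)             # centering is essential
C = Xc.T @ Xc / len(Xc)             # sample covariance
lam, W = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]    # sort descending by variance

Z = Xc @ W[:, :2]                   # project onto the top-2 components
print(lam)                          # variance along each principal direction
```

The variance of each column of `Z` equals the corresponding eigenvalue, confirming the derivation above.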
Reconstruction View
Projection onto the first $k$ components reconstructs each point as $\hat{x}_i = \sum_{j=1}^{k} (w_j^\top x_i)\, w_j$,
with mean squared error $\frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2 = \sum_{j=k+1}^{d} \lambda_j$, the sum of the discarded eigenvalues.
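The identity between reconstruction error and discarded eigenvalues can be checked numerically (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)

lam, W = np.linalg.eigh(Xc.T @ Xc / len(Xc))
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

k = 3
X_hat = Xc @ W[:, :k] @ W[:, :k].T     # project onto top-k, then reconstruct
mse = ((Xc - X_hat) ** 2).sum(axis=1).mean()
print(np.isclose(mse, lam[k:].sum()))  # mean squared error = sum of discarded eigenvalues
```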
PCA via SVD
Let $X = U \Sigma V^\top$ be the SVD of the centered data matrix.
Columns of $V$ are the principal directions, and the covariance eigenvalues are $\lambda_j = \sigma_j^2 / n$.
The projected data equals $X V_k = U_k \Sigma_k$, so the principal components can be read off the SVD without ever forming the covariance matrix.
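A short sketch verifying that the SVD route agrees with the covariance eigendecomposition (up to column signs):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = s ** 2 / len(Xc)                       # covariance eigenvalues from singular values

lam_eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(Xc)))[::-1]
print(np.allclose(lam_svd, lam_eig))             # same spectrum

k = 2
print(np.allclose(Xc @ Vt[:k].T, U[:, :k] * s[:k]))  # X V_k equals U_k Sigma_k
```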
Variance Explained
The fraction of variance captured by the first $k$ components is $\sum_{j=1}^{k} \lambda_j \big/ \sum_{j=1}^{d} \lambda_j$. Plotting the eigenvalues against $j$ (a scree plot), or the cumulative fraction against $k$, is a common way to choose the number of components.
Singular Value Decomposition
Any real $m \times n$ matrix $A$ factors as $A = U \Sigma V^\top$, with $U$ and $V$ orthogonal and $\Sigma$ diagonal with nonnegative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$, the singular values.
Truncated SVD: $A_k = U_k \Sigma_k V_k^\top$ keeps only the $k$ largest singular values and their singular vectors.
Eckart–Young theorem: $A_k$ is the best rank-$k$ approximation of $A$ in both the Frobenius and spectral norms, with $\|A - A_k\|_F^2 = \sum_{j > k} \sigma_j^2$.
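A numerical sketch of the Eckart–Young theorem on a small random matrix: the truncated SVD achieves the predicted Frobenius error, and an arbitrary rank-$k$ competitor does no better.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]       # truncated SVD approximation
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))  # error = sqrt of discarded sigma^2

# Any other rank-k matrix, e.g. projection onto a random k-dim subspace, is no better.
P = rng.normal(size=(6, k))
B = A @ P @ np.linalg.pinv(P)           # a rank-k matrix
print(np.linalg.norm(A - B, "fro") >= err)
```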
Other Linear Methods
Factor Analysis models observations as linear combinations of a small number of latent factors plus independent noise, $x = \mu + \Lambda z + \varepsilon$; unlike PCA, it separates shared structure from variable-specific noise.
Linear Discriminant Analysis is supervised: it seeks directions that maximize between-class scatter relative to within-class scatter, $\max_w \frac{w^\top S_B w}{w^\top S_W w}$.
Solution: leading eigenvector of $S_W^{-1} S_B$.
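A two-class LDA sketch on synthetic Gaussian clusters, computing the leading eigenvector of $S_W^{-1} S_B$ directly (class means and scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))  # class 0
X1 = rng.normal(loc=[2, 1], scale=0.5, size=(100, 2))  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
d = (m1 - m0)[:, None]
S_B = d @ d.T                                            # between-class scatter

vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = vecs[:, np.argmax(vals.real)].real                   # discriminant direction
print(w / np.linalg.norm(w))
```

For two classes this eigenvector is proportional to the closed-form direction $S_W^{-1}(m_1 - m_0)$, since $S_B$ has rank one.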
Nonlinear Methods
Multidimensional Scaling
Embeds points so that pairwise distances in the embedding match a given dissimilarity matrix. Classical MDS double-centers the squared distances and takes the top eigenvectors of the result.
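A classical MDS sketch: starting from only a Euclidean distance matrix, double-centering recovers coordinates that reproduce the original distances (up to rotation and translation).

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(50, 3))                     # ground-truth points (hidden from MDS)
D2 = ((Y[:, None] - Y[None]) ** 2).sum(-1)       # squared pairwise distances

n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
B = -0.5 * J @ D2 @ J                            # Gram matrix of centered points
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
Z = vecs[:, order[:3]] * np.sqrt(vals[order[:3]])  # 3-D embedding

# The embedding reproduces the original distances.
D2_hat = ((Z[:, None] - Z[None]) ** 2).sum(-1)
print(np.allclose(D2, D2_hat))
```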
Isomap
Approximates geodesic distances along the data manifold by shortest paths in a nearest-neighbor graph, then applies classical MDS to the resulting distance matrix.
Locally Linear Embedding
Expresses each point as a weighted combination of its nearest neighbors, then finds low-dimensional coordinates that preserve those reconstruction weights.
t-SNE
Converts pairwise distances into neighbor probabilities, Gaussian in the high-dimensional space and Student-t in the embedding, and minimizes the KL divergence between the two distributions.
UMAP
Constructs a fuzzy simplicial complex and optimizes a low-dimensional embedding.
Practical Considerations
Centering is essential for PCA; without it, the leading direction is pulled toward the data mean rather than the direction of maximal variance. Scaling to unit variance is necessary when variables differ in units, since otherwise high-variance variables dominate the components.
Full SVD of an $n \times d$ matrix costs $O(\min(n d^2, n^2 d))$; when only the top $k$ components are needed, truncated and randomized algorithms are substantially cheaper.
PCA allows out-of-sample projection via the learned components; many nonlinear methods, t-SNE in particular, define no natural mapping for new points.
Applications
PCA supports image compression, genomics population analysis, latent semantic analysis, financial factor modeling, and single-cell visualization.
Summary
Dimensionality reduction maps high-dimensional data to lower-dimensional representations while preserving structure. PCA remains the primary linear method due to simplicity and interpretability. Nonlinear methods such as t-SNE and UMAP reveal manifold structure for visualization. The appropriate method depends on the objective, data geometry, and computational constraints.