Dimensionality Reduction
Introduction
Modern datasets often contain hundreds or thousands of variables. A genomics study might measure expression levels of tens of thousands of genes across only a few hundred samples. At that scale, computation becomes expensive, statistical estimates become unstable, and the data resist visualization and interpretation.
Dimensionality reduction addresses these challenges by finding lower-dimensional representations that capture the essential structure of high-dimensional data. The goal is to reduce the number of variables while preserving as much relevant information as possible—retaining variance, preserving distances, or maintaining neighborhood relationships.
Beyond efficiency, dimensionality reduction supports visualization (projecting to two or three dimensions), noise reduction, feature extraction, and mitigation of the curse of dimensionality.
The Curse of Dimensionality
The curse of dimensionality refers to phenomena that arise in high-dimensional spaces.
One manifestation is sparsity. The volume of a space grows exponentially with dimension. In a unit hypercube in $d$ dimensions, a sub-cube spanning 10% of the range along each axis contains only $0.1^d$ of the total volume, so any fixed number of samples covers the space ever more sparsely as $d$ grows.
Another phenomenon is distance concentration. For independent random vectors in high dimensions, pairwise Euclidean distances concentrate around a common value: the relative gap between the largest and smallest distances shrinks toward zero as $d \to \infty$.
Distances become nearly indistinguishable, which weakens nearest-neighbor methods.
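A small numerical sketch of distance concentration, using uniform points and arbitrarily chosen dimensions (the helper name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(d, n=200):
    """Relative spread (max - min) / min of pairwise distances in d dimensions."""
    x = rng.uniform(size=(n, d))
    # All pairwise differences via broadcasting; keep only the upper triangle.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    pairs = dist[np.triu_indices(n, k=1)]
    return (pairs.max() - pairs.min()) / pairs.min()

# The spread collapses as dimension grows.
for d in (2, 10, 100, 1000):
    print(d, round(distance_spread(d), 3))
```

In two dimensions some pairs are far closer than others; by $d = 1000$ every pair is roughly equidistant.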
Dimensionality reduction alleviates these effects by projecting data into lower-dimensional subspaces where structure is more meaningful.
Linear Dimensionality Reduction
Given data points $x_1, \dots, x_n \in \mathbb{R}^d$, stacked as the rows of a matrix $X \in \mathbb{R}^{n \times d}$, a linear method seeks a projection $z_i = W^\top x_i$ with $W \in \mathbb{R}^{d \times k}$ and $k \ll d$.
Principal Component Analysis
PCA finds orthogonal directions of maximal variance.
Assume centered data with sample covariance $C = \frac{1}{n} X^\top X$. The first principal direction solves $\max_{\|w\|=1} w^\top C w$.
Using Lagrange multipliers, stationarity of $w^\top C w - \lambda (w^\top w - 1)$ gives $C w = \lambda w$, so the maximizer is an eigenvector and the attained variance is its eigenvalue $\lambda$.
Thus principal components are eigenvectors of $C$, ordered by decreasing eigenvalue.
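A minimal PCA sketch via eigendecomposition of the sample covariance (the data and variable names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated synthetic data

Xc = X - X.mean(axis=0)             # centering is essential
C = Xc.T @ Xc / len(Xc)             # sample covariance
lam, W = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]    # sort descending by variance

Z = Xc @ W[:, :2]                   # project onto the top-2 components
print(lam)                          # variance along each principal direction
```

The variance of each column of `Z` equals the corresponding eigenvalue, confirming the derivation above.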
Reconstruction View
Projection onto the first $k$ components reconstructs each point as $\hat{x}_i = \sum_{j=1}^{k} (w_j^\top x_i)\, w_j$,
with mean squared error $\frac{1}{n} \sum_{i=1}^{n} \|x_i - \hat{x}_i\|^2 = \sum_{j=k+1}^{d} \lambda_j$, the sum of the discarded eigenvalues.
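The identity between reconstruction error and discarded eigenvalues can be checked numerically (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)

lam, W = np.linalg.eigh(Xc.T @ Xc / len(Xc))
order = np.argsort(lam)[::-1]
lam, W = lam[order], W[:, order]

k = 3
X_hat = Xc @ W[:, :k] @ W[:, :k].T     # project onto top-k, then reconstruct
mse = ((Xc - X_hat) ** 2).sum(axis=1).mean()
print(np.isclose(mse, lam[k:].sum()))  # mean squared error = sum of discarded eigenvalues
```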
PCA via SVD
Let $X = U \Sigma V^\top$ be the SVD of the centered data matrix.
Columns of $V$ are the principal directions, and the covariance eigenvalues are $\lambda_j = \sigma_j^2 / n$.
The projected data equals $X V_k = U_k \Sigma_k$, so the principal components can be read off the SVD without ever forming the covariance matrix.
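A short sketch verifying that the SVD route agrees with the covariance eigendecomposition (up to column signs):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = s ** 2 / len(Xc)                       # covariance eigenvalues from singular values

lam_eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(Xc)))[::-1]
print(np.allclose(lam_svd, lam_eig))             # same spectrum

k = 2
print(np.allclose(Xc @ Vt[:k].T, U[:, :k] * s[:k]))  # X V_k equals U_k Sigma_k
```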
Variance Explained
The fraction of variance captured by the first $k$ components is $\sum_{j=1}^{k} \lambda_j \big/ \sum_{j=1}^{d} \lambda_j$. Plotting the eigenvalues against $j$ (a scree plot), or the cumulative fraction against $k$, is a common way to choose the number of components.
Singular Value Decomposition
Any real $m \times n$ matrix $A$ factors as $A = U \Sigma V^\top$, with $U$ and $V$ orthogonal and $\Sigma$ diagonal with nonnegative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$, the singular values.
Truncated SVD: $A_k = U_k \Sigma_k V_k^\top$ keeps only the $k$ largest singular values and their singular vectors.
Eckart–Young theorem: $A_k$ is the best rank-$k$ approximation of $A$ in both the Frobenius and spectral norms, with $\|A - A_k\|_F^2 = \sum_{j > k} \sigma_j^2$.
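A numerical sketch of the Eckart–Young theorem on a small random matrix: the truncated SVD achieves the predicted Frobenius error, and an arbitrary rank-$k$ competitor does no better.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]       # truncated SVD approximation
err = np.linalg.norm(A - A_k, "fro")
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))  # error = sqrt of discarded sigma^2

# Any other rank-k matrix, e.g. projection onto a random k-dim subspace, is no better.
P = rng.normal(size=(6, k))
B = A @ P @ np.linalg.pinv(P)           # a rank-k matrix
print(np.linalg.norm(A - B, "fro") >= err)
```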
Other Linear Methods
Factor Analysis models observations as linear combinations of a small number of latent factors plus independent noise, $x = \mu + \Lambda z + \varepsilon$; unlike PCA, it separates shared structure from variable-specific noise.
Linear Discriminant Analysis is supervised: it seeks directions that maximize between-class scatter relative to within-class scatter, $\max_w \frac{w^\top S_B w}{w^\top S_W w}$.
Solution: leading eigenvector of $S_W^{-1} S_B$.
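A two-class LDA sketch on synthetic Gaussian clusters, computing the leading eigenvector of $S_W^{-1} S_B$ directly (class means and scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))  # class 0
X1 = rng.normal(loc=[2, 1], scale=0.5, size=(100, 2))  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
d = (m1 - m0)[:, None]
S_B = d @ d.T                                            # between-class scatter

vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = vecs[:, np.argmax(vals.real)].real                   # discriminant direction
print(w / np.linalg.norm(w))
```

For two classes this eigenvector is proportional to the closed-form direction $S_W^{-1}(m_1 - m_0)$, since $S_B$ has rank one.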
Nonlinear Methods
Multidimensional Scaling
Embeds points so that pairwise distances in the embedding match a given dissimilarity matrix. Classical MDS double-centers the squared distances and takes the top eigenvectors of the result.
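A classical MDS sketch: starting from only a Euclidean distance matrix, double-centering recovers coordinates that reproduce the original distances (up to rotation and translation).

```python
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(size=(50, 3))                     # ground-truth points (hidden from MDS)
D2 = ((Y[:, None] - Y[None]) ** 2).sum(-1)       # squared pairwise distances

n = len(D2)
J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
B = -0.5 * J @ D2 @ J                            # Gram matrix of centered points
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
Z = vecs[:, order[:3]] * np.sqrt(vals[order[:3]])  # 3-D embedding

# The embedding reproduces the original distances.
D2_hat = ((Z[:, None] - Z[None]) ** 2).sum(-1)
print(np.allclose(D2, D2_hat))
```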
Isomap
Approximates geodesic distances along the data manifold by shortest paths in a nearest-neighbor graph, then applies classical MDS to the resulting distance matrix.
Locally Linear Embedding
Expresses each point as a weighted combination of its nearest neighbors, then finds low-dimensional coordinates that preserve those reconstruction weights.
t-SNE
Converts pairwise distances into neighbor probabilities, Gaussian in the high-dimensional space and Student-t in the embedding, and minimizes the KL divergence between the two distributions.
UMAP
Constructs a fuzzy simplicial complex and optimizes a low-dimensional embedding.
Practical Considerations
Centering is essential for PCA; without it, the leading direction is pulled toward the data mean rather than the direction of maximal variance. Scaling to unit variance is necessary when variables differ in units, since otherwise high-variance variables dominate the components.
Full SVD of an $n \times d$ matrix costs $O(\min(n d^2, n^2 d))$; when only the top $k$ components are needed, truncated and randomized algorithms are substantially cheaper.
PCA allows out-of-sample projection via the learned components; many nonlinear methods, t-SNE in particular, define no natural mapping for new points.
Applications
PCA supports image compression, genomics population analysis, latent semantic analysis, financial factor modeling, and single-cell visualization.
Summary
Dimensionality reduction maps high-dimensional data to lower-dimensional representations while preserving structure. PCA remains the primary linear method due to simplicity and interpretability. Nonlinear methods such as t-SNE and UMAP reveal manifold structure for visualization. The appropriate method depends on the objective, data geometry, and computational constraints.