Gradient Descent
Subtopic: Methods
Gradient descent finds minima by repeatedly stepping in the direction of steepest descent—opposite the gradient. It's the workhorse of machine learning, training everything from linear regression to deep neural networks. Simple to implement, it scales to millions of parameters.
Introduction
Imagine standing on a hilly landscape in fog. You want to reach the lowest point but can only feel the slope at your feet. The obvious strategy: step downhill, repeat. That's gradient descent.
The gradient tells you which direction is steepest uphill. Go the opposite way to descend fastest.
The Algorithm
To minimize a differentiable function f(x), start from an initial point x₀ and iterate:
xₜ₊₁ = xₜ − α∇f(xₜ)
where α > 0 is the learning rate (step size) and ∇f(xₜ) is the gradient of f at the current point.
In words: new position = old position − (step size) × (gradient)
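The update rule above can be sketched in a few lines. This is a minimal illustration, not a production optimizer; the function names and the example f(x) = (x − 3)² are assumptions for demonstration.

```python
# Minimal gradient descent: repeatedly apply x <- x - alpha * grad(x).

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Step opposite the gradient until `steps` iterations are done."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Illustrative example: minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges very close to 3.0
```

Note that the loop stops after a fixed number of steps; real implementations usually also stop early when the gradient norm falls below a tolerance.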
Why It Works
The gradient ∇f(x) points in the direction of steepest ascent, so −∇f(x) is the direction of steepest descent.
For small enough α, each step decreases the function value. A first-order Taylor expansion shows:
f(x − α∇f(x)) ≈ f(x) − α‖∇f(x)‖² < f(x)  whenever ∇f(x) ≠ 0.
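The Taylor argument can be checked numerically: for a small α, the actual decrease in f should nearly match the predicted α‖∇f(x)‖². The function below is an illustrative choice, not one from the text.

```python
# Numerical check of the first-order Taylor argument on f(x) = x^4 + 2x^2
# (an assumed example), whose gradient is 4x^3 + 4x.

def f(x):
    return x ** 4 + 2 * x ** 2

def grad_f(x):
    return 4 * x ** 3 + 4 * x

x, alpha = 1.5, 1e-4
g = grad_f(x)
decrease_actual = f(x) - f(x - alpha * g)   # how much one step lowers f
decrease_predicted = alpha * g ** 2         # first-order Taylor prediction
print(decrease_actual, decrease_predicted)  # nearly equal for small alpha
```

The two numbers differ only by a second-order term, which shrinks faster than α.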
Learning Rate
The learning rate α is crucial:
• Too small: convergence is painfully slow
• Too large: overshoots, oscillates, or diverges
• Just right: converges quickly and stably
Finding a good α usually takes experimentation: try values spaced by powers of ten, or use a schedule that decays α as training progresses.
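The three regimes above are easy to see on a quadratic. The specific α values below are illustrative assumptions; on f(x) = x² (gradient 2x), each step multiplies x by (1 − 2α).

```python
# Compare learning rates on f(x) = x^2, starting from x0 = 1.0.

def run(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # update: x <- x - alpha * f'(x)
    return x

for alpha in (0.01, 0.4, 1.1):
    print(alpha, run(alpha))
# 0.01: still far from 0 (too small)
# 0.4:  essentially 0 (works well)
# 1.1:  |x| has grown (diverges)
```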
Worked Example
Minimize f(x) = x², starting from x₀ = 10 with α = 0.1.
The gradient is f′(x) = 2x, so each update is xₜ₊₁ = xₜ − 0.1 · 2xₜ = 0.8xₜ.
Iteration 1: x₁ = 10 − 0.1 · 20 = 8
Iteration 2: x₂ = 8 − 0.1 · 16 = 6.4
Iteration 3: x₃ = 6.4 − 0.1 · 12.8 = 5.12
At each step, x shrinks by a factor of 0.8.
The algorithm converges to the minimum at x = 0.
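A worked example like this can be reproduced in a short loop. As an illustration, assume f(x) = x² with x₀ = 10 and α = 0.1 (assumed values), so the gradient is 2x:

```python
# Three gradient descent iterations on f(x) = x^2 (assumed example values).

x, alpha = 10.0, 0.1
for t in range(1, 4):
    x = x - alpha * 2 * x
    print(f"Iteration {t}: x = {x}")
# Each step multiplies x by 0.8, giving 8, 6.4, 5.12 (up to float rounding).
```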
Convergence
For convex functions with a Lipschitz-continuous gradient, gradient descent with a suitable fixed step size converges at a rate of O(1/t) in function value.
For strongly convex functions, convergence is linear (exponentially fast).
For non-convex functions (deep learning), convergence to local minima is common but global optimum isn't guaranteed.
Variants
Stochastic Gradient Descent (SGD)
Use gradient from a random subset (mini-batch) of data. Much faster per iteration for large datasets.
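A minimal sketch of the mini-batch idea, assuming a toy least-squares problem (estimating the mean of a dataset); the batch size, step count, and variable names are all illustrative choices:

```python
import random

# Mini-batch SGD: each step uses the gradient of the loss on a random subset
# of the data, not the full dataset. Toy problem: minimize sum((x - d)^2),
# whose minimizer is the data mean.

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(1000)]

x, alpha, batch_size = 0.0, 0.05, 32
for step in range(500):
    batch = random.sample(data, batch_size)
    grad = sum(2 * (x - d) for d in batch) / batch_size  # gradient on batch only
    x = x - alpha * grad
print(x)  # close to the data mean (about 5)
```

Each iteration touches 32 points instead of 1000, which is where the per-iteration speedup comes from; the noise from sampling makes the trajectory jitter around the minimum.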
Momentum
Accumulate velocity from past gradients. Helps escape plateaus and dampen oscillations.
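A common form of the momentum update, sketched with illustrative hyperparameters (β = 0.9 is a conventional default, not a value from the text):

```python
# Momentum: velocity v accumulates past gradients,
#   v <- beta * v - alpha * grad(x),  then  x <- x + v.

def momentum_descent(grad, x0, alpha=0.1, beta=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(x)  # decayed history plus current step
        x = x + v
    return x

# Illustrative example: minimize f(x) = x^2 (gradient 2x).
print(momentum_descent(lambda x: 2 * x, x0=10.0))  # near 0
```

Because consecutive gradients in a consistent direction reinforce v, the method builds speed across plateaus, while oscillating gradients partially cancel.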
Adam
Adaptive learning rates per parameter, combining momentum with gradient scaling. The default choice in deep learning.
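The Adam update can be sketched as follows, using the standard default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8 — conventional values, assumed here rather than taken from the text):

```python
import math

# Adam: momentum on the gradient (m) plus a running scale of squared
# gradients (v), with bias correction for the zero-initialized averages.

def adam(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (scale)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Illustrative example: minimize f(x) = x^2 (gradient 2x).
print(adam(lambda x: 2 * x, x0=10.0))  # much closer to the minimum at 0
```

Dividing by √v̂ gives each parameter its own effective step size, which is what "adaptive learning rates per parameter" refers to.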
Challenges
• Local minima: May converge to a non-global minimum (for non-convex f)
• Saddle points: Gradients vanish but it's not a minimum
• Ill-conditioning: Elongated valleys cause zigzagging
• Plateaus: Near-zero gradients slow convergence
In Machine Learning
Training a model = minimizing a loss function. For neural networks, the gradient is computed via backpropagation.
One "epoch" = one pass through all training data. Training typically runs for many epochs.
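An epoch-based training loop can be sketched on a toy model. Everything here is illustrative: the synthetic data (y = 3x + 1), the squared loss, and the hand-derived gradients standing in for backpropagation.

```python
# Epoch-based training: fit y = w*x + b by minimizing mean squared error.
# Gradients are computed analytically (a stand-in for backpropagation).

data = [(x, 3.0 * x + 1.0) for x in [i / 10 for i in range(-20, 21)]]

w, b, alpha = 0.0, 0.0, 0.05
for epoch in range(200):                      # one epoch = one pass over data
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y                 # prediction error
        grad_w += 2 * err * x / len(data)     # d(loss)/dw
        grad_b += 2 * err / len(data)         # d(loss)/db
    w, b = w - alpha * grad_w, b - alpha * grad_b
print(w, b)  # approaches w = 3, b = 1
```

For a real neural network, the inner loop would be replaced by a forward pass, a loss evaluation, and a backward pass that computes the same kind of gradients automatically.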
Applications
Training neural networks, logistic regression, linear regression, matrix factorization, and essentially any differentiable optimization problem.
Summary
Gradient descent updates parameters by stepping opposite the gradient, x ← x − α∇f(x). With a well-chosen learning rate it reliably minimizes differentiable functions, and its variants — SGD, momentum, and Adam — power nearly all of modern machine learning.