Gradient Descent
Subtopic: Methods
Gradient descent finds minima by repeatedly stepping in the direction of steepest descent—opposite the gradient. It's the workhorse of machine learning, training everything from linear regression to deep neural networks. Simple to implement, it scales to millions of parameters.
Introduction
Imagine standing on a hilly landscape in fog. You want to reach the lowest point but can only feel the slope at your feet. The obvious strategy: step downhill, repeat. That's gradient descent.
The gradient tells you which direction is steepest uphill. Go the opposite way to descend fastest.
The Algorithm
To minimize a differentiable function f(x), start from an initial point x₀ and iterate:
xₜ₊₁ = xₜ − α∇f(xₜ)
where α > 0 is the learning rate (step size) and ∇f(xₜ) is the gradient of f at the current point.
In words: new position = old position − (step size) × (gradient)
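The update rule above can be sketched in a few lines. This is a minimal illustration, not a production optimizer; the function names and the example f(x) = (x − 3)² are assumptions for demonstration.

```python
# Minimal gradient descent: repeatedly apply x <- x - alpha * grad(x).

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Step opposite the gradient until `steps` iterations are done."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Illustrative example: minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # converges very close to 3.0
```

Note that the loop stops after a fixed number of steps; real implementations usually also stop early when the gradient norm falls below a tolerance.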
Why It Works
The gradient ∇f(x) points in the direction of steepest ascent, so −∇f(x) is the direction of steepest descent.
For small enough α, each step decreases the function value. A first-order Taylor expansion shows:
f(x − α∇f(x)) ≈ f(x) − α‖∇f(x)‖² < f(x)  whenever ∇f(x) ≠ 0.
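The Taylor argument can be checked numerically: for a small α, the actual decrease in f should nearly match the predicted α‖∇f(x)‖². The function below is an illustrative choice, not one from the text.

```python
# Numerical check of the first-order Taylor argument on f(x) = x^4 + 2x^2
# (an assumed example), whose gradient is 4x^3 + 4x.

def f(x):
    return x ** 4 + 2 * x ** 2

def grad_f(x):
    return 4 * x ** 3 + 4 * x

x, alpha = 1.5, 1e-4
g = grad_f(x)
decrease_actual = f(x) - f(x - alpha * g)   # how much one step lowers f
decrease_predicted = alpha * g ** 2         # first-order Taylor prediction
print(decrease_actual, decrease_predicted)  # nearly equal for small alpha
```

The two numbers differ only by a second-order term, which shrinks faster than α.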
Learning Rate
The learning rate α is crucial:
• Too small: convergence is painfully slow
• Too large: overshoots, oscillates, or diverges
• Just right: converges quickly and stably
Finding a good α usually takes experimentation: try values spaced by powers of ten, or use a schedule that decays α as training progresses.
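The three regimes above are easy to see on a quadratic. The specific α values below are illustrative assumptions; on f(x) = x² (gradient 2x), each step multiplies x by (1 − 2α).

```python
# Compare learning rates on f(x) = x^2, starting from x0 = 1.0.

def run(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # update: x <- x - alpha * f'(x)
    return x

for alpha in (0.01, 0.4, 1.1):
    print(alpha, run(alpha))
# 0.01: still far from 0 (too small)
# 0.4:  essentially 0 (works well)
# 1.1:  |x| has grown (diverges)
```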
Worked Example
Minimize f(x) = x², starting from x₀ = 10 with α = 0.1.
The gradient is f′(x) = 2x, so each update is xₜ₊₁ = xₜ − 0.1 · 2xₜ = 0.8xₜ.
Iteration 1: x₁ = 10 − 0.1 · 20 = 8
Iteration 2: x₂ = 8 − 0.1 · 16 = 6.4
Iteration 3: x₃ = 6.4 − 0.1 · 12.8 = 5.12
At each step, x shrinks by a factor of 0.8.
The algorithm converges to the minimum at x = 0.
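A worked example like this can be reproduced in a short loop. As an illustration, assume f(x) = x² with x₀ = 10 and α = 0.1 (assumed values), so the gradient is 2x:

```python
# Three gradient descent iterations on f(x) = x^2 (assumed example values).

x, alpha = 10.0, 0.1
for t in range(1, 4):
    x = x - alpha * 2 * x
    print(f"Iteration {t}: x = {x}")
# Each step multiplies x by 0.8, giving 8, 6.4, 5.12 (up to float rounding).
```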
Convergence
For convex functions with a Lipschitz-continuous gradient, gradient descent with a suitable fixed step size converges at a rate of O(1/t) in function value.
For strongly convex functions, convergence is linear (exponentially fast).
For non-convex functions (deep learning), convergence to local minima is common but global optimum isn't guaranteed.
Variants
Stochastic Gradient Descent (SGD)
Use gradient from a random subset (mini-batch) of data. Much faster per iteration for large datasets.
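A minimal sketch of the mini-batch idea, assuming a toy least-squares problem (estimating the mean of a dataset); the batch size, step count, and variable names are all illustrative choices:

```python
import random

# Mini-batch SGD: each step uses the gradient of the loss on a random subset
# of the data, not the full dataset. Toy problem: minimize sum((x - d)^2),
# whose minimizer is the data mean.

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(1000)]

x, alpha, batch_size = 0.0, 0.05, 32
for step in range(500):
    batch = random.sample(data, batch_size)
    grad = sum(2 * (x - d) for d in batch) / batch_size  # gradient on batch only
    x = x - alpha * grad
print(x)  # close to the data mean (about 5)
```

Each iteration touches 32 points instead of 1000, which is where the per-iteration speedup comes from; the noise from sampling makes the trajectory jitter around the minimum.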
Momentum
Accumulate velocity from past gradients. Helps escape plateaus and dampen oscillations.
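A common form of the momentum update, sketched with illustrative hyperparameters (β = 0.9 is a conventional default, not a value from the text):

```python
# Momentum: velocity v accumulates past gradients,
#   v <- beta * v - alpha * grad(x),  then  x <- x + v.

def momentum_descent(grad, x0, alpha=0.1, beta=0.9, steps=200):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(x)  # decayed history plus current step
        x = x + v
    return x

# Illustrative example: minimize f(x) = x^2 (gradient 2x).
print(momentum_descent(lambda x: 2 * x, x0=10.0))  # near 0
```

Because consecutive gradients in a consistent direction reinforce v, the method builds speed across plateaus, while oscillating gradients partially cancel.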
Adam
Adaptive learning rates per parameter, combining momentum with gradient scaling. The default choice in deep learning.
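The Adam update can be sketched as follows, using the standard default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8 — conventional values, assumed here rather than taken from the text):

```python
import math

# Adam: momentum on the gradient (m) plus a running scale of squared
# gradients (v), with bias correction for the zero-initialized averages.

def adam(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (scale)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Illustrative example: minimize f(x) = x^2 (gradient 2x).
print(adam(lambda x: 2 * x, x0=10.0))  # much closer to the minimum at 0
```

Dividing by √v̂ gives each parameter its own effective step size, which is what "adaptive learning rates per parameter" refers to.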
Challenges
• Local minima: May converge to a non-global minimum (for non-convex f)
• Saddle points: Gradients vanish but it's not a minimum
• Ill-conditioning: Elongated valleys cause zigzagging
• Plateaus: Near-zero gradients slow convergence
In Machine Learning
Training a model = minimizing a loss function. For neural networks, the gradient is computed via backpropagation.
One "epoch" = one pass through all training data. Training typically runs for many epochs.
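An epoch-based training loop can be sketched on a toy model. Everything here is illustrative: the synthetic data (y = 3x + 1), the squared loss, and the hand-derived gradients standing in for backpropagation.

```python
# Epoch-based training: fit y = w*x + b by minimizing mean squared error.
# Gradients are computed analytically (a stand-in for backpropagation).

data = [(x, 3.0 * x + 1.0) for x in [i / 10 for i in range(-20, 21)]]

w, b, alpha = 0.0, 0.0, 0.05
for epoch in range(200):                      # one epoch = one pass over data
    grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y                 # prediction error
        grad_w += 2 * err * x / len(data)     # d(loss)/dw
        grad_b += 2 * err / len(data)         # d(loss)/db
    w, b = w - alpha * grad_w, b - alpha * grad_b
print(w, b)  # approaches w = 3, b = 1
```

For a real neural network, the inner loop would be replaced by a forward pass, a loss evaluation, and a backward pass that computes the same kind of gradients automatically.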
Applications
Training neural networks, logistic regression, linear regression, matrix factorization, and essentially any differentiable optimization problem.
Summary
Gradient descent updates parameters by stepping opposite the gradient, x ← x − α∇f(x). With a well-chosen learning rate it reliably minimizes differentiable functions, and its variants — SGD, momentum, and Adam — power nearly all of modern machine learning.