Gradient Descent is an optimization algorithm used to minimize the cost function by iteratively adjusting the model’s parameters.


Intuition Behind Gradient Descent

When training a model, the goal is to minimize the cost function (or loss function), which measures how far off the model’s predictions are from the actual target values. Gradient Descent finds the direction and step size to adjust the parameters to reduce this loss.

Key Idea

Think of the cost function as a hilly terrain. Gradient Descent helps you “walk down the hill” toward the lowest point (ideally the global minimum; for non-convex cost functions it may settle in a local minimum instead).

1. The Cost Function

The cost function is denoted as $J(\theta)$ and depends on the model parameters $\theta$. Our goal is to minimize $J(\theta)$. For a regression model trained with the mean squared error, it takes the form:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Where:

  • $m$ is the number of training examples.
  • $h_\theta(x^{(i)})$ is the hypothesis (or prediction) for input $x^{(i)}$.
  • $y^{(i)}$ is the actual target value for $x^{(i)}$.
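
As a concrete illustration, here is a minimal NumPy sketch of this cost function. The names `compute_cost`, `X`, `y`, and `theta` are assumptions for the example, and `X` is assumed to already include a bias column of ones.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Mean squared error cost: J(theta) = (1 / 2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)                    # number of training examples
    predictions = X @ theta       # hypothesis h_theta(x) for every example
    errors = predictions - y      # how far each prediction is from its target
    return (1 / (2 * m)) * np.sum(errors ** 2)

# Tiny example: 3 training examples, 2 parameters (bias + one feature)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
print(compute_cost(X, y, np.zeros(2)))   # cost with all-zero parameters
```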

2. The Gradient Descent Algorithm

At each iteration, Gradient Descent updates the parameters using the formula:

$$\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)$$

Where:

  • $\alpha$ is the learning rate, controlling the step size.
  • $\frac{\partial}{\partial \theta_j} J(\theta)$ is the gradient of the cost function with respect to $\theta_j$.
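
Translated into code, the update rule looks roughly like the sketch below. It reuses the hypothetical MSE setup from section 1 (design matrix `X` with a bias column, targets `y`); `alpha` and `num_iters` are illustrative defaults, not prescribed values.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, num_iters=1000):
    """Repeatedly apply theta := theta - alpha * dJ/dtheta for the MSE cost."""
    m = len(y)
    for _ in range(num_iters):
        gradient = (1 / m) * X.T @ (X @ theta - y)  # gradient of J(theta)
        theta = theta - alpha * gradient            # simultaneous update of all parameters
    return theta
```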

3. Choosing the Learning Rate

The learning rate is crucial. If it’s too small, the algorithm will take a long time to converge. If it’s too large, the algorithm may overshoot the minimum and oscillate or even diverge.
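
A quick way to see this trade-off is to run the update rule on an assumed one-dimensional cost $J(\theta) = \theta^2$ (whose gradient is $2\theta$) with different learning rates; the specific values below are only illustrative.

```python
def run(alpha, theta=5.0, steps=20):
    """Minimize J(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.001))  # too small: barely moves from the starting point
print(run(alpha=0.1))    # reasonable: close to the minimum at 0 after 20 steps
print(run(alpha=1.1))    # too large: overshoots and grows in magnitude each step
```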

Tip: Visualizing the Gradient

At each step, Gradient Descent moves in the direction of the steepest descent based on the slope (gradient). The size of the steps is determined by the learning rate.

4. Convergence

Gradient Descent converges when the updates to $\theta$ become very small, i.e., when the change in the cost function is below a certain threshold.

How do we know if the model converged?

If $|J(\theta^{(t)}) - J(\theta^{(t+1)})| < \epsilon$, where $\epsilon$ is a very small number (like $10^{-6}$), we can consider the model to have converged.
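
In practice, this check becomes an early-stopping condition inside the training loop. The sketch below assumes the hypothetical `compute_cost` helper from section 1 and uses an illustrative tolerance.

```python
import numpy as np

def gradient_descent_until_converged(X, y, theta, alpha=0.01,
                                     epsilon=1e-6, max_iters=10_000):
    """Stop once the change in the cost function drops below epsilon."""
    m = len(y)
    prev_cost = compute_cost(X, y, theta)
    for _ in range(max_iters):
        theta = theta - alpha * (1 / m) * X.T @ (X @ theta - y)
        cost = compute_cost(X, y, theta)
        if abs(prev_cost - cost) < epsilon:   # |J_t - J_{t+1}| < epsilon
            break
        prev_cost = cost
    return theta
```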


Types of Gradient Descent

  1. Batch Gradient Descent: Uses the entire dataset to compute the gradient at each step.
  2. Stochastic Gradient Descent (SGD): Updates the parameters based on a single training example per step.
  3. Mini-batch Gradient Descent: A mix of both, using small random subsets (mini-batches) of the data for updates.
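
The three variants differ only in how many examples feed each gradient computation. Below is a hedged mini-batch sketch for the MSE setup assumed earlier; setting `batch_size=1` recovers SGD, while `batch_size=len(y)` recovers Batch Gradient Descent. The parameter values are illustrative.

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, alpha=0.01, batch_size=32, epochs=10):
    """Update theta on shuffled mini-batches of the training data."""
    m = len(y)
    for _ in range(epochs):
        order = np.random.permutation(m)          # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            gradient = (1 / len(batch)) * X_b.T @ (X_b @ theta - y_b)
            theta = theta - alpha * gradient
    return theta
```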

Additional Reading

  • Explore different variants of Gradient Descent, such as Momentum, RMSProp, and Adam, to improve convergence speed and stability.