Gradient Descent

Gradient Descent: A powerful optimization algorithm that minimizes the cost function by updating model parameters step by step in the direction of the steepest descent.

Gradient descent is an optimization algorithm used to iteratively find the parameters of a model that minimize the cost function (also called the loss function). It is a mathematical technique for finding the parameters (weights and bias) that give the model the lowest cost.

Gradient descent is the algorithm used to train the linear regression model. Training is initialized with random parameters (weights and bias).

How Gradient Descent works

Gradient descent is an iterative algorithm that follows these steps:

  1. Initialize Parameters: Start the process with any suitable random initial values for the model's parameters (weights and bias).

  2. Estimate Cost: Calculate the cost of the current model using the current parameters (weights and bias).

  3. Estimate Gradient: Compute the gradient of the cost function with respect to each parameter (weights and bias). This tells us the slope of the cost function at our current position for each parameter.

  4. Update Parameters: Move each parameter by subtracting the learning rate times its corresponding gradient.

  5. Repeat: Go back to step 2 and repeat the process until the cost function converges.

The Math of Gradient descent

Gradient descent repeatedly updates the weight and bias until convergence. It is described as:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{1} \; \newline b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \rbrace\end{align*}$$

Gradient descent estimates the weight $w$ and bias $b$ by partially differentiating the cost function. Suppose we use the Mean Squared Error (MSE) as our cost function; then we need to differentiate the cost function $J(w,b)$ with respect to both parameters.

The gradients are given by:

$$\frac{\partial J(w,b)}{\partial w} \quad \text{and} \quad \frac{\partial J(w,b)}{\partial b}$$

These are the partial derivatives of the cost function with respect to the weight and bias, respectively. They represent the direction and magnitude of change in the cost function for small changes in each parameter. Gradient descent uses these gradients to update $w$ and $b$ iteratively in the direction that reduces the cost function the most.

When defining the Mean Squared Error (MSE) cost function, you'll often see it expressed as the mean of the squared errors, as in equations 2(a) and 2(b). However, in machine learning it's common practice to divide by 2m instead of just m. This simplifies the derivative of the cost function, because the 2 produced by differentiating the squared term cancels the 2 in the denominator during gradient descent, as represented in equations 2(c) and 2(d).

$$\displaystyle \text{MSE} = J(w,b) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 \tag{2a}$$

$$\displaystyle \text{MSE} = J(w,b) = \frac{1}{m} \sum_{i=1}^{m} cost_{SE}^{(i)} \tag{2b}$$

For a more convenient simplification, the MSE becomes:

$$\displaystyle \text{MSE} = J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2 \tag{2c}$$

$$\displaystyle \text{MSE} = J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} cost_{SE}^{(i)} \tag{2d}$$

We take the partial derivative of $J(w,b)$ with respect to $w$:

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{2m} \sum_{i=1}^{m} 2(\hat{y}^{(i)} - y^{(i)}) \, x^{(i)} \tag{3a}$$

Equation (3a) shows why the $\frac{1}{2m}$ convention is useful: the constant 2 in the numerator cancels the 1/2 in front. This simplifies the gradient to:

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \, x^{(i)} \tag{3b}$$

We take the partial derivative of $J(w,b)$ with respect to $b$:

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{2m} \sum_{i=1}^{m} 2(\hat{y}^{(i)} - y^{(i)}) \tag{4a}$$

As in the previous case, the 2 cancels with the 1/2, and we get:

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \tag{4b}$$

These simplified gradient equations, (3b) and (4b), are what gradient descent uses to update the parameters $w$ and $b$ iteratively.
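As a quick sanity check on the derivation, the analytic gradients in (3b) and (4b) can be compared against central finite-difference derivatives of the cost in (2c). Below is a minimal sketch; the function names and the example data are illustrative, not part of the article's later implementation:

```python
import numpy as np

def cost(x, y, w, b):
    """MSE cost with the 1/(2m) convention, as in eq. (2c)."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

def analytic_gradients(x, y, w, b):
    """Closed-form gradients from eqs. (3b) and (4b)."""
    m = x.shape[0]
    err = w * x + b - y
    return np.sum(err * x) / m, np.sum(err) / m

# Central finite differences should agree with the analytic gradients.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b, eps = 0.5, 0.1, 1e-6

dj_dw, dj_db = analytic_gradients(x, y, w, b)
num_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
num_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
```

If the algebra above is right, `dj_dw` and `num_dw` (and likewise `dj_db` and `num_db`) should match to within the finite-difference error.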

Where:

  • $w$ is the weight parameter (slope), which determines how strongly the input $x$ influences the output.

  • $b$ is the bias parameter (intercept), which allows the model to fit the data even when the optimal output is not centered at the origin.

  • $\alpha$ is the learning rate, a small positive scalar that controls the step size in each iteration of gradient descent.

  • $J(w, b)$ is the cost function (or loss function), which measures how far off the model’s predictions are from the actual target values.

  • $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ are the partial derivatives (gradients) of the cost function with respect to the parameters $w$ and $b$, respectively.

  • $m$ is the number of training examples.

  • $x^{(i)}$ is the input feature of the $i$-th training example.

  • $y^{(i)}$ is the corresponding true output (label) of the $i$-th training example.

  • $\hat{y}^{(i)}$ is the model’s prediction for input $x^{(i)}$, typically defined as $\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$.

Implementation of Gradient Descent

In order to effectively implement the gradient descent algorithm, we need both the cost function and the gradients. We start by implementing the Mean Squared Error (MSE) cost function, then derive the gradients for $w$ and $b$. Finally, we use both the cost function and the gradients to develop the gradient descent algorithm.

MSE Cost function

Let's see how to implement the Mean Squared Error (MSE) in Python from scratch.

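A minimal sketch of the MSE cost, assuming NumPy arrays `x` and `y` of the same length and the $\frac{1}{2m}$ convention of equation (2c) (the function name `compute_cost` is a common choice, not mandated by anything above):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Mean Squared Error cost J(w, b) with the 1/(2m) convention (eq. 2c)."""
    m = x.shape[0]
    y_hat = w * x + b                        # model predictions f_{w,b}(x)
    return np.sum((y_hat - y) ** 2) / (2 * m)
```

For a perfect fit the cost is zero; any prediction error makes it strictly positive.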

Gradients $\frac{\partial J(w,b)}{\partial w}$,$\frac{\partial J(w,b)}{\partial b}$

Here we implement the compute_gradient function, which returns $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$.

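A sketch of `compute_gradient`, translating equations (3b) and (4b) directly, again assuming NumPy arrays `x` and `y`:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Gradients of the MSE cost: eqs. (3b) and (4b)."""
    m = x.shape[0]
    err = (w * x + b) - y                    # (y_hat - y) for every example
    dj_dw = np.sum(err * x) / m              # dJ/dw, eq. (3b)
    dj_db = np.sum(err) / m                  # dJ/db, eq. (4b)
    return dj_dw, dj_db
```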

Gradient Descent algorithm

Now that we have the cost function and the gradients, let's develop the gradient descent algorithm.

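A sketch of the full loop, matching the description that follows: the cost and gradient functions are passed in as arguments, `w` is deep-copied, and the cost and parameters are recorded at every iteration (the exact signature is an assumption):

```python
import copy

def gradient_descent(x, y, w_init, b_init, alpha, num_iters,
                     cost_function, gradient_function):
    """Run batch gradient descent, recording cost and parameters per step."""
    w = copy.deepcopy(w_init)   # avoid mutating the caller's initial value
    b = b_init
    cost_history, param_history = [], []

    for _ in range(num_iters):
        # Gradients point uphill, so step in the opposite direction.
        dj_dw, dj_db = gradient_function(x, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        cost_history.append(cost_function(x, y, w, b))
        param_history.append((w, b))

    return w, b, cost_history, param_history
```

Called with an MSE cost function and its gradients on data generated by $y = 2x$, this should drive $w$ toward 2 and $b$ toward 0 for a suitably small `alpha`.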

In this implementation of gradient descent, we first initialize the model parameters $w$ and $b$, ensuring $w$ is deep-copied to avoid unintended side effects. We also prepare two lists to keep track of the cost at each iteration (cost_history) and the parameter values (param_history).

Then, for a specified number of iterations, we compute the gradients of the cost function with respect to $w$ and $b$ using the provided gradient_function. These gradients indicate the direction of steepest ascent, so we subtract a fraction of them (controlled by the learning rate alpha) from the current parameters to move in the direction that minimizes the cost.

After updating the parameters, we compute and record the new cost and parameter values for monitoring. Finally, the function returns the optimized parameters along with the recorded history of costs and parameters, which can be used for analysis or visualization of the training process.
