Cost Function in Linear Regression
The cost measures the difference between a model's predicted values and the actual values at each data point. It is a single number that represents how inaccurate the model's predictions are. In simpler terms, it is the penalty the model pays each time it makes an incorrect prediction: the greater the difference between the predicted and actual values, the higher the penalty.
From the image above, the arrows pointing from the data points to the best-fit line represent the prediction error: the difference between the actual value $y^{(i)}$ and the predicted value $\hat{y}^{(i)}$. This difference is often referred to as the residual or error for that specific data point.
To measure how large this error is numerically, we use a cost function (also called a loss function). Let's explore the different cost functions.
The Cost Function Equations
The general representation of the cost function is the same for both single-variable linear regression and multivariable linear regression. The main difference lies in the form of the input data. In single-variable linear regression, the inputs are typically scalar values, that is, one-dimensional (1D) data points.
In the case of multivariable linear regression, however, the inputs are vectors, where each data point contains multiple features or dimensions. As a result, the hypothesis function and cost function are adapted to handle vector operations using linear algebra. Despite this change in representation, the underlying idea of minimizing the difference between predicted and actual values remains exactly the same.
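As an illustrative sketch in plain Python (the function names here are hypothetical, not from any library), the hypothesis differs between the two cases only in how the weighted sum over the input is formed:

```python
def predict_single(b, w, x):
    """Single-variable hypothesis: y_hat = b + w * x for a scalar input x."""
    return b + w * x

def predict_multi(b, w, x):
    """Multivariable hypothesis: y_hat = b + w . x for a feature vector x.

    w and x are equal-length lists; the dot product replaces the scalar product.
    """
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))

print(predict_single(2.0, 3.0, 4.0))               # 2 + 3*4 = 14.0
print(predict_multi(2.0, [3.0, 1.0], [4.0, 5.0]))  # 2 + 3*4 + 1*5 = 19.0
```

Either way, the cost function below only needs the resulting predictions, so its form is unchanged.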
Types of Cost Functions in Linear Regression
In linear regression, four main types of cost function formulas are used to evaluate a model's performance. The choice of which cost function to use often depends on the specific problem and dataset. Let's explore these four main types.
Mean Squared Error (MSE)
The Mean Squared Error (MSE) is the average of the squared differences between the predicted values $\hat{y}$ and the actual values $y$. Essentially, it's the Sum of Squared Errors (SSE) divided by the total number of data points $m$. Let's explore the MSE.
To measure how large this error is numerically for a single training example, we use a cost term which is the square of the difference between the predicted and actual value for that example:
$$cost_{SE}^{(i)} = ( \hat{y}^{(i)} - y^{(i)})^2 \quad \quad \quad (1)$$

This squared term penalizes larger errors more heavily and ensures all error contributions are positive. Recall that $\hat{y}^{(i)}$ is the predicted value from the model (e.g., a best-fit line), which for a simple linear model is computed as:
$$\hat{y}^{(i)} = b + wx^{(i)} \quad \quad \quad (2)$$

So, while the arrows on the graph visually represent the raw difference $\hat{y}^{(i)} - y^{(i)}$, the cost function for MSE uses the square of this difference to quantify the magnitude of the error for that example.
The overall MSE for the entire dataset is then the average of these squared errors:
$$\displaystyle \text{MSE} = J(\mathbf{w},b) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 \quad \quad \quad (3a)$$

$$\displaystyle \text{MSE} = J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} cost_{SE}^{(i)} \quad \quad \quad (3b)$$

Where:
$m$ : Total number of data points
$y_i$ : Actual (true) value for the $i$-th data point
$\hat{y}_i$ : Predicted value for the $i$-th data point
$(\hat{y}_i - y_i)^2$ : Squared error for the $i$-th data point
Implementation of MSE in Python
Let's see how to implement the Mean Squared Error (MSE) in Python from scratch.
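Below is a minimal from-scratch sketch in plain Python (no NumPy assumed); the `mse` function and the toy data are illustrative:

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals, as in Equation (3a)."""
    m = len(y_true)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(y_pred, y_true)) / m

# Toy example: residuals are -0.5, 0.0, 1.0
y_actual = [3.0, 5.0, 7.0]
y_hat = [2.5, 5.0, 8.0]
print(mse(y_actual, y_hat))  # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167
```

Note how the single residual of 1.0 contributes four times as much as the 0.5 residual: squaring weights larger errors more heavily.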
Sum of Absolute Errors (SAE)
The Sum of Absolute Errors (SAE) measures the total sum of the absolute differences between the predicted values $\hat{y}$ and the actual values $y$.
For a single training example, we can define an "absolute error cost," $cost_{AE}^{(i)}$:
$$cost_{AE}^{(i)} = |\hat{y}^{(i)} - y^{(i)}|$$

The SAE is the sum of these individual absolute error costs over all $m$ data points:

$$\displaystyle \text{SAE} = J(\mathbf{w},b) = \sum_{i=1}^{m} cost_{AE}^{(i)}$$

This gives the standard formula for SAE:

$$\displaystyle \text{SAE} = J(\mathbf{w},b) = \sum_{i=1}^{m} \left| \hat{y}_i - y_i \right|$$

Where:
$m$ : Total number of data points
$y_i$ : Actual (true) value for the $i$-th data point
$\hat{y}_i$ : Predicted value for the $i$-th data point
$\left| \hat{y}_i - y_i \right|$ : Absolute error for the $i$-th data point
Implementation of SAE in Python
Let's see how to implement the Sum of Absolute Errors (SAE) in Python from scratch.
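A minimal from-scratch sketch in plain Python (the `sae` function and the toy data are illustrative):

```python
def sae(y_true, y_pred):
    """Sum of Absolute Errors: total of |y_hat - y| over all data points."""
    return sum(abs(y_hat - y) for y_hat, y in zip(y_pred, y_true))

# Toy example: residuals are -0.5, 0.0, 1.0
print(sae([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # 0.5 + 0.0 + 1.0 = 1.5
```

Unlike the squared-error costs, the absolute value weights every unit of error equally, regardless of its size.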
Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) is the average of the absolute differences between the predicted values $\hat{y}$ and the actual values $y$. Essentially, it's the Sum of Absolute Errors (SAE) divided by the total number of data points $m$.
Using the individual absolute error $cost_{AE}^{(i)} = |\hat{y}^{(i)} - y^{(i)}|$, the MAE is the average of these costs:
$$\displaystyle \text{MAE} = J(\mathbf{w},b) = \frac{1}{m} \sum_{i=1}^{m} cost_{AE}^{(i)}$$

This expands to the common MAE formula:

$$\displaystyle \text{MAE} = J(\mathbf{w},b) = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}_i - y_i \right|$$

Where:
$m$ : Total number of data points
$y_i$ : Actual (true) value for the $i$-th data point
$\hat{y}_i$ : Predicted value for the $i$-th data point
$\left| \hat{y}_i - y_i \right|$ : Absolute error for the $i$-th data point
The MAE handles outliers better than MSE because it doesn't square errors. However, it's not differentiable at zero ($\hat{y} = y$), which can make gradient descent a bit trickier, though usually manageable.
Implementation of MAE in Python
Let's see how to implement the Mean Absolute Error (MAE) in Python from scratch.
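A minimal from-scratch sketch in plain Python (the `mae` function and the toy data are illustrative):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: the SAE divided by the number of data points m."""
    m = len(y_true)
    return sum(abs(y_hat - y) for y_hat, y in zip(y_pred, y_true)) / m

# Toy example: residuals are -0.5, 0.0, 1.0
print(mae([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```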
Sum of Squared Errors (SSE)
The Sum of Squared Errors (SSE) is the sum of the squared differences between the predicted values $\hat{y}$ and the actual values $y$. It quantifies the total squared deviation of the predictions from the true outcomes.
Similar to MSE, we can consider the individual squared error for each data point, $cost_{SE}^{(i)} = (\hat{y}^{(i)} - y^{(i)})^2$, as defined in equation (1). The SSE is then the sum of these individual squared error costs:
$$\displaystyle \text{SSE} = J(\mathbf{w},b) = \sum_{i=1}^{m} cost_{SE}^{(i)}$$

This leads directly to its standard formula:

$$\displaystyle \text{SSE} = J(\mathbf{w},b) = \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2$$

Where:
$m$ : Total number of data points
$y_i$ : Actual (true) value for the $i$-th data point
$\hat{y}_i$ : Predicted value for the $i$-th data point
$(\hat{y}_i - y_i)^2$ : Squared error for the $i$-th data point
Implementation of SSE in Python
Let's see how to implement the Sum of Squared Errors (SSE) in Python from scratch.
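A minimal from-scratch sketch in plain Python (the `sse` function and the toy data are illustrative):

```python
def sse(y_true, y_pred):
    """Sum of Squared Errors: total of (y_hat - y)^2 over all data points."""
    return sum((y_hat - y) ** 2 for y_hat, y in zip(y_pred, y_true))

# Toy example: residuals are -0.5, 0.0, 1.0
print(sse([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))  # 0.25 + 0.0 + 1.0 = 1.25
```

Dividing this total by the number of data points $m$ would give the MSE from earlier.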
Solved Examples
The application of a cost function equation to single-variable versus multi-variable problems primarily depends on the data's dimensionality.
Let's now estimate the cost using the single-variable dataset presented in Figure 1.0 above, so we can investigate how the different cost functions behave. This data shows business revenue versus total customers over a 10-day period.
Day | Total Customers | Total Revenue ($) → $y^{(i)}$ | Predicted Revenue ($) → $\hat{y}^{(i)}$ |
---|---|---|---|
1 | 180 | 1820 | 1933.27 |
2 | 220 | 2300 | 2275.21 |
3 | 260 | 2550 | 2617.15 |
4 | 300 | 3100 | 2959.09 |
5 | 340 | 3200 | 3301.03 |
6 | 380 | 3750 | 3642.97 |
7 | 420 | 4300 | 3984.91 |
8 | 460 | 3900 | 4326.85 |
9 | 500 | 5100 | 4668.79 |
10 | 540 | 4700 | 5010.73 |
Using MSE
From Equation (1), $cost_{SE}^{(i)} = ( \hat{y}^{(i)} - y^{(i)})^2$. Computing this cost for each data point gives:
Total Revenue ($) → $y^{(i)}$ | Predicted Revenue ($) → $\hat{y}^{(i)}$ | Cost → $(\hat{y}^{(i)} - y^{(i)})^2$ |
---|---|---|
1820 | 1933.27 | 12830.09 |
2300 | 2275.21 | 614.54 |
2550 | 2617.15 | 4509.12 |
3100 | 2959.09 | 19855.63 |
3200 | 3301.03 | 10207.06 |
3750 | 3642.97 | 11455.42 |
4300 | 3984.91 | 99281.71 |
3900 | 4326.85 | 182200.92 |
5100 | 4668.79 | 185942.06 |
4700 | 5010.73 | 96553.13 |
Now, to compute the total cost over all examples, we apply Equation (3b), hence:

$J(\mathbf{w},b) = \frac{1}{m} \sum\limits_{i = 1}^{m} cost_{SE}^{(i)}$

$J(\mathbf{w},b) = \frac{623449.70}{10} = 62344.97$

Thus, the MSE cost is calculated to be 62344.97.
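As a quick sanity check, here is a short script that recomputes the sum of squared errors and the $\frac{1}{m}$ average from Equation (3b) for the table above. (Note that some ML courses instead define the cost with a $\frac{1}{2m}$ factor for convenience when deriving gradients, which would halve the result.)

```python
# Actual and predicted revenue from the 10-day table above
y = [1820, 2300, 2550, 3100, 3200, 3750, 4300, 3900, 5100, 4700]
y_hat = [1933.27, 2275.21, 2617.15, 2959.09, 3301.03,
         3642.97, 3984.91, 4326.85, 4668.79, 5010.73]

total = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))  # sum of squared errors
mse_value = total / len(y)                               # Equation (3b), 1/m average
print(f"{total:.2f}  {mse_value:.2f}")  # 623449.70  62344.97
```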
And that's how we estimate the cost in linear regression. Attempt this same problem with SSE, MAE, and SAE to see how the different cost functions behave. Up next is Gradient Descent, an algorithm that minimizes these cost functions by iteratively updating the model's parameters in the direction of steepest descent, ultimately helping us find the best-fitting line.