Introduction to Linear Regression
Linear regression is a technique for finding the relationship between two or more variables. In the context of machine learning and AI, it is a statistical method used to model and interpret how variables are related, typically by examining how changes in one or more input features (independent variables) affect the outcome of the target (dependent variable).
In machine learning, we often refer to the independent variable as a feature and the dependent variable as a label or target.
Establishing a Basic Understanding of Linear Regression
To get a good idea of what linear regression is, imagine you're given the task of plotting daily total revenue against the number of customers who made purchases over a period of 10 days. Here is a sample of the data:
Day | Total Customers | Total Revenue ($) |
---|---|---|
1 | 180 | 1820 |
2 | 220 | 2300 |
3 | 260 | 2550 |
4 | 300 | 3100 |
5 | 340 | 3200 |
6 | 380 | 3750 |
7 | 420 | 4300 |
8 | 460 | 3900 |
9 | 500 | 5100 |
10 | 540 | 4700 |
Table 1.0: Total customers vs. total revenue data
Plotting this data as Total Customers vs. Total Revenue gives the graph below.
The plot above is a scatter plot of the individual data points. However, the scatter plot alone doesn't provide much insight. To better understand the relationship between the data points, we need to find the line of best fit, which is what linear regression produces.
In this case, we will create our own model by drawing the best fit line across the data points, as shown in Figure 1.2.
From the table and graph above, it is evident that the relationship between total revenue and the number of customers does not form a perfect straight line. Instead, the graph displays a line of best fit that approximates the general trend of the data. While individual data points may deviate from the line, it effectively captures the overall pattern and direction of the relationship between total revenue and the number of customers.
The main goal of linear regression is to predict unknown data points based on known information. In other words, it uses a set of existing data to build a model that can estimate outcomes for new, unseen data.
Linear Regression Equation
The linear regression equation is essentially the same as the straight-line equation, so let’s first examine the straight-line equation:
$y \; {=} \;mx + b \quad \quad \quad (1)$
From Equation (1), the terms are defined as:
$y$: The target or label (unknown)
$m$: The slope of the line
$x$: The input value (known)
$b$: The $y$-intercept
In machine learning, the linear regression formula is written as
$\hat{y} \; {=} \; b + wx \quad \quad \quad (2)$
From Equation (2), the terms are defined as:
$\hat{y}$: predicted value of the target (pronounced "y hat")
$w$ (or sometimes $m$): weight of the model, which is the slope of the line
$x$: input feature
$b$: bias or y-intercept
The predicted output $\hat{y}$ represents the model’s estimation of the target variable based on the input feature. It is the output generated by applying the learned parameters to the input data.
The parameter $w$ (or sometimes $m$) is the weight or slope of the line. It determines how much influence the input $x$ has on the prediction. In other words, it controls the steepness of the line.
The variable $x$ is the input feature, a known value provided to the model. It represents the independent variable or the factor used to make predictions.
The parameter $b$ is the bias or y-intercept. It adjusts the output independently of the input $x$, allowing the model to shift the prediction line up or down.
From the plot:
weight ($w$) is calculated to be 8.55
bias ($b$) is calculated to be 394.54
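The article does not say how these two values were obtained. As a minimal sketch, assuming they come from an ordinary least-squares fit to Table 1.0, the weight and bias can be approximately reproduced in plain Python. The function name `fit_line` is hypothetical and used only for illustration.

```python
# Data from Table 1.0: customers per day (x) and revenue in dollars (y).
customers = [180, 220, 260, 300, 340, 380, 420, 460, 500, 540]
revenue = [1820, 2300, 2550, 3100, 3200, 3750, 4300, 3900, 5100, 4700]


def fit_line(xs, ys):
    """Return (w, b) for the least-squares line y_hat = b + w * x.

    Assumption: ordinary least squares; the article does not name the method.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    # The least-squares line passes through the point (mean_x, mean_y).
    b = mean_y - w * mean_x
    return w, b


w, b = fit_line(customers, revenue)
print(f"w = {w:.2f}, b = {b:.2f}")
```

Under this assumption the fit lands within rounding of the values quoted above; the last digit of the bias may differ depending on whether the value is rounded or truncated.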
Example 1:
Suppose the business owner wants to estimate the revenue if 400 customers make purchases in a day. The owner has determined that at least 400 customers daily is crucial to sustain the business and retain staff.
Day | Total Customers | Total Revenue($) |
---|---|---|
1 | 400 | - - |
Solution:
Using Equation (2), $\hat{y} \; {=} \; b + wx$, with the estimated weight $w = 8.55$ and bias $b = 394.54$, the predicted total revenue is $\hat{y} \; {=} \; (394.54) + (8.55)(400) = 3814.54$
Day | Total Customers | Predicted Total Revenue($) |
---|---|---|
1 | 400 | 3814.54 |
From this very simple example, the straight line is seen as the model that established the relationship between the customers and revenue.
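The arithmetic in Example 1 can be sketched as a small Python function. The name `predict_revenue` is hypothetical; the default weight and bias are the values quoted in the article.

```python
def predict_revenue(customers, w=8.55, b=394.54):
    """Apply Equation (2): y_hat = b + w * x."""
    return b + w * customers


# Estimate the revenue for a day with 400 customers, as in Example 1.
estimate = predict_revenue(400)
print(f"Predicted revenue for 400 customers: ${estimate:.2f}")
```

Calling the same function with a different customer count gives the estimate for that day, which is all the fitted line does as a model.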
Multiple Linear Regression
We have examined linear regression with only one feature. However, in real life, there are usually multiple features or variables that determine the total sales of a business, not just the number of customers. This brings us to multiple linear regression.
To introduce multiple linear regression, let's say we want to collect data on a customer-by-customer basis, and these are the features or variables that we decided to collect:
Total income
Gender (Male or Female)
Age
Time of the season (holiday or workday)
Payment method (cash or card)
In Example 1, Total Customers was represented as $x$. Now we have five features, so we need a way to represent each of them. The Total income, Gender, Age, Time of the season and Payment method features will be represented as $x_1$, $x_2$, $x_3$, $x_4$ and $x_5$ respectively. The general multiple linear regression equation is given as
$\hat{y} \; {=} \; b + w_1x_1 + w_2x_2 + \dots + w_nx_n \quad \quad \quad (3)$
So for a problem with five (5) features, the equation will be represented as:
$\hat{y} \; {=} \; b + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 \quad \quad \quad (4)$
Equation (4) is the multiple linear regression equation for exactly five features, while Equation (3) is the general form and can be extended to match any number of features.
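Equation (3) is a bias plus a weighted sum, which can be sketched in a few lines of Python. The function name `predict` and all numbers below are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to any data in this article.

```python
def predict(features, weights, bias):
    """Equation (3): y_hat = b + w_1*x_1 + ... + w_n*x_n."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical five-feature example matching the shape of Equation (4).
example_weights = [2.0, 0.5, -1.0, 3.0, 0.0]
example_features = [10.0, 4.0, 2.0, 1.0, 7.0]
example_bias = 1.0

# 1.0 + (20.0 + 2.0 - 2.0 + 3.0 + 0.0) = 24.0
y_hat = predict(example_features, example_weights, example_bias)
print(y_hat)
```

Because the function works on lists of any matching length, the same code covers one feature, five features, or $n$ features.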
Now consider this data with five features:
Day | Time Spent in Store (min) | Gender | Age | Is Holiday | Payment Method | Total Revenue ($) |
---|---|---|---|---|---|---|
1 | 267 | Male | 45 | False | Virtual | 220.50 |
2 | 195 | Female | 32 | True | Virtual | 310.00 |
3 | 379 | Male | 28 | True | Cash | 430.75 |
4 | 190 | Female | 40 | False | Cash | 180.00 |
5 | 364 | Male | 35 | False | Virtual | 390.20 |
6 | 248 | Female | 29 | True | Virtual | 260.00 |
7 | 328 | Male | 38 | False | Cash | 350.10 |
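Equation (4) needs numeric inputs, while several columns in the table hold text or True/False values. The article does not say how these would be converted, so the following is only a sketch under an assumed 0/1 coding (Male = 1, Female = 0; holiday = 1; Cash = 1, Virtual = 0). The function name `encode_row` is hypothetical.

```python
def encode_row(minutes_in_store, gender, age, is_holiday, payment_method):
    """Turn one table row into a numeric feature list [x_1, ..., x_5].

    The 0/1 codings below are assumptions for illustration only.
    """
    return [
        float(minutes_in_store),
        1.0 if gender == "Male" else 0.0,          # assumed: Male=1, Female=0
        float(age),
        1.0 if is_holiday else 0.0,                # True=1, False=0
        1.0 if payment_method == "Cash" else 0.0,  # assumed: Cash=1, Virtual=0
    ]


# Day 1 from the table: 267 min, Male, 45, not a holiday, Virtual payment.
day_1 = encode_row(267, "Male", 45, False, "Virtual")
print(day_1)
```

A list produced this way has the five numeric values that Equation (4) expects as $x_1$ through $x_5$.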
Now that we understand the basics of linear regression, the next few articles will cover how to develop the linear regression model. Next is the cost function.