Introduction to Linear Regression


Linear regression is a technique for finding the relationship between two or more variables. In machine learning and AI, it is a statistical method used to model and interpret how variables are related, typically by examining how changes in one or more input features (independent variables) affect the target (dependent variable).

In machine learning, we often refer to the independent variable as a feature and the dependent variable as a label or target.

Establishing a Basic Understanding of Linear Regression

To get a good idea of what linear regression is, imagine you are given the task of plotting daily total revenue against the number of customers who made purchases over a period of 10 days. A sample of the data is shown below:

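| Day | Total Customers | Total Revenue ($) |
|-----|-----------------|-------------------|
| 1   | 180             | 1820              |
| 2   | 220             | 2300              |
| 3   | 260             | 2550              |
| 4   | 300             | 3100              |
| 5   | 340             | 3200              |
| 6   | 380             | 3750              |
| 7   | 420             | 4300              |
| 8   | 460             | 3900              |
| 9   | 500             | 5100              |
| 10  | 540             | 4700              |

Table 1.0: Total customers vs. total revenue data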

Plotting this data as Total Customers vs. Total Revenue gives the graph below.

The plot above is a scatter plot of the data points. However, a scatter plot alone doesn't provide much insight. To better understand the relationship between the data points, we need to find the line of best fit; the process of finding this line is commonly referred to as linear regression.

In this case, we will create our own model by drawing the best fit line across the data points, as shown in Figure 1.2.

From the table and graph above, it is evident that the relationship between total revenue and the number of customers does not form a perfect straight line. Instead, the graph displays a line of best fit that approximates the general trend of the data. While individual data points may deviate from the line, it effectively captures the overall pattern and direction of the relationship between total revenue and the number of customers.

The main goal of linear regression is to predict unknown data points based on known information. In other words, it uses a set of existing data to build a model that can estimate outcomes for new, unseen data.

Linear Regression Equation

The linear regression equation is very similar to (if not the same as) the straight line equation, so let’s first examine the straight line equation:

$y \; {=} \;mx + b \quad \quad \quad (1)$

From Equation (1), the terms are defined as:

  • $y$: The target or label (unknown)

  • $m$: The slope of the line

  • $x$: The input value (known)

  • $b$: The $y$-intercept

In machine learning, the linear regression formula is presented as:

$\hat{y} \; {=} \; b + wx \quad \quad \quad (2)$

From Equation (2), the terms are defined as:

  • $\hat{y}$: predicted value of the target (pronounced as y hat)

  • $w$ (or sometimes $m$): weight of the model, which is the slope of the line

  • $x$: input feature

  • $b$: bias or y-intercept

The predicted output $\hat{y}$ represents the model’s estimation of the target variable based on the input feature. It is the output generated by applying the learned parameters to the input data.

The parameter $w$ (or sometimes $m$) is the weight or slope of the line. It determines how much influence the input $x$ has on the prediction. In other words, it controls the steepness of the line.

The variable $x$ is the input feature, a known value provided to the model. It represents the independent variable or the factor used to make predictions.

The parameter $b$ is the bias or y-intercept. It adjusts the output independently of the input $x$, allowing the model to shift the prediction line up or down.
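To make the roles of $w$ and $b$ concrete, here is a short worked check that follows directly from Equation (2). It writes $\hat{y}(x)$ for the prediction at input $x$, a notation introduced here only for this check. Setting the input to zero leaves only the bias:

$\hat{y}(0) \; {=} \; b + w \cdot 0 \; {=} \; b$

Increasing the input by one unit changes the prediction by exactly the weight:

$\hat{y}(x + 1) - \hat{y}(x) \; {=} \; \big(b + w(x + 1)\big) - \big(b + wx\big) \; {=} \; w$

So $b$ is the prediction at $x = 0$, and $w$ is the change in the prediction for each unit increase in $x$.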

From the plot:

  • weight ($w$) is calculated to be 8.55

  • bias ($b$) is calculated to be 394.54
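The text does not say how these values were obtained from the plot. One common way to find a best fit line is ordinary least squares; using it here is an assumption, not the article's stated method. A minimal sketch in Python using the data from Table 1.0 (the function name `fit_line` is illustrative):

```python
# Data copied from Table 1.0
customers = [180, 220, 260, 300, 340, 380, 420, 460, 500, 540]
revenue = [1820, 2300, 2550, 3100, 3200, 3750, 4300, 3900, 5100, 4700]


def fit_line(xs, ys):
    """Return (w, b) for the least-squares line y_hat = b + w * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y)
    b = mean_y - w * mean_x
    return w, b


w, b = fit_line(customers, revenue)
print(f"w = {w:.2f}, b = {b:.2f}")
```

On this data, the least-squares result agrees with the weight and bias quoted above to within rounding in the last digit.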

Example 1:

Suppose the business owner wants to estimate the revenue if 400 customers make purchases in a day. The owner has determined that at least 400 customers per day are crucial to sustain the business and retain staff.

| Day | Total Customers | Total Revenue ($) |
|-----|-----------------|-------------------|
| 1   | 400             | - -               |

Solution:

Using Equation (2), $\hat{y} \; {=} \; b + wx$, with the estimated weight $w = 8.55$ and bias $b = 394.54$, the total revenue is estimated as $\hat{y} \; {=} \; (394.54) + (8.55)(400) = 3814.54$

| Day | Total Customers | Predicted Total Revenue ($) |
|-----|-----------------|-----------------------------|
| 1   | 400             | 3814.54                     |

From this very simple example, the straight line is seen as the model that established the relationship between the customers and revenue.
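The calculation in Example 1 can be sketched as a small Python function. The name `predict_revenue` and its default arguments are illustrative; the defaults are simply the weight and bias quoted in the text.

```python
def predict_revenue(total_customers, w=8.55, b=394.54):
    """Apply Equation (2): y_hat = b + w * x."""
    return b + w * total_customers


estimate = predict_revenue(400)
print(round(estimate, 2))  # about 3814.54, matching Example 1
```

Once the weight and bias are fixed, the same function can be reused for any other customer count, which is what makes the fitted line a model rather than a one-off calculation.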

Multiple Linear Regression

We have examined linear regression with only one feature. However, in real life, there are usually multiple features or variables that determine the total sales of a business, not just the number of customers. This brings us to multiple linear regression.

To introduce multiple linear regression, suppose we want to collect data on a customer-by-customer basis, and these are the features or variables we decide to collect:

  • Total income

  • Gender (Male or Female)

  • Age

  • Time of the season (holiday or workday)

  • Payment method (cash or card)

In Example 1, Total Customers was represented as $x$. Now that we have 5 features, we need a way to represent each of them: Total income, Gender, Age, Time of the season, and Payment method will be represented as $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$ respectively.

The general multiple linear regression equation is given as:

$\hat{y} \; {=} \; b + w_1x_1 + w_2x_2 + \dots + w_nx_n \quad \quad \quad (3)$

So for a problem with five (5) features, the equation will be represented as:

$\hat{y} \; {=} \; b + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 \quad \quad \quad (4)$

Equation (4) above is the multiple linear regression equation limited to 5 features, while Equation (3) is general and can be extended to match any number of features.
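Equation (3) amounts to a weighted sum of the features plus the bias. A minimal Python sketch is shown below; the function name `predict` and all numeric values are hypothetical, chosen only to show the arithmetic, and are not fitted to any data in this article.

```python
def predict(features, weights, bias):
    """Apply Equation (3): y_hat = b + w_1*x_1 + ... + w_n*x_n."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical weights, features, and bias for a 5-feature problem (Equation 4)
weights = [2.0, -1.0, 0.5, 3.0, 1.5]
features = [10.0, 1.0, 40.0, 0.0, 1.0]
bias = 5.0

y_hat = predict(features, weights, bias)
print(y_hat)  # 5.0 + 20.0 - 1.0 + 20.0 + 0.0 + 1.5 = 45.5
```

With a single feature and a single weight, the same function reduces to Equation (2).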

Now consider this data with 5 features:

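| Day | Time Spent in Store (min) | Gender | Age | Is Holiday | Payment Method | Total Revenue ($) |
|-----|---------------------------|--------|-----|------------|----------------|-------------------|
| 1   | 267                       | Male   | 45  | False      | Virtual        | 220.50            |
| 2   | 195                       | Female | 32  | True       | Virtual        | 310.00            |
| 3   | 379                       | Male   | 28  | True       | Cash           | 430.75            |
| 4   | 190                       | Female | 40  | False      | Cash           | 180.00            |
| 5   | 364                       | Male   | 35  | False      | Virtual        | 390.20            |
| 6   | 248                       | Female | 29  | True       | Virtual        | 260.00            |
| 7   | 328                       | Male   | 38  | False      | Cash           | 350.10            |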
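Equation (4) multiplies each feature by a weight, so text and True/False values such as Gender, Is Holiday, and Payment Method have to be turned into numbers first. The article does not specify an encoding; the 0/1 codes in the sketch below are an arbitrary illustrative choice, and the function name `encode` is hypothetical.

```python
# Rows copied from the table above:
# (time in store (min), gender, age, is holiday, payment method, revenue ($))
rows = [
    (267, "Male", 45, False, "Virtual", 220.50),
    (195, "Female", 32, True, "Virtual", 310.00),
    (379, "Male", 28, True, "Cash", 430.75),
    (190, "Female", 40, False, "Cash", 180.00),
    (364, "Male", 35, False, "Virtual", 390.20),
    (248, "Female", 29, True, "Virtual", 260.00),
    (328, "Male", 38, False, "Cash", 350.10),
]


def encode(row):
    """Turn one table row into a numeric feature vector [x_1, ..., x_5].

    Assumed codes (not from the article): Male=1, Female=0;
    holiday=1, otherwise 0; Virtual=1, Cash=0.
    """
    minutes, gender, age, is_holiday, payment, _revenue = row
    return [
        float(minutes),
        1.0 if gender == "Male" else 0.0,
        float(age),
        1.0 if is_holiday else 0.0,
        1.0 if payment == "Virtual" else 0.0,
    ]


X = [encode(r) for r in rows]  # one feature vector per row
y = [r[-1] for r in rows]      # target: total revenue

print(X[0])  # [267.0, 1.0, 45.0, 0.0, 1.0]
```

Each entry of `X` now has five numeric values that can be paired with the weights $w_1$ to $w_5$ in Equation (4), with `y` holding the matching targets.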

Now that we understand the basics of linear regression, the next few articles will cover how to develop the linear regression model, starting with the cost function.
