Linear Regression Explained

Linear regression is one of the most basic statistical methods. It models the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data, and that line can then be used for prediction. There are several related techniques, such as simple linear regression, multiple linear regression, and polynomial regression. In this guide, I’ll mainly cover the first two:

Simple linear regression (SLR)
Multiple linear regression (MLR)

A quick recap:

The independent variable is the one you can control or manipulate, a.k.a. the input.
The dependent variable is the one you measure or observe, a.k.a. the output.

Simple Linear Regression (one independent variable)

The formula for SLR is:

$y = \beta_0 + \beta_1 x + \varepsilon$

Where:

y = dependent variable (what you’re predicting)
x = independent variable (what you’re using to predict)
β₁ = slope of the line (how much y changes when x increases by 1)
β₀ = y-intercept (value of y when x = 0)
ε = error term (residuals)

This is very similar to the familiar linear equation y = mx + c. In practice, however, data points don’t lie perfectly on the line, so we include an error term (ε) to represent the deviation of the actual values from the line.

SLR in action

Let’s say you want to predict test scores based on hours studied.

+-------------------+----------------+
| Hours Studied (x) | Test Score (y) |
+-------------------+----------------+
|                 1 |             50 |
|                 2 |             55 |
|                 3 |             65 |
|                 4 |             70 |
|                 5 |             80 |
+-------------------+----------------+

The goal is to find the slope β₁ and the intercept β₀ of the best-fitting line. Ordinary least squares calculates them using these formulas:

$\beta_1 = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sum[(x_i - \bar{x})^2]}$

$\beta_0 = \bar{y} - \beta_1 \cdot \bar{x}$

Where $\bar{x}$ and $\bar{y}$ are the means (averages).

Let’s calculate:

$\bar{x}$ = (1+2+3+4+5)/5 = 3
$\bar{y}$ = (50+55+65+70+80)/5 = 64

Plugging the data into the formulas gives a numerator of 75 and a denominator of 10 for β₁ (the arithmetic is worth working through yourself once), so:

β₁ ≈ 7.5
β₀ ≈ 41.5
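The skipped arithmetic can be reproduced in a few lines of plain Python. This is a minimal sketch that applies the two formulas above directly to the table; the variable names (`hours`, `scores`, `slope`, `intercept`) are illustrative choices and do not come from any library.

```python
# Hours studied (x) and test scores (y) from the table above
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]

x_mean = sum(hours) / len(hours)     # 3.0
y_mean = sum(scores) / len(scores)   # 64.0

# Numerator: sum of (x_i - x_mean)(y_i - y_mean)
# Denominator: sum of (x_i - x_mean)^2
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, scores))
denominator = sum((x - x_mean) ** 2 for x in hours)

slope = numerator / denominator        # β₁ = 75 / 10 = 7.5
intercept = y_mean - slope * x_mean    # β₀ = 64 - 7.5 * 3 = 41.5

print(slope, intercept)  # 7.5 41.5
```

The numbers in this example happen to be exact in floating point; with messier data you would normally round the results for display.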

Finally, substitute them into the equation:

$y = 41.5 + 7.5x + \varepsilon$

Now let’s predict the score for someone who studies for 6 hours:

$y = 41.5 + 7.5(6) = 41.5 + 45 = 86.5$

So the predicted test score is about 87.

The ε represents the difference between the actual and predicted values. For example, the student who studied 3 hours scored 65:

Predicted: y = 41.5 + 7.5(3) = 64
Actual: 65
Error (ε) = 65 - 64 = 1
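The same check can be repeated for every student in the table. The sketch below assumes the fitted values from this example (intercept 41.5, slope 7.5) and computes one residual per data point; the list names are illustrative.

```python
# Fitted line from the worked example: y = 41.5 + 7.5x
intercept, slope = 41.5, 7.5

hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]

# Residual = actual score - predicted score, one per student
predictions = [intercept + slope * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predictions)]

print(predictions)     # [49.0, 56.5, 64.0, 71.5, 79.0]
print(residuals)       # [1.0, -1.5, 1.0, -1.5, 1.0]
print(sum(residuals))  # 0.0
```

The residuals summing to zero is not a coincidence: for a least-squares line with an intercept, the positive and negative errors always balance out.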

Of course, few of us do these calculations by hand anymore; that’s why we use programming tools and libraries instead.

Python example:

from sklearn.linear_model import LinearRegression

# Your data
X = [[1], [2], [3], [4], [5]]  # hours studied (2D: one row per sample)
y = [50, 55, 65, 70, 80]       # test scores

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print("β₀ (intercept):", model.intercept_)  # ≈ 41.5
print("β₁ (slope):", model.coef_[0])        # ≈ 7.5

# Make a prediction; predict() returns an array, so take the first element
predicted_score = model.predict([[6]])[0]   # for 6 hours
print("Predicted score:", predicted_score)  # ≈ 86.5

Multiple Linear Regression (many independent variables)

The formula for MLR is:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon$

Or in compact form:

$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$

Where:

y = dependent variable (what you’re predicting)
x₁, x₂, x₃, …, xₙ = independent variables (multiple predictors)
β₀ = y-intercept (constant term)
β₁, β₂, β₃, …, βₙ = coefficients (slopes) for each independent variable
ε = error term (residuals)
n = number of independent variables

The difference between SLR and MLR is that SLR uses a single independent variable to predict the dependent variable, while MLR uses two or more. For example, to predict a house price, SLR would use only one of floor area, number of rooms, or construction date. MLR lets us factor in all of these at once.

MLR in action

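+------------------+---------------+-----------------+-------------------+
| Size (x₁), sq ft | Bedrooms (x₂) | Age (x₃), years | Price (y), $1000s |
+------------------+---------------+-----------------+-------------------+
|             1500 |             3 |              10 |               300 |
|             2000 |             4 |               5 |               400 |
|             1200 |             2 |              15 |               250 |
|             1800 |             3 |               8 |               350 |
|             2500 |             4 |               3 |               500 |
+------------------+---------------+-----------------+-------------------+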

With several predictors, the coefficients are computed with matrix calculations, which are tedious to do by hand, so we’ll just use software:

from sklearn.linear_model import LinearRegression

X = [
    [1500, 3, 10],  # [size, bedrooms, age]
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3]
]

y = [300, 400, 250, 350, 500]  # prices in $1000s

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print("β₀ (intercept):", model.intercept_)       # ≈ -312.5
print("β₁ (size coef):", model.coef_[0])         # ≈ 0.25
print("β₂ (bedrooms coef):", model.coef_[1])     # ≈ 37.5
print("β₃ (age coef):", model.coef_[2])          # ≈ 12.5
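
For the curious, the matrix calculations can be sketched with NumPy. Ordinary least squares with an intercept amounts to adding a column of ones to the data and solving for the coefficient vector, which in closed form is $\hat{\beta} = (X^\top X)^{-1} X^\top y$. The sketch below assumes NumPy is installed and uses `np.linalg.lstsq`, which solves the same least-squares problem in a numerically stable way. It is an illustration of the idea, not necessarily what scikit-learn does internally, and `X_design` is just an illustrative name.

```python
import numpy as np

# Same housing data as above: [size, bedrooms, age] and price in $1000s
X = np.array([
    [1500, 3, 10],
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3],
], dtype=float)
y = np.array([300, 400, 250, 350, 500], dtype=float)

# Prepend a column of ones so the first coefficient acts as the intercept β₀
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ beta - y||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta, 4))  # approximately [-312.5, 0.25, 37.5, 12.5]
```

The first entry of `beta` is the intercept, and the remaining entries line up with the size, bedrooms, and age columns.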

Interpreting the coefficients:

β₀ = -312.5: Base price (intercept), theoretically what a house with 0 sq ft, 0 bedrooms, and 0 years of age would cost. Not meaningful in real terms, just a mathematical baseline.
β₁ = 0.25: For each additional square foot, the price increases by $250 (0.25 × $1000), holding the other variables constant.
β₂ = 37.5: For each additional bedroom, the price increases by $37,500, holding the other variables constant.
β₃ = 12.5: For each additional year of age, the price increases by $12,500, holding the other variables constant. (This seems odd, since older houses usually cost less. With only five houses and four fitted parameters, the coefficients are poorly constrained; a larger dataset would likely show the expected trend.)
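
The odd sign on the age coefficient is easier to understand once you see how little the five houses constrain the model. As a rough check, the sketch below plugs the coefficients listed above back into the training data; `predict_price` is a hypothetical helper written for this illustration.

```python
# Coefficients as reported above (price in $1000s)
INTERCEPT, B_SIZE, B_BEDROOMS, B_AGE = -312.5, 0.25, 37.5, 12.5

def predict_price(size, bedrooms, age):
    """Hypothetical helper: evaluate the fitted MLR equation for one house."""
    return INTERCEPT + B_SIZE * size + B_BEDROOMS * bedrooms + B_AGE * age

# (size, bedrooms, age, actual price) for the five training houses
houses = [
    (1500, 3, 10, 300),
    (2000, 4, 5, 400),
    (1200, 2, 15, 250),
    (1800, 3, 8, 350),
    (2500, 4, 3, 500),
]

# Residual = actual price - predicted price for each training house
residuals = [price - predict_price(size, beds, age)
             for size, beds, age, price in houses]
print(residuals)  # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Every house is reproduced exactly. With four parameters and only five points there is a single degree of freedom left, and this particular dataset happens to lie exactly on the fitted plane, so the individual coefficients should not be read too literally.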

Substituting them in:

$y = -312.5 + 0.25x_1 + 37.5x_2 + 12.5x_3$

Now let’s predict the price of a house with Size = 2200 sq ft, Bedrooms = 3, Age = 7 years:

$\text{Price} = -312.5 + 0.25(2200) + 37.5(3) + 12.5(7)$

$= -312.5 + 550 + 112.5 + 87.5 = 437.5$

Predicted price: $437,500
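
The same substitution can be written as a short plain-Python sketch of the compact formula, assuming the coefficients above. With the fitted scikit-learn model, `model.predict([[2200, 3, 7]])` should give the same value up to floating-point rounding.

```python
# Coefficients from the fitted equation above (price in $1000s)
intercept = -312.5
coefficients = [0.25, 37.5, 12.5]   # size, bedrooms, age

# The new house: 2200 sq ft, 3 bedrooms, 7 years old
house = [2200, 3, 7]

# Prediction = β₀ + sum of βᵢ * xᵢ over all predictors
price_thousands = intercept + sum(b * x for b, x in zip(coefficients, house))

print(price_thousands)                    # 437.5
print(f"${price_thousands * 1000:,.0f}")  # $437,500
```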

NOTE
When you run the code, you may see results like -312.50000000000034, 0.25000000000000006, etc. That’s perfectly normal; it’s just floating-point rounding in computers. They are effectively -312.5, 0.25, and so on.
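
If the extra digits get in the way, two common options are rounding for display and comparing with a tolerance. The sketch below uses only the Python standard library; the two raw values are the examples quoted in the note, not guaranteed outputs.

```python
import math

raw_intercept = -312.50000000000034  # example value quoted in the note
raw_slope = 0.25000000000000006      # example value quoted in the note

# Option 1: round for display
print(round(raw_intercept, 2))  # -312.5
print(round(raw_slope, 2))      # 0.25

# Option 2: compare with a tolerance instead of ==
print(raw_slope == 0.25)              # False
print(math.isclose(raw_slope, 0.25))  # True
```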