Linear regression is one of the most basic statistical methods. It models the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data. There are several types of linear regression, such as simple linear regression, multiple linear regression, and polynomial regression. In this guide, I’ll mainly cover the first two:
- Simple linear regression (SLR)
- Multiple linear regression (MLR)
A quick recap:
- Independent variable: the variable you control or manipulate (the input)
- Dependent variable: the variable you measure or observe (the output)
Simple Linear Regression (one independent variable)
The formula for SLR is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where:
- y = dependent variable (what you’re predicting)
- x = independent variable (what you’re using to predict)
- β₁ = slope of the line (how much y changes when x increases by 1)
- β₀ = y-intercept (value of y when x = 0)
- ε = error term (residuals)
It’s pretty similar to the linear equation y = mx + c. In practice, though, data points don’t lie perfectly on the line, so we include an error term (ε) to represent how far the actual values deviate from the line.
SLR in action
Let’s say you want to predict test scores based on hours studied.
| Hours Studied (x) | Test Score (y) |
|---|---|
| 1 | 50 |
| 2 | 55 |
| 3 | 65 |
| 4 | 70 |
| 5 | 80 |

The algorithm calculates β₁ and β₀ using these formulas:

$$\beta_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$

Where x̄ and ȳ are the means (averages) of x and y.
Let’s calculate:
- x̄ = (1+2+3+4+5)/5 = 3
- ȳ = (50+55+65+70+80)/5 = 64
After the calculations (I’ll spare you the long arithmetic, but try it yourself):
- β₁ ≈ 7.5
- β₀ ≈ 41.5
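
If you’d like to check the arithmetic without doing it by hand, here is a minimal sketch in plain Python. It assumes the usual least-squares formulas: the slope is the sum of (x − x̄)(y − ȳ) divided by the sum of (x − x̄)², and the intercept is ȳ minus the slope times x̄. The variable names (`s_xy`, `s_xx`, and so on) are my own, chosen for illustration.

```python
# Data from the hours-studied example
hours = [1, 2, 3, 4, 5]          # x: hours studied
scores = [50, 55, 65, 70, 80]    # y: test scores

n = len(hours)
x_mean = sum(hours) / n          # mean of x
y_mean = sum(scores) / n         # mean of y

# Numerator: sum of (x - x_mean) * (y - y_mean)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, scores))
# Denominator: sum of (x - x_mean) squared
s_xx = sum((x - x_mean) ** 2 for x in hours)

slope = s_xy / s_xx                   # β₁
intercept = y_mean - slope * x_mean   # β₀

print("s_xy:", s_xy)                  # 75.0
print("s_xx:", s_xx)                  # 10.0
print("β₁ (slope):", slope)           # 7.5
print("β₀ (intercept):", intercept)   # 41.5
```

The two intermediate sums (75 and 10) are the only "long" part of the calculation; everything else is a division and a subtraction.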
Finally, substitute them into the equation:

$$y = 41.5 + 7.5x$$

Now let’s predict: if someone studies for 6 hours, what’s their predicted score?

$$y = 41.5 + 7.5(6) = 86.5$$

So the predicted test score is 86.5, or about 87.
The ε represents the difference between actual and predicted values. If a student who studied 3 hours got a score of 65:
- Predicted: y = 41.5 + 7.5(3) = 64
- Actual: 65
- Error (ε) = 65 - 64 = 1
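
The same subtraction can be repeated for every student in the table. The sketch below assumes the fitted values β₀ = 41.5 and β₁ = 7.5 from this example; the helper name `predict` is mine, not part of any library.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
intercept, slope = 41.5, 7.5   # β₀ and β₁ from the worked example

def predict(x):
    """Predicted score for x hours of study."""
    return intercept + slope * x

# Residual = actual value minus predicted value
residuals = [y - predict(x) for x, y in zip(hours, scores)]

for x, y, e in zip(hours, scores, residuals):
    print(f"x={x}: actual={y}, predicted={predict(x)}, residual={e}")

print("sum of residuals:", sum(residuals))   # 0.0
```

Some residuals are positive and some negative, and here they cancel out to zero. That is expected for a least-squares line that includes an intercept.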
Of course, most of us don’t do these calculations by hand anymore; we use programming tools and libraries instead.
Python example:
```python
from sklearn.linear_model import LinearRegression

# Your data
X = [[1], [2], [3], [4], [5]]  # hours studied
y = [50, 55, 65, 70, 80]       # test scores

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print("β₀ (intercept):", model.intercept_)  # 41.5
print("β₁ (slope):", model.coef_[0])        # 7.5

# Make prediction
predicted_score = model.predict([[6]])      # For 6 hours
print("Predicted score:", predicted_score)  # 86.5
```

Multiple Linear Regression (many independent variables)
The formula for MLR is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + \varepsilon$$

Or in compact form:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$$

Where:
- y = dependent variable (what you’re predicting)
- x₁, x₂, x₃, …, xₙ = independent variables (multiple predictors)
- β₀ = y-intercept (constant term)
- β₁, β₂, β₃, …, βₙ = coefficients (slopes) for each independent variable
- ε = error term (residuals)
- n = number of independent variables
The difference between SLR and MLR is that SLR uses a single independent variable to predict the dependent variable, while MLR uses two or more. For example, if we want to predict a house price, SLR would use just one of area size, number of rooms, or date of construction. MLR lets us factor all of these in at once.
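
To make the contrast concrete, here is a minimal sketch of the prediction step as a single Python function. The function name `predict` and the two-variable numbers are made up for illustration; only the one-variable numbers (41.5 and 7.5) come from the test-score example earlier in this guide.

```python
def predict(intercept, coefs, features):
    """Return intercept + sum of coefficient * feature over all features."""
    return intercept + sum(b * x for b, x in zip(coefs, features))

# SLR: one coefficient, one feature (6 hours studied)
print(predict(41.5, [7.5], [6]))              # 86.5

# MLR: same function, just more features.
# These coefficients and inputs are hypothetical, chosen only to show the shape.
print(predict(10.0, [2.0, 3.0], [1.0, 4.0]))  # 24.0
```

Seen this way, SLR is just the special case of MLR where the list of features has length one.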
MLR in action
| Size, sq ft (x₁) | Bedrooms (x₂) | Age, years (x₃) | Price, $1000s (y) |
|---|---|---|---|
| 1500 | 3 | 10 | 300 |
| 2000 | 4 | 5 | 400 |
| 1200 | 2 | 15 | 250 |
| 1800 | 3 | 8 | 350 |
| 2500 | 4 | 3 | 500 |
Calculating the coefficients by hand is tedious because it involves matrix calculations, so we’ll just use software:
```python
from sklearn.linear_model import LinearRegression

X = [
    [1500, 3, 10],  # [size, bedrooms, age]
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3],
]
y = [300, 400, 250, 350, 500]  # prices

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print("β₀ (intercept):", model.intercept_)    # -312.5
print("β₁ (size coef):", model.coef_[0])      # 0.25
print("β₂ (bedrooms coef):", model.coef_[1])  # 37.5
print("β₃ (age coef):", model.coef_[2])       # 12.5
```

Interpreting the coefficients:
- β₀ = -312.5: Base price (intercept), theoretically what a house with 0 sq ft, 0 bedrooms, and 0 age would cost. Not meaningful in real terms, just a mathematical baseline.
- β₁ = 0.25: For each additional square foot, price increases by $250, with bedrooms and age held fixed
- β₂ = 37.5: For each additional bedroom, price increases by $37,500, with size and age held fixed
- β₃ = 12.5: For each additional year of age, price increases by $12,500, with size and bedrooms held fixed. (This seems weird, since older houses should cost less. But with only five houses and four coefficients to estimate, the model fits this sample exactly, so the sign reflects these particular rows rather than a general trend. A larger dataset would likely show the expected negative relationship.)
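
One way to read these coefficients is to change a single input by one unit, keep the others fixed, and watch the prediction move by exactly that coefficient. Here is a small sketch, assuming the fitted values quoted above; the helper name `predict_price` is illustrative.

```python
intercept = -312.5
size_coef, bedrooms_coef, age_coef = 0.25, 37.5, 12.5  # price is in $1000s

def predict_price(size, bedrooms, age):
    """Predicted price in $1000s for one house."""
    return intercept + size_coef * size + bedrooms_coef * bedrooms + age_coef * age

base = predict_price(1500, 3, 10)
print(base)                                # 300.0

# Change one input at a time, holding the others fixed
print(predict_price(1501, 3, 10) - base)   # 0.25 -> $250 per extra sq ft
print(predict_price(1500, 4, 10) - base)   # 37.5 -> $37,500 per extra bedroom
print(predict_price(1500, 3, 11) - base)   # 12.5 -> $12,500 per extra year
```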
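Substituting them in:

$$y = -312.5 + 0.25x_1 + 37.5x_2 + 12.5x_3$$

Now let’s predict the price of a house with Size = 2200 sq ft, Bedrooms = 3, Age = 7 years:

$$y = -312.5 + 0.25(2200) + 37.5(3) + 12.5(7) = 437.5$$

Predicted price: $437,500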
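
For the curious, the matrix calculation mentioned earlier can be sketched with nothing but the standard library. This uses one textbook approach, the normal equations (AᵀA)β = Aᵀy, solved by Gauss-Jordan elimination. It is an illustration under that assumption, not a description of what scikit-learn does internally (libraries typically use more numerically robust routines). Exact fractions are used so the result is free of rounding noise.

```python
from fractions import Fraction

# Each row: [size in sq ft, bedrooms, age in years]
X = [
    [1500, 3, 10],
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3],
]
y = [300, 400, 250, 350, 500]  # price in $1000s

# Design matrix: a leading 1 in every row stands for the intercept β₀
A = [[Fraction(1)] + [Fraction(v) for v in row] for row in X]
k = len(A[0])  # number of unknowns: β₀, β₁, β₂, β₃

# Normal equations: (AᵀA) β = Aᵀy
AtA = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
Aty = [sum(r[i] * t for r, t in zip(A, y)) for i in range(k)]

# Gauss-Jordan elimination on the augmented matrix [AᵀA | Aᵀy]
M = [row + [rhs] for row, rhs in zip(AtA, Aty)]
for col in range(k):
    # Swap in a row that has a non-zero entry in this column
    pivot_row = next(r for r in range(col, k) if M[r][col] != 0)
    M[col], M[pivot_row] = M[pivot_row], M[col]
    # Scale the pivot row so the pivot becomes 1
    pivot = M[col][col]
    M[col] = [v / pivot for v in M[col]]
    # Eliminate this column from every other row
    for r in range(k):
        if r != col:
            factor = M[r][col]
            M[r] = [a - factor * b for a, b in zip(M[r], M[col])]

beta = [row[-1] for row in M]
print([float(b) for b in beta])   # [-312.5, 0.25, 37.5, 12.5]

# Predict the 2200 sq ft, 3-bedroom, 7-year-old house (leading 1 for β₀)
new_house = [1, 2200, 3, 7]
price = sum(b * v for b, v in zip(beta, new_house))
print(float(price))               # 437.5
```

The coefficients and the prediction match the values above.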
NOTE: When you run the code, you may see results like -312.50000000000034, 0.25000000000000006, etc. That’s perfectly normal; it’s just floating-point precision in computers. They’re essentially just -312.5, 0.25, and so on.
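
The same effect shows up in plain Python, independent of any regression library. Here is a short sketch of what it looks like, along with two common ways to deal with it: rounding for display and `math.isclose` for comparisons.

```python
import math

a = 0.1 + 0.2
print(a)                     # 0.30000000000000004
print(a == 0.3)              # False: exact comparison is fragile
print(math.isclose(a, 0.3))  # True: compare with a tolerance instead

# Rounding for display cleans up a value like the one in the note
print(round(-312.50000000000034, 2))  # -312.5
```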