Linear Regression Explained

Linear regression is one of the most basic statistical methods. It models the relationship between a dependent variable and one or more independent variables by fitting a straight line to the data, which can then be used for prediction. There are many types of linear regression, such as simple linear regression, multiple linear regression, and polynomial regression. In this guide, I’ll mainly cover the first two:

  • Simple linear regression (SLR)
  • Multiple linear regression (MLR)

As a quick recap:

  • An independent variable is a variable that you can control or manipulate, aka the input
  • A dependent variable is the variable you measure or observe, aka the output

Simple Linear Regression (one independent variable)

The formula for SLR is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where:

  • y = dependent variable (what you’re predicting)
  • x = independent variable (what you’re using to predict)
  • β₁ = slope of the line (how much y changes when x increases by 1)
  • β₀ = y-intercept (value of y when x = 0)
  • ε = error term (residuals)

It’s pretty similar to the linear equation y = mx + c, with β₁ as the slope m and β₀ as the intercept c. In practice, however, data points don’t lie perfectly on a line, so we include an error term (ε) to represent the deviation of the actual values from the predicted line.

SLR in action

Let’s say you want to predict test scores based on hours studied.

+-------------------+----------------+
| Hours Studied (x) | Test Score (y) |
+-------------------+----------------+
| 1                 | 50             |
| 2                 | 55             |
| 3                 | 65             |
| 4                 | 70             |
| 5                 | 80             |
+-------------------+----------------+

The algorithm (ordinary least squares) calculates the slope β₁ and the intercept β₀ using these formulas:

$$\beta_1 = \frac{\sum[(x_i - \bar{x})(y_i - \bar{y})]}{\sum[(x_i - \bar{x})^2]}$$

$$\beta_0 = \bar{y} - \beta_1 \cdot \bar{x}$$

Where $\bar{x}$ and $\bar{y}$ are the means (averages) of x and y.

Let’s calculate:

  • $\bar{x}$ = (1+2+3+4+5)/5 = 3
  • $\bar{y}$ = (50+55+65+70+80)/5 = 64

After working through the sums (I’ll spare you the long arithmetic, but try it yourself):

  • β₁ = 75 / 10 = 7.5
  • β₀ = 64 - 7.5(3) = 41.5
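For readers who want to check the skipped arithmetic, here is a minimal sketch in plain Python that applies the two formulas above to the table. The variable names (`hours`, `scores`, `s_xy`, `s_xx`) are illustrative choices, not part of any library.

```python
# Data from the table above
hours = [1, 2, 3, 4, 5]          # x: hours studied
scores = [50, 55, 65, 70, 80]    # y: test scores

n = len(hours)
x_bar = sum(hours) / n           # mean of x
y_bar = sum(scores) / n          # mean of y

# Numerator of the slope formula: sum of (x_i - x_bar) * (y_i - y_bar)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
# Denominator of the slope formula: sum of (x_i - x_bar)^2
s_xx = sum((x - x_bar) ** 2 for x in hours)

beta_1 = s_xy / s_xx             # slope
beta_0 = y_bar - beta_1 * x_bar  # intercept

print("x_bar:", x_bar, "y_bar:", y_bar)   # x_bar: 3.0 y_bar: 64.0
print("S_xy:", s_xy, "S_xx:", s_xx)       # S_xy: 75.0 S_xx: 10.0
print("slope:", beta_1)                   # slope: 7.5
print("intercept:", beta_0)               # intercept: 41.5
```

The slope is simply 75 / 10, and the intercept follows from the two means.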

Finally, substitute them into the equation:

$$y = 41.5 + 7.5x + \varepsilon$$

Now let’s make a prediction: if someone studies for 6 hours, what’s their predicted score?

$$y = 41.5 + 7.5(6) = 41.5 + 45 = 86.5$$

So the predicted test score is about 87.

The ε represents the difference between the actual and predicted values. For example, the student in the table who studied 3 hours got a score of 65:

  • Predicted: y = 41.5 + 7.5(3) = 64
  • Actual: 65
  • Error (ε) = 65 - 64 = 1
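The same check can be repeated for every row of the table. This is a small sketch in plain Python, assuming the fitted values β₀ = 41.5 and β₁ = 7.5 from above; the variable names are illustrative.

```python
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 65, 70, 80]
beta_0, beta_1 = 41.5, 7.5   # intercept and slope from the fit above

residuals = []
for x, actual in zip(hours, scores):
    predicted = beta_0 + beta_1 * x   # point on the fitted line
    residual = actual - predicted     # actual minus predicted
    residuals.append(residual)
    print(f"x={x}: predicted={predicted}, actual={actual}, residual={residual}")

print("Sum of residuals:", sum(residuals))  # Sum of residuals: 0.0
```

Some points sit above the line and some below it. For a least-squares line with an intercept, these differences always sum to zero.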

Of course, hardly anyone performs these calculations by hand anymore. That’s why we use programming tools and libraries instead.

Python example:

from sklearn.linear_model import LinearRegression

# Your data
X = [[1], [2], [3], [4], [5]]  # hours studied (one row per student)
y = [50, 55, 65, 70, 80]       # test scores

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print("β₀ (intercept):", model.intercept_)  # ≈ 41.5
print("β₁ (slope):", model.coef_[0])        # ≈ 7.5

# Make a prediction for 6 hours; predict() returns an array, so take its first element
predicted_score = model.predict([[6]])
print("Predicted score:", predicted_score[0])  # ≈ 86.5

Multiple Linear Regression (many independent variables)

The formula for MLR is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n + \varepsilon$$

Or in compact form:

$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \varepsilon$$

Where:

  • y = dependent variable (what you’re predicting)
  • x₁, x₂, x₃, …, xₙ = independent variables (multiple predictors)
  • β₀ = y-intercept (constant term)
  • β₁, β₂, β₃, …, βₙ = coefficients (slopes) for each independent variable
  • ε = error term (residuals)
  • n = number of independent variables

The difference between SLR and MLR is that SLR uses only one independent variable to predict the dependent variable, while MLR uses two or more. For example, to predict a house price, SLR would use just one of floor area, number of bedrooms, or age of the house. MLR lets us factor in all of them at once.

MLR in action

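+------------------+---------------+-----------------+-------------------+
| Size (x₁), sq ft | Bedrooms (x₂) | Age (x₃), years | Price (y), $1000s |
+------------------+---------------+-----------------+-------------------+
| 1500             | 3             | 10              | 300               |
| 2000             | 4             | 5               | 400               |
| 1200             | 2             | 15              | 250               |
| 1800             | 3             | 8               | 350               |
| 2500             | 4             | 3               | 500               |
+------------------+---------------+-----------------+-------------------+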

Calculating the coefficients by hand is impractical here because it involves matrix calculations, so we’ll just use software:

from sklearn.linear_model import LinearRegression

# Each row is one house: [size, bedrooms, age]
X = [
    [1500, 3, 10],
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3],
]
y = [300, 400, 250, 350, 500]  # prices in $1000s

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients (coef_ follows the column order of X)
print("β₀ (intercept):", model.intercept_)    # ≈ -312.5
print("β₁ (size coef):", model.coef_[0])      # ≈ 0.25
print("β₂ (bedrooms coef):", model.coef_[1])  # ≈ 37.5
print("β₃ (age coef):", model.coef_[2])       # ≈ 12.5
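To give a feel for the matrix calculations mentioned above, here is a rough sketch that solves the same least-squares problem directly with NumPy (assuming NumPy is installed). In textbook form the solution is $\hat{\beta} = (X^\top X)^{-1} X^\top y$; the sketch uses `np.linalg.lstsq`, which solves the same problem in a numerically safer way. The name `A` for the matrix with an added column of ones is my own choice.

```python
import numpy as np

# Same data as the scikit-learn example above
X = np.array([
    [1500, 3, 10],   # [size, bedrooms, age]
    [2000, 4, 5],
    [1200, 2, 15],
    [1800, 3, 8],
    [2500, 4, 3],
], dtype=float)
y = np.array([300, 400, 250, 350, 500], dtype=float)

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# Find the beta that minimises the squared error ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("intercept:", beta[0])
print("size, bedrooms, age coefficients:", beta[1:])
```

Up to floating-point noise, this should give the same intercept and coefficients as the scikit-learn model.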

Interpreting the coefficients:

  • β₀ = -312.5: The intercept, i.e. the theoretical price of a house with 0 sq ft, 0 bedrooms, and 0 years of age. It is not meaningful in real terms, just a mathematical baseline.
  • β₁ = 0.25: For each additional square foot, the price increases by $250, holding bedrooms and age fixed.
  • β₂ = 37.5: For each additional bedroom, the price increases by $37,500, holding size and age fixed.
  • β₃ = 12.5: For each additional year of age, the price increases by $12,500, holding size and bedrooms fixed. (This seems weird, since older houses should usually cost less. With only five houses and four coefficients to estimate, there is very little data to pin down each effect, so the individual coefficients are unreliable. A larger dataset would likely give a more realistic estimate.)
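One way to see why the age coefficient should not be taken at face value is to look at how the model relates to this tiny dataset. The sketch below, in plain Python, plugs the coefficients above back into the five rows and also measures how strongly size and age move together. The helper `pearson` is an illustrative function written for this sketch, not a library call.

```python
sizes = [1500, 2000, 1200, 1800, 2500]   # sq ft
bedrooms = [3, 4, 2, 3, 4]
ages = [10, 5, 15, 8, 3]                 # years
prices = [300, 400, 250, 350, 500]       # $1000s

# Coefficients reported above
b0, b1, b2, b3 = -312.5, 0.25, 37.5, 12.5

# Residual = actual price - fitted price, for each house
residuals = [
    p - (b0 + b1 * s + b2 * b + b3 * a)
    for s, b, a, p in zip(sizes, bedrooms, ages, prices)
]
print("Residuals:", residuals)   # Residuals: [0.0, 0.0, 0.0, 0.0, 0.0]


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    cov = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    var_u = sum((a - u_bar) ** 2 for a in u)
    var_v = sum((b - v_bar) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


print("corr(size, age):", round(pearson(sizes, ages), 3))   # roughly -0.965
```

The model reproduces all five prices exactly, which is a sign that there is too little data rather than a sign of a great model. In this sample the bigger houses also happen to be the newer ones, so the effect of age is hard to separate from the effect of size.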

Substituting them in:

$$y = -312.5 + 0.25x_1 + 37.5x_2 + 12.5x_3$$

Now let’s predict the price of a house with Size = 2200 sq ft, Bedrooms = 3, Age = 7 years:

$$\text{Price} = -312.5 + 0.25(2200) + 37.5(3) + 12.5(7)$$

$$= -312.5 + 550 + 112.5 + 87.5 = 437.5$$

Predicted price: $437,500
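The substitution above can also be wrapped in a small function so that other houses can be tried quickly. This is only a sketch: `predict_price` is a hypothetical helper name, and the coefficients are hard-coded from the fit above rather than read from the model.

```python
def predict_price(size_sqft, bedrooms, age_years):
    """Predicted price in $1000s, using the coefficients fitted above."""
    return -312.5 + 0.25 * size_sqft + 37.5 * bedrooms + 12.5 * age_years


price = predict_price(2200, 3, 7)
print(f"Predicted price: ${price * 1000:,.0f}")   # Predicted price: $437,500
```

With a fitted scikit-learn model, `model.predict([[2200, 3, 7]])` should give the same value up to floating-point noise.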

NOTE

When you run the code, you may see results like -312.50000000000034, 0.25000000000000006, etc. That’s perfectly normal: it’s just floating-point precision in computers. They’re essentially -312.5, 0.25, and so on.