[Interactive demo: adjust the slope (w) and intercept (b) to fit the line to the data points.]
1. The Concept
Regression in Machine Learning is a method to predict continuous values of a target variable.
In simple terms: predicting the price of a new house from its size, using existing data of house prices and sizes.
We do this by finding a pattern in the data.
Linear Regression is when that pattern is a line.
School Math:
Y = mX + c
m = slope
c = intercept
Machine Learning:
Y = wX + b
w = weight (importance)
b = bias (offset)
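For instance, here's a tiny sketch of what that equation looks like in code, using made-up values (not a trained model):

# Made-up values for illustration: each extra unit of size adds 3 to the
# price (w = 3) and the base price is 10 (b = 10).
w, b = 3.0, 10.0

def predict(x):
    return w * x + b   # y = wx + b

print(predict(50))     # predicted price for a size of 50 -> 160.0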
Building the model is simply figuring out the [ w ] and [ b ]
that best fit our data. This is where the math comes in.
1.5 Initialization (The Guess)
Like everything, we need a starting point. Before the model learns anything,
it has to make a default guess.
INITIAL STATE:
w = 0 (or 1), b = 0 (or 1)
Now, we plug these into our equation ($y = wx + b$) to make a prediction for
every $x$ data point we have.
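As a quick sketch of that first step, with a tiny made-up dataset (just for illustration):

import numpy as np

# Hypothetical data points (e.g. sizes and prices)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([13.0, 16.0, 19.0, 22.0])

# The default guess: the model knows nothing yet
w, b = 0.0, 0.0

# Predictions with the initial w and b -- a flat line at 0
y_pred = w * x + b
print(y_pred)   # [0. 0. 0. 0.]  -> far from the actual y values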
2. The Cost Function (MSE)
Now we have a rough line. How well can our line predict Y for a new data point X?
Where does it predict it wrong? How wrong?
There is usually a gap between the line and the actual data point. This is the Error, or Residual.
(You can see the red line in the above graph).
We use this to grade our line.
Like how grades come in different forms (marks, percentages, GPAs, etc.),
we have different ways to calculate this loss. These are called Loss/Cost functions.
But for now, let's focus on one of them: MSE (Mean Squared Error)
1. THE ERROR:
$$ ( \text{Predicted Y} - \text{Actual Y} ) $$
2. SQUARE IT:
$$ ( \text{Predicted Y} - \text{Actual Y} )^2 $$
* We square it so negatives become positives. Otherwise, they might just cancel one another.
3. AVERAGE IT:
$$ \frac{1}{n} \left[ ( \text{Predicted Y}_1 - \text{Actual Y}_1 )^2 + ( \text{Predicted Y}_2 - \text{Actual Y}_2 )^2 + \dots + ( \text{Predicted Y}_n - \text{Actual Y}_n )^2 \right] $$
* So basically, add up the squared error of every data point and divide by n:
the average squared error across all points.
n = number of data points.
If we translate that into proper math notation, it looks like this:
$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$
where $\hat{y}_i$ = the predicted value and $y_i$ = the actual value for data point $i$.
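For example, if three data points have errors (predicted minus actual) of $1$, $-2$ and $3$, then:
$$ MSE = \frac{1^2 + (-2)^2 + 3^2}{3} = \frac{1 + 4 + 9}{3} = \frac{14}{3} \approx 4.67 $$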
3. Gradient Descent
This is basically stepping through different values of w and b until the MSE is at its lowest.
At that point we have our best line, the one that best represents our data.
But how do we know the next, 'better' values of w and b? This is where Gradient Descent comes in.
w_new = w_old - learning_rate * gradient
b_new = b_old - learning_rate * gradient
Let's understand gradient first.
$$ \text{Gradient of weight} = \frac{\partial L}{\partial w}$$
$$ \text{Gradient of bias} = \frac{\partial L}{\partial b}$$
This basically tells us how much L (the Loss) changes when w or b changes.
1. Gradient for Weight ($w$):$$\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)$$
2. Gradient for Bias ($b$):$$\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$
If you want to see the calculations behind these gradients, they're below. (However, they're not necessary to understand this.)
Gradient w.r.t weight:
$$\hat{y} = wx + b$$
$$L = \frac{1}{n} \sum (y - \hat{y})^2$$
$$L = \frac{1}{n} \sum (y - (wx + b))^2$$
$$L = \frac{1}{n} \sum u^2 \text{ where } u = (y - \hat{y})$$
$$\frac{\partial L}{\partial w} = \frac{1}{n} \sum 2u \frac{\partial u}{\partial w}$$
$$\frac{\partial u}{\partial w} = -x$$
$$\frac{\partial L}{\partial w} = \frac{2}{n} \sum (y - \hat{y})(-x)$$
$$\text{Final: }\boxed{\frac{\partial L}{\partial w} = -\frac{2}{n} \sum x(y - \hat{y})}$$
Gradient w.r.t bias:
$$\hat{y} = wx + b$$
$$L = \frac{1}{n} \sum (y - \hat{y})^2$$
$$L = \frac{1}{n} \sum (y - (wx + b))^2$$
$$L = \frac{1}{n} \sum u^2 \text{ where } u = (y - \hat{y})$$
$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum 2u \frac{\partial u}{\partial b}$$
$$\frac{\partial u}{\partial b} = -1$$
$$\frac{\partial L}{\partial b} = \frac{2}{n} \sum (y - \hat{y})$$
$$\textbf{Final: } \boxed{\frac{\partial L}{\partial b} = -\frac{2}{n} \sum (y - \hat{y})}$$
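As a quick sanity check (my own addition, not needed for the rest of the post), we can compare these analytic gradients against numerical finite-difference gradients on a small made-up dataset; the two should match closely:

import numpy as np

# Small made-up dataset, and an arbitrary point (w, b) to check at
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([13.0, 16.0, 19.0, 22.0])
w, b = 0.5, 0.5

def loss(w, b):
    return np.mean((y - (w * x + b)) ** 2)   # MSE

# Analytic gradients from the boxed formulas above
dw = (-2 / len(x)) * np.sum(x * (y - (w * x + b)))
db = (-2 / len(x)) * np.sum(y - (w * x + b))

# Numerical gradients: central differences (L(p + h) - L(p - h)) / 2h
h = 1e-6
dw_num = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
db_num = (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print(dw, dw_num)   # nearly identical
print(db, db_num)   # nearly identical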
Imagine you are standing on a foggy mountain (the Loss/Error) and your goal is to get down (the Minimum Error).
Because of the fog, you can’t see the bottom. You can only feel the slope of the ground under your feet.
The Gradient:
This is the slope. It always points in the direction of steepest increase in height.
The Minus Sign:
Since the gradient points up the hill (where the error is worse), you simply turn $180^\circ$ and walk the other way.
The Learning Rate (α):
This is the size of your step.
- Step too small? It takes forever to reach the bottom.
- Step too large? You might overshoot the minimum and end up on the other side.
Our learning rate (α) is one of the things to tune if our model isn't working well.
(Will discuss in later posts)
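To make that concrete, here's a small illustrative experiment (my own sketch, using a noise-free version of the dataset generated later in this post) showing what different learning rates do to the loss after a few steps:

import numpy as np

# Toy data with no noise: y = 3x + 10
x = np.arange(1, 31, dtype=float)
y = 3 * x + 10

def final_loss(lr, epochs=50):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        y_pred = w * x + b
        w -= lr * (-2 / len(x)) * np.sum(x * (y - y_pred))
        b -= lr * (-2 / len(x)) * np.sum(y - y_pred)
    return np.mean((y - (w * x + b)) ** 2)

print(final_loss(lr=0.00001))  # too small: barely moves, loss is still high
print(final_loss(lr=0.001))    # reasonable: loss drops a lot
print(final_loss(lr=0.01))     # too large for this data: loss explodes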
$$ w_{\text{new}} = w_{\text{old}} - \alpha \cdot \left( -\frac{2}{n} \sum x (y - \hat{y}) \right) $$
$$ b_{\text{new}} = b_{\text{old}} - \alpha \cdot \left( -\frac{2}{n} \sum (y - \hat{y}) \right) $$
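To make one update concrete, take three made-up points $(1, 13), (2, 16), (3, 19)$, start from $w = 0$, $b = 0$ (so every $\hat{y} = 0$), and use $\alpha = 0.01$:
$$ \frac{\partial L}{\partial w} = -\frac{2}{3}\left(1 \cdot 13 + 2 \cdot 16 + 3 \cdot 19\right) = -68, \qquad \frac{\partial L}{\partial b} = -\frac{2}{3}\left(13 + 16 + 19\right) = -32 $$
$$ w_{\text{new}} = 0 - 0.01 \cdot (-68) = 0.68, \qquad b_{\text{new}} = 0 - 0.01 \cdot (-32) = 0.32 $$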
We keep computing new values from the previous ones for a number of iterations (epochs), until the MSE stops decreasing, i.e. the gradients are close to zero and further updates barely change anything.
The w and b from the final iteration are, essentially, our model.
Coding it from Scratch:
Now that the general math is down, converting it to code is the next step.
Import the data into the program. For simplicity, we can generate a dataset.
import numpy as np
import random

# Generate 30 points around the line y = 3x + 10, with noise in [-5, 5]
X = np.array(range(1, 31))
y = np.array([3 * x + 10 + random.uniform(-5, 5) for x in X])
Coding Linear Regression from scratch requires 3 functions:
1. Loss function:
def compute_loss(X, y, w, b):
    n = len(X)
    y_pred = w * X + b                        # predictions with current w, b
    loss = (1/n) * np.sum((y - y_pred) ** 2)  # mean squared error
    return loss
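As a quick (optional) sanity check: with the untrained w = 0 and b = 0, every prediction is 0, so the loss is simply the mean of y², which should be large for this data.

# With w = b = 0, the loss is just the mean of y squared
print(compute_loss(X, y, 0.0, 0.0))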
2. Gradient Calculation Function:
def compute_gradients(X, y, w, b):
    n = len(X)
    y_pred = w * X + b
    dw = (-2/n) * np.sum(X * (y - y_pred))  # gradient w.r.t. w
    db = (-2/n) * np.sum(y - y_pred)        # gradient w.r.t. b
    return dw, db
3. Training loop:
- Initialize w and b
- Predict y values
- Calculate loss
- Compute gradients
- Update w and b
def train_model(X, y, lr=0.001, epochs=2000):
    w, b = 0.0, 0.0                             # initialize w and b
    for i in range(epochs):
        dw, db = compute_gradients(X, y, w, b)  # predict + compute gradients
        w -= lr * dw                            # step against the gradient
        b -= lr * db
        # (optionally, call compute_loss(X, y, w, b) here to monitor training)
    return w, b
# Execute Training
final_w, final_b = train_model(X, y)
# Calculate final MSE
final_mse = compute_loss(X, y, final_w, final_b)
# Final result output for comparison:
print("--- DATA POINTS ---")
print(f"X_values = {X.tolist()}")
print(f"y_values = {y.tolist()}")
print("\n--- SCRATCH MODEL RESULTS ---")
print(f"Final w : {final_w}")
print(f"Final b : {final_b}")
print(f"Final MSE : {final_mse}")
Evaluating our model
There are a lot of ways to evaluate a model. For now, let's test it against the standard scikit-learn model.
If both models give similar values for the slope (w) and intercept (b), and make similar predictions, it's a good sign that our
implementation is working correctly.
Using Scikit-Learn
First, we import the required model and fit it to the same dataset. The code below gives us the w, b and loss values.
If they are similar to our model's values, then we have built a good model by that standard.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
X_reshaped = X.reshape(-1, 1) # Converts [1, 2, 3...] to [[1], [2], [3]...]
model = LinearRegression()
model.fit(X_reshaped, y)
sklearn_w = model.coef_[0] # This is the weight (slope)
sklearn_b = model.intercept_ # This is the bias (intercept)
y_pred = model.predict(X_reshaped)
sklearn_mse = mean_squared_error(y, y_pred)
print(f"Sklearn w : {sklearn_w}")
print(f"Sklearn b : {sklearn_b}")
print(f"Sklearn MSE : {sklearn_mse}")
My trained model's results on my dataset:
--- DATA POINTS ---
X_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
y_values = [15.254283570527157, 17.86897944267892, 18.998444314585324, 24.592291750329466, 21.529331600204493, 28.145798569639098, 26.69310372304577, 32.477532335237804, 38.02946943870903, 37.72881956452786, 38.80167209849368, 42.598608600628985, 44.465378582988684, 52.70158807090953, 57.751461086852856, 58.05407123387017, 64.59833006933121, 67.49644757486193, 63.08322319430452, 67.451158081249, 72.60067148272732, 76.14315797073942, 78.43849919621611, 77.96162390880384, 82.59962549648499, 88.8696247198491, 86.2940175373369, 90.91969459438755, 99.34910529059002, 97.14852292808499]
--- SCRATCH MODEL RESULTS ---
Final w : 3.1311651743387614
Final b : 6.186644993166872
Final MSE : 10.02715667983034
--- SCIKIT-LEARN RESULTS ---
Sklearn w : 2.9439977044878605
Sklearn b : 9.989520114711347
Sklearn MSE : 6.589495376212132
--- COMPARISON ---
Difference in w: 0.18716746985090094
Difference in MSE: 3.4376613036182073
Linear regression is one of the foundational algorithms in machine learning,
but there's still a lot more to explore, and even the topics covered here aren't exhaustive.
Concepts like hyperparameters, other model evaluation metrics
such as R², regularization techniques, and feature scaling all play
a big role in real-world applications.
Learning Resources that have Helped Me:
- StatQuest with Josh Starmer: Linear Regression, Clearly Explained!!!
- Geeksforgeeks: Linear Regression in Machine Learning
- MLU-EXPLAIN (A Literal Goldmine for Visual Learners: Check out the gradient descent viz especially): Linear Regression
I will continue updating this post as I learn more about these topics, either as follow-up posts or by editing this one.
Thank you for reading.
- EXIT -