
The Reality Check: How bad is my model?

NOTE:
The concepts of Error Metrics and Fit Metrics are generally applicable to all types of ML models, but the formulas and interpretations vary across model types (classification, regression, etc.). Here, we will focus on regression models.
None of this is limited to Linear Regression specifically.

1. The Error Metrics - The Lower the Better

As mentioned before, there are several metrics for error calculation, and each is suited to different kinds of data and predictions.
It is advised to use at least two of them to evaluate your model.

NOTE: In the below formulas,
  1. n is the number of datapoints,
  2. y is the actual value,
  3. ȳ is the mean of the actual values,
  4. ŷ is the predicted value,
  5. L₁ and L₂ are the losses used in the MAE and MSE calculations respectively.

MSE (Mean Squared Error):

$$ L_2 = RSS = \sum(\text{Predicted Y}- \text{Actual Y})^2 $$ $$ MSE = \frac{1}{n}L_2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

Since the errors are squared, MSE amplifies the bigger ones (greater than 1) and shrinks the already small ones (less than 1). That's why a low MSE value doesn't always mean a better model.

For example:
A difference of 10 between predicted and actual -> becomes 100
A difference of 0.2 between predicted and actual -> becomes 0.04
This means that MSE is more sensitive to outliers.

If you don't want outliers (datapoints that deviate sharply from the rest of the data) to dominate the metric, skip MSE.


        from sklearn.metrics import mean_squared_error

        true_values = [3.0, 5.0, 7.5]       # actual y values (example numbers)
        predicted_values = [2.8, 5.4, 7.0]  # model predictions (example numbers)

        mse = mean_squared_error(true_values, predicted_values)

MAE (Mean Absolute Error):

$$ L_1 = \sum|\text{Predicted Y}- \text{Actual Y}| $$ $$ MAE = \frac{1}{n}L_1 =\frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i- y_i| $$

We take the absolute value of each difference so that positive and negative errors don't cancel each other out and end up contributing nothing to the metric.
For example: one datapoint's error of +10 won't cancel out another datapoint's error of -10, since both are taken as abs(10) -> 10.

MAE is more robust (less sensitive) to outliers since it doesn't square the errors. On the other hand, it doesn't give extra penalty to large errors and shows no direction of the error (positive or negative). Moreover, unlike MSE, it's not differentiable at zero error, which makes it awkward to use in gradient descent calculations.

MAE is more human-interpretable as its units are the same as the target variable. When introducing the model, you can say "On average, our model is off by {MAE} units"


        from sklearn.metrics import mean_absolute_error

        true_values = [3.0, 5.0, 7.5]       # actual y values (example numbers)
        predicted_values = [2.8, 5.4, 7.0]  # model predictions (example numbers)

        mae = mean_absolute_error(true_values, predicted_values)

RMSE (Root Mean Squared Error):

$$RMSE = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

This is similar to MAE in terms of interpretation. However, RMSE gives a relatively higher weight to large errors since the errors are squared before averaging.

RMSE can also be read as the standard deviation of the prediction errors (residuals): it tells you how spread out the errors are, and it is pulled upward by the larger ones.


        from sklearn.metrics import mean_squared_error
        import numpy as np  # for the square root

        true_values = [3.0, 5.0, 7.5]       # actual y values (example numbers)
        predicted_values = [2.8, 5.4, 7.0]  # model predictions (example numbers)

        rmse = np.sqrt(mean_squared_error(true_values, predicted_values))        # square root of MSE, via numpy
        rmse = mean_squared_error(true_values, predicted_values, squared=False)  # sklearn takes the root for you
        # note: scikit-learn >= 1.4 also provides root_mean_squared_error directly

Some other Error Metrics:

  1. Mean Absolute Percentage Error (MAPE)
    Percentage error between actual and predicted values
    $$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|\hat{y}_i- y_i|}{|y_i|}\times100$$
    No units: Scale Independent (though it is undefined whenever an actual value is 0)
    Interpretable: "Our Model is off by {MAPE} %"

  2. Mean Squared Logarithmic Error (MSLE) or Root Mean Squared Logarithmic Error (RMSLE):
    Both follow the same concept: MSE and RMSE respectively, computed on log-transformed values: $$MSLE = \frac{1}{n}\sum_{i=1}^{n}\left(\log(1+\hat{y}_i) - \log(1+y_i)\right)^2$$
    Taking logs damps the larger errors -> the metric measures relative rather than absolute error.
    So in a way, it prioritises being proportionally close on small values and is used in situations where you care about relative accuracy but not about the occasional large absolute miss.

  3. Huber Loss:
    A blend of MSE and MAE. It has a threshold value (THV) that decides whether it acts as MSE or as MAE (a sketch of all three metrics follows this list).
    Smaller errors (|error| less than THV) -> acts as MSE (squared loss), making it learn fine details.
    Larger errors (|error| greater than THV) -> acts as MAE (absolute loss), less sensitive to outliers, no squared penalisation.
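
A minimal sketch of these three, assuming scikit-learn's mean_absolute_percentage_error and mean_squared_log_error plus a hand-rolled huber_loss helper (the arrays are made-up example numbers):

        import numpy as np
        from sklearn.metrics import mean_absolute_percentage_error, mean_squared_log_error

        true_values = np.array([3.0, 5.0, 7.5])       # actual y values (example numbers)
        predicted_values = np.array([2.8, 5.4, 7.0])  # model predictions (example numbers)

        mape = mean_absolute_percentage_error(true_values, predicted_values) * 100  # sklearn returns a fraction
        msle = mean_squared_log_error(true_values, predicted_values)                # MSE on log(1 + y)

        def huber_loss(y_true, y_pred, delta=1.0):
            """Squared loss below the threshold delta, absolute (linear) loss above it."""
            error = np.abs(y_true - y_pred)
            quadratic = 0.5 * error**2              # MSE-like branch for small errors
            linear = delta * (error - 0.5 * delta)  # MAE-like branch for large errors
            return np.mean(np.where(error <= delta, quadratic, linear))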

2. The Fit Metrics: The Higher the Better

R² (R-squared):

It's a statistic that measures how well the model explains the variance in the target variable - an indication of goodness of fit.
$$\text{Proportion of variance explained by my model} = 1 - \text{Proportion of variance NOT explained by my model}$$ $$R^2 = 1-{\frac{\text{SSE}}{\text{SST}}}$$ Let's simplify it and see: $$R^2 = {\frac{\text{SST-SSE}}{\text{SST}}} = {\frac{\text{SSR}}{\text{SST}}}$$

But what are these metrics?? What are SSE, SST and SSR?

SST (Total Sum of Squares): Distance from each point to the mean line ---> Measures total variation in the data

SSE (Error Sum of Squares): Distance from each point to the regression line ---> Measures unexplained variation (errors/residuals)

SSR (Regression Sum of Squares): Distance from the regression line to the mean line ---> Measures variation explained by the model

One thing to note: these values are squared and summed over every point, so the formulas are:

$$SST = \sum (y_i - \bar{y})^2 \quad\rightarrow\quad \text{(actual - mean)}^2$$ $$SSE = \sum (y_i - \hat{y}_i)^2 \quad\rightarrow\quad \text{(actual - regression)}^2$$ $$SSR = \sum (\hat{y}_i - \bar{y})^2 \quad\rightarrow\quad \text{(regression - mean)}^2$$

The graph helps us understand what these metrics are by calculating them for a single point (the orange point).
From this, we can see that R² simply says whether our line is better at predicting Y than a flat average line, and by how much.
It also gives us the base identity from which R² is calculated: $$\text{SST} = \text{SSR+SSE}$$ And the proportion of SSR with respect to SST is our R².
In simple words, if R² = 0.8, it means that 80% of the variance in our Y is explained by our X variable/s. It's basically asking: if we predicted the average value for every datapoint, how much worse would that error be than our model's error? However, it doesn't measure the model's predictive strength, just how well X explains Y.

It ranges from 0 to 1, where 0 means the model explains none of the variance (no better than predicting the mean) and 1 means it explains all of it.

However, in some cases it can be negative, when SSE > SST, i.e. the error from the mean line is smaller than the error from our model's line. This just means our model is very bad, worse than the baseline average model, and we need to change it.
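
To make the decomposition concrete, here's a manual computation (made-up numbers) that gives the same value as sklearn's r2_score below:

        import numpy as np

        y = np.array([3.0, 5.0, 7.5])      # actual values (example numbers)
        y_hat = np.array([2.8, 5.4, 7.0])  # predictions (example numbers)

        sst = np.sum((y - y.mean())**2)      # total variation
        sse = np.sum((y - y_hat)**2)         # unexplained variation
        ssr = np.sum((y_hat - y.mean())**2)  # explained variation
        # SST = SSR + SSE holds exactly for a least-squares fit with an intercept
        r2 = 1 - sse / sst                   # same value r2_score reports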


        from sklearn.metrics import r2_score

        true_values = [3.0, 5.0, 7.5]       # actual y values (example numbers)
        predicted_values = [2.8, 5.4, 7.0]  # model predictions (example numbers)

        r2 = r2_score(true_values, predicted_values)

Adjusted R²:

Adjusted R² is a modified version of R² that takes the number of predictors (X variables) in the model into account. It's useful in multiple linear regression, where multiple X variables predict a single Y variable, because plain R² never decreases when you add more predictors.
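
For reference, the standard formula, where n is the number of observations and p is the number of predictors:

$$R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$$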

I will go in depth in later posts.

3. The Validation Strategy: How to use these metrics?

Knowing the formulas is only half the battle. If you calculate R² on the same data the model learned from, you aren't checking "how bad" the model is; you're just checking how good its memory is. A model that memorises its training data instead of learning the underlying pattern is said to be Overfitting. A model's real performance can only be evaluated on new data it hasn't seen: the testing data. So, to evaluate a model, we need to split our data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.

A common split ratio is 80-20 or 70-30, where 80% or 70% of the data is used for training and the rest for testing.

After training the model on the training set and calculating its metrics, we use it to make predictions on the testing set and calculate the error and fit metrics discussed above.
The final grade, evaluated on the testing dataset, is what matters; the metrics calculated from the training set are used to optimise our model.
If someone asks, "What is the accuracy of this model?", they are asking for the testing evaluation.

Moreover, as mentioned in the beginning, always use more than one error metric. For example, MAE might be very low on our model, which makes it look good, while MSE is high. That combination suggests a few large outliers, which MAE largely ignored.
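
A minimal end-to-end sketch of the split-train-evaluate workflow (the synthetic data and plain LinearRegression are just for illustration):

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

        # synthetic data for illustration: y ≈ 3x + noise
        rng = np.random.default_rng(42)
        X = rng.uniform(0, 10, size=(100, 1))
        y = 3 * X.ravel() + rng.normal(0, 1, size=100)

        # 80-20 split: train on 80% of the data, grade on the held-out 20%
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        model = LinearRegression().fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # report more than one metric, as advised above
        print("MAE:", mean_absolute_error(y_test, y_pred))
        print("MSE:", mean_squared_error(y_test, y_pred))
        print("R²: ", r2_score(y_test, y_pred))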

3.1 Cross Validation:

However, training and testing on a single split alone can be misleading; by luck of the split, it can overestimate (or underestimate) our model's performance.
Hence, there's a method called Cross Validation, a resampling technique that guards against these issues. It works by:

  1. Splitting the dataset into several parts.
  2. Training the model on some parts and testing it on the remaining part.
  3. Repeating this resampling process multiple times by choosing different parts of the dataset.
  4. Averaging the results from each validation step to get the final performance.
There are several types of cross validation techniques. One of the more popular ones is K-Fold.

In K-Fold validation, our data is split into k parts (folds). The model is trained on k-1 folds and tested on the remaining fold, and this is repeated k times until every fold has been tested exactly once (a code sketch follows the example below).
For example, say our data is divided into 3 folds: I, II, III. The sampling process then consists of 3 iterations:
  1. The model is trained on I, II and tested on III
  2. The model is trained on II, III and tested on I
  3. The model is trained on I, III and tested on II
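
A minimal K-Fold sketch with scikit-learn, reusing the synthetic X and y from the split example above (cv=3 mirrors the three folds I, II, III):

        from sklearn.model_selection import cross_val_score
        from sklearn.linear_model import LinearRegression

        # 3-fold cross validation: each fold serves as the test set exactly once
        scores = cross_val_score(LinearRegression(), X, y, cv=3, scoring="r2")
        print(scores)         # one R² per fold
        print(scores.mean())  # the averaged final performance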


Comparison of our train and test metrics can tell us a lot about our model's performance.
Result                             | Term                        | Meaning
Low Train Error / High Test Error  | Overfitting / High Variance | The model "memorised" the training data, so it fails on new data.
High Train Error / High Test Error | Underfitting / High Bias    | The model didn't really learn the pattern; it's too simple.
Low Train Error / Low Test Error   | Good Fit                    | This is what we want.

Learning Resources:

  1. StatQuest with Josh Starmer: Linear Regression, Clearly Explained!!!

  2. Geeksforgeeks: Regression Metrics

  3. Kaggle: Regression Models Evaluation Metrics

  4. Prof. Ryan Ahmed: Machine Learning Regression Models Metrics

Thank you for reading.