NOTE:
The concepts of Error Metrics and Fit Metrics apply to all types of ML models, but the formulas and interpretations vary between model types (classification, regression, etc.). Here, we will focus on regression models.
None of this is specific to Linear Regression.
As mentioned before, there are several metrics for error calculation, and each is optimised for different data and predictions.
It is advised to use at least two of them to evaluate your model.
MSE amplifies the bigger errors and shrinks the already-small ones, since the errors are squared.
That's why a low MSE value doesn't always mean a better model.
For example:
A difference of 10 between predicted and actual -> becomes 100
A difference of 0.2 between predicted and actual -> becomes 0.04
This means that MSE is more sensitive to outliers.
If you don't want outliers (data points that deviate strongly from the rest of the data) to dominate your metric, skip MSE.
from sklearn.metrics import mean_squared_error
true_values = [X,Y,Z] # your actual target values
predicted_values = [X_p,Y_p,Z_p] # your model's predictions
mse = mean_squared_error(true_values, predicted_values)
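To see the outlier effect concretely, here is a quick sketch with made-up numbers (one prediction is off by 10, the rest by at most 0.2):

```python
from sklearn.metrics import mean_squared_error

# Made-up values purely for illustration
true_values      = [10, 20, 30, 40, 50]
predicted_values = [10.2, 19.8, 30.1, 39.9, 60]  # the last prediction is off by 10

mse = mean_squared_error(true_values, predicted_values)
print(mse)  # ~20.02 -- the single error of 10 alone contributes 100/5 = 20
```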
We take the absolute value of the difference because we don't want errors to cancel each other out and, in the end, contribute nothing to the metric.
For example:
A datapoint's error of +10 won't cancel out another datapoint's error of -10 (since they both are taken as abs(10) -> 10 )
MAE is more robust (less sensitive) to outliers since it doesn't square the errors. On the other hand, it doesn't penalise large errors as heavily, and it shows no direction of the error (positive or negative).
Moreover, unlike MSE, it isn't differentiable at x = 0, which makes it harder to use in gradient descent calculations.
MAE is more human-interpretable as its units are the same as the target variable. When introducing the model, you can say "On average, our model is off by {MAE} units"
from sklearn.metrics import mean_absolute_error
true_values = [X,Y,Z]
predicted_values = [X_p,Y_p,Z_p]
mae = mean_absolute_error(true_values, predicted_values)
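With the same made-up numbers as in the MSE sketch above, MAE stays small and reads directly in the target's units:

```python
from sklearn.metrics import mean_absolute_error

# Same made-up values as in the MSE sketch above
true_values      = [10, 20, 30, 40, 50]
predicted_values = [10.2, 19.8, 30.1, 39.9, 60]

mae = mean_absolute_error(true_values, predicted_values)
print(mae)  # 2.12 -> "on average, our model is off by about 2.1 units"
# Compare with the MSE of ~20.02 above: the single large error dominates MSE but barely moves MAE.
```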
This is similar to MAE in terms of interpretation, as its units also match the target variable. However, RMSE gives relatively higher weight to large errors, since the errors are squared before averaging.
RMSE can also be read as roughly the standard deviation of the prediction errors (residuals). This means it tells you how spread out the errors are, and it is more skewed by larger errors.
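In other words, RMSE is just the square root of MSE:
$$RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$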
from sklearn.metrics import mean_squared_error
import numpy as np #for square root
true_values = [X,Y,Z]
predicted_values = [X_p,Y_p,Z_p]
rmse = np.sqrt(mean_squared_error(true_values, predicted_values)) # option 1: just the square root of MSE, using numpy
rmse = mean_squared_error(true_values, predicted_values, squared=False) # option 2: squared=False (newer scikit-learn versions provide root_mean_squared_error instead)
But what are these quantities? What are SSE, SST and SSR?
SST (Total Sum of Squares): Distance from each point to the mean line ---> Measures total variation in the data
SSE (Error Sum of Squares): Distance from each point to regression line ---> Measures unexplained variation (errors/residuals)
SSR (Regression Sum of Squares): Distance from regression line to mean ---> Measures variation explained by the model
One thing to note: these values are squared and summed over every point, so the formulas are:
$$SST = \sum_i (y_i - \bar{y})^2 \quad\rightarrow\quad \text{(actual - mean)}^2$$
$$SSE = \sum_i (y_i - \hat{y}_i)^2 \quad\rightarrow\quad \text{(actual - regression)}^2$$
$$SSR = \sum_i (\hat{y}_i - \bar{y})^2 \quad\rightarrow\quad \text{(regression - mean)}^2$$
The graph helps us understand these quantities by calculating them for a single point (the orange point). $R^2$ is then defined as $R^2 = 1 - \frac{SSE}{SST}$. In some cases it can be negative, when SSE > SST, i.e. the error from the mean line is smaller than the error from our model line. This just means our model is very bad, worse than the baseline "predict the average" model, and we need to change it.
from sklearn.metrics import r2_score
true_values = [X,Y,Z]
predicted_values = [X_p,Y_p,Z_p]
r2 = r2_score(true_values, predicted_values)
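To tie the pieces together, here is a minimal sketch (again with made-up numbers) that computes SST, SSE and $R^2$ by hand and checks the result against r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values purely for illustration
true_values      = np.array([10, 20, 30, 40, 50])
predicted_values = np.array([12, 18, 33, 41, 48])

sst = np.sum((true_values - true_values.mean()) ** 2)  # total variation
sse = np.sum((true_values - predicted_values) ** 2)    # unexplained variation (residuals)

r2_manual = 1 - sse / sst
print(r2_manual)                                # 0.978
print(r2_score(true_values, predicted_values))  # same value
```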
Knowing the formulas is only half the battle. If you calculate $R^2$ on the same data the model learned from, you aren't checking how well the model generalises;
you're just checking how good its memory is. A model that merely memorises its training data is said to be Overfitting.
Our model's performance can only be judged on data it has never seen before: the testing data.
To evaluate a model, we need to split our data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
A common split ratio is 80-20 or 70-30, where 80% or 70% of the data is used for training and the rest for testing.
After training the model on the training set and calculating its metrics, we use it to make predictions on the testing set and calculate the error and fit metrics discussed above.
The final grade, evaluated on the testing dataset, is what matters. The metrics calculated from the training set are used to optimise our model.
If someone asks, "What is the accuracy of this model?" they are asking for the testing evaluation.
Moreover, as mentioned in the beginning, always use more than one error metric. For example, our model's MAE might be very low, which makes it look good, while its MSE is high.
This can be interpreted as having a few large outliers, which MAE downplays.
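A minimal sketch of the split-train-evaluate workflow with scikit-learn, reporting both MAE and MSE; here X and y stand in for your own feature matrix and target values:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# X and y are assumed to be your feature matrix and target values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80-20 split

model = LinearRegression()
model.fit(X_train, y_train)               # train only on the training set

test_predictions = model.predict(X_test)  # evaluate only on the unseen testing set
print(mean_absolute_error(y_test, test_predictions))
print(mean_squared_error(y_test, test_predictions))
```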
However, training and testing on a single split alone can sometimes be misleading; just by luck of the split, it can also overestimate our model's performance.
Hence, there's a method called Cross Validation, a resampling technique that prevents these issues.
(Diagram: 3-fold cross-validation. The data is split into folds I, II and III; in each round, one fold is held out for testing while the other two are used for training.)
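A minimal cross-validation sketch, again assuming X and y are your own data; cross_val_score refits the model on each split and returns one score per fold:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# X and y are assumed to be your feature matrix and target values
model = LinearRegression()

# 3-fold cross-validation: each fold takes one turn as the test set
scores = cross_val_score(model, X, y, cv=3, scoring="r2")
print(scores)         # one R^2 score per fold
print(scores.mean())  # average performance across folds
```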
| Result | Term | Meaning |
|---|---|---|
| Low Train Error / High Test Error | Overfitting / High Variance | The model "memorised" the training data, so it fails on new data. |
| High Train Error / High Test Error | Underfitting / High Bias | The model didn't learn the pattern at all; it's too simple. |
| Low Train Error / Low Test Error | Good Fit | This is what we want. |
Thank you for reading.
- EXIT -