In Machine Learning, Classification tasks involve predicting categorical labels for input data, and having robust evaluation metrics that capture the nuances of these predictions is crucial. If you want to know how to evaluate the performance of Classification models, this article is for you. In this article, I’ll take you through a guide to all classification metrics and their implementation using Python.
Classification Metrics in Machine Learning
To understand all the classification metrics, let’s train a model and go through each metric to understand how they work. Let’s start building a classification model first:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# Generate a binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression trained with SGD; loss='log_loss' was named 'log' before scikit-learn 1.1
clf = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001, max_iter=1000, tol=1e-3, random_state=42)
clf.fit(X_train, y_train)

# Predict class labels for the test set
y_pred = clf.predict(X_test)
Here, we are creating a synthetic dataset suitable for a binary classification problem, splitting it into training and test sets, training a logistic regression model via stochastic gradient descent on the training set, and then predicting class labels for the test set.
Now, let’s go through all the classification metrics one by one.
Accuracy
Accuracy is the most direct evaluation metric for classification models. It measures the proportion of predictions that the model gets right, across both classes. Accuracy is calculated by dividing the number of correctly classified instances by the total number of instances in the dataset.
The desired result for accuracy is a high score because it indicates a model that made a large number of correct predictions. A high accuracy score implies that the model performs well in classifying instances and capturing underlying patterns in the data.
Here’s how to calculate the accuracy score of your model:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.835
An accuracy score of 0.835 means that the model correctly classified 83.5% of the instances in the test set, which is 167 of the 200 test samples.
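To make the definition concrete, here is a minimal sketch of the same calculation in plain Python. The function name `manual_accuracy` and the toy labels are hypothetical and are not part of the article's dataset; the sketch only illustrates the ratio of correct predictions to total predictions.

```python
def manual_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels, chosen only for illustration
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = manual_accuracy(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)  # 6 of 8 predictions match -> 0.75
```

For label arrays like `y_test` and `y_pred`, this ratio should agree with what `accuracy_score` returns.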
Precision
Precision is an evaluation metric used to assess how trustworthy a model's positive predictions are. It measures the proportion of true positive predictions out of all positive predictions made by the model, that is, TP / (TP + FP). Precision provides information about the model’s ability to accurately identify positive instances, with an emphasis on minimizing false positive predictions.
The desired result for precision is a high score because it indicates that the model makes a minimal number of false positive predictions. In other words, the model correctly identifies positive instances without misclassifying negative instances as positive.
Here’s how to calculate the precision of your model:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
print("Precision: ", precision)
Precision: 0.8362068965517241
A precision score of 0.836 means that about 83.6% of the instances the model labelled as positive in the test set were actually positive. The remaining 16.4% of its positive predictions were false positives.
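The ratio behind precision can be sketched by counting true and false positives directly. The helper `manual_precision`, its `positive` parameter, and the toy labels below are assumptions made for illustration; returning 0.0 when nothing is predicted positive is also a choice of this sketch rather than a rule from the article.

```python
def manual_precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): share of positive predictions that are correct."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        # Assumption: report 0.0 when the model made no positive predictions
        return 0.0
    return tp / (tp + fp)


# Hypothetical toy labels, chosen only for illustration
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 1, 0, 1, 1, 0]

toy_precision = manual_precision(toy_true, toy_pred)
print("Precision:", toy_precision)  # 3 true positives out of 5 positive predictions
```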
Recall
Recall, also known as sensitivity or true positive rate, is an important evaluation metric used to assess the performance of a classification model. It measures the proportion of true positive predictions out of all actual positive instances in the dataset, that is, TP / (TP + FN).
The desired outcome for recall is a high score, indicating that the model successfully captures a large proportion of positive instances. A high recall score suggests that the model effectively minimizes false negatives and correctly identifies positives.
Here’s how to calculate the recall of your model:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
print("Recall: ", recall)
Recall: 0.8738738738738738
A recall score of 0.87 means that the model found about 87% of the actual positive instances in the test set and missed the remaining 13% as false negatives.
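As a rough sketch of the recall ratio, the snippet below counts true positives and false negatives by hand. The helper `manual_recall` and the toy labels are hypothetical. The second call predicts every instance as positive, which shows why a high recall on its own says little: it can be reached trivially at the cost of false positives.

```python
def manual_recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): share of actual positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        # Assumption: report 0.0 when the data contains no positive instances
        return 0.0
    return tp / (tp + fn)


# Hypothetical toy labels, chosen only for illustration
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 1, 0, 1, 1, 0]

toy_recall = manual_recall(toy_true, toy_pred)
print("Recall:", toy_recall)  # 3 of the 4 actual positives were found

# Predicting positive for everything finds every positive instance
all_positive_recall = manual_recall(toy_true, [1] * len(toy_true))
print("Recall when everything is predicted positive:", all_positive_recall)
```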
F1 Score
The F1 score is a widely used evaluation measure that combines precision and recall into a single number, their harmonic mean: F1 = 2 × (precision × recall) / (precision + recall). Because the harmonic mean is pulled towards the smaller of the two values, the F1 score is high only when the model keeps both false positives and false negatives low.
The desired result for the F1 score is a high value, suggesting that the model strikes a balance between precision and recall, effectively minimizing false positives and false negatives.
Here’s how to calculate the F1 score of your model:
from sklearn.metrics import f1_score
f1score = f1_score(y_test, y_pred)
print("F1 Score: ", f1score)
F1 Score: 0.8546255506607928
In this case, an F1 score of 0.85 suggests that the model strikes a relatively good balance between precision and recall.
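The F1 score can also be recovered by hand as the harmonic mean of precision and recall. The sketch below plugs in the precision and recall values reported earlier in this article; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Assumption: report 0.0 when both precision and recall are zero
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported earlier in the article
reported_precision = 0.8362068965517241
reported_recall = 0.8738738738738738

manual_f1 = f1_from(reported_precision, reported_recall)
print(f"F1 Score: {manual_f1:.4f}")  # F1 Score: 0.8546
```

The result sits between the two inputs and slightly closer to the lower one, which is the behaviour expected of a harmonic mean.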
Confusion Matrix
The confusion matrix is a table with four components: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Each component is the number of instances that fall into that combination of actual and predicted class.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positives (TP) | False Negatives (FN) |
| Actual Negative | False Positives (FP) | True Negatives (TN) |
The desired result for the confusion matrix is to have a high number of true positives and true negatives while minimizing the number of false positives and false negatives. A balanced and accurate model would ideally have high values for TP and TN, and low values for FP and FN.
Here’s how to calculate the confusion matrix of your model:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
Confusion Matrix:
[[70 19]
 [14 97]]
Looking at the matrix, keep in mind that scikit-learn puts the actual classes on the rows and the predicted classes on the columns, both in sorted label order. Class 0 (negative) therefore comes first, and the layout is [[TN, FP], [FN, TP]]:
- The top left cell represents the number of true negatives (TN), which is 70. These are cases where the model correctly predicted the negative class.
- The top right cell represents the number of false positives (FP), which is 19. These are cases where the model incorrectly predicted the positive class when the actual class was negative.
- The bottom left cell represents the number of false negatives (FN), which is 14. These are cases where the model incorrectly predicted the negative class when the actual class was positive.
- The bottom right cell represents the number of true positives (TP), which is 97. These are cases where the model correctly predicted the positive class.
These counts agree with the earlier scores: precision is 97 / (97 + 19) ≈ 0.836 and recall is 97 / (97 + 14) ≈ 0.874.
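For binary labels 0 and 1, `confusion_matrix` returns its cells in the order [[TN, FP], [FN, TP]]. The sketch below copies the matrix printed above into a plain nested list (the name `reported_cm` is hypothetical) and recomputes accuracy, precision, and recall from the four counts, as a quick consistency check against the scores reported earlier.

```python
# Matrix printed above, with rows = actual class (0, 1) and columns = predicted class (0, 1)
reported_cm = [[70, 19],
               [14, 97]]

# Unpack in scikit-learn's order for binary labels: [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = reported_cm

acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)
rec = tp / (tp + fn)

print(f"Accuracy:  {acc:.3f}")   # Accuracy:  0.835
print(f"Precision: {prec:.3f}")  # Precision: 0.836
print(f"Recall:    {rec:.3f}")   # Recall:    0.874
```

With a NumPy array returned by `confusion_matrix`, the same unpacking is commonly written as `tn, fp, fn, tp = cm.ravel()`.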
AUC & ROC
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are common evaluation measures used in binary classification tasks. They provide insight into the performance and discriminating power of a classification model.

The ROC curve is a graphical representation of the model’s performance at different classification thresholds. It plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values. The TPR is the proportion of actual positive instances correctly classified as positive, TP / (TP + FN), while the FPR is the proportion of actual negative instances incorrectly classified as positive, FP / (FP + TN).
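Each point on an ROC curve comes from applying one threshold to the model's scores and counting the resulting outcomes. The following is a minimal sketch of that step in plain Python; the helper `roc_point`, the toy labels, and the toy scores are all hypothetical, and the sketch assumes the data contains at least one positive and one negative instance.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are labelled positive."""
    tp = fp = fn = tn = 0
    for t, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if t == 1 and predicted_positive:
            tp += 1
        elif t == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # Assumes both classes are present, so neither denominator is zero
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical labels and model scores, chosen only for illustration
toy_true = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    fpr, tpr = roc_point(toy_true, toy_scores, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Raising the threshold moves the point towards the lower left of the plot: fewer false positives, but also fewer true positives.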
AUC is the numerical value that represents the area under the ROC curve. It summarizes the ability of the model to distinguish between positive and negative classes for all possible threshold values.
A higher AUC score indicates a better-performing model with a greater ability to discriminate between classes. An ROC curve that runs closer to the upper left corner reaches a high true positive rate (TPR) while keeping the false positive rate (FPR) low across classification thresholds. This indicates that the model has a better balance between correctly identifying positive instances and minimizing false positive errors.
Here’s how to use AUC & ROC to evaluate the performance of your classification model:
import plotly.graph_objects as go
from sklearn.metrics import roc_auc_score, roc_curve

# Calculating the false positive rate, true positive rate, and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)

# Creating the ROC curve
roc_trace = go.Scatter(
    x=fpr,
    y=tpr,
    name="ROC Curve",
    mode="lines",
    line=dict(color="green")
)

# Creating the diagonal line
diag_trace = go.Scatter(
    x=[0, 1],
    y=[0, 1],
    name="Diagonal",
    mode="lines",
    line=dict(color="gray", dash="dash")
)

# Creating the layout
layout = go.Layout(
    title="AUC & ROC Curve",
    xaxis=dict(title="False Positive Rate"),
    yaxis=dict(title="True Positive Rate"),
    showlegend=True,
)

# Creating the figure
fig = go.Figure(data=[roc_trace, diag_trace], layout=layout)

# Adding the AUC score to the plot
fig.add_annotation(
    x=0.5,
    y=0.1,
    text=f"AUC = {auc:.4f}",
    showarrow=False,
    font=dict(size=16),
)

# Show the plot
fig.show()
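An AUC score of 0.8302 is usually read as follows: the model ranks a randomly selected positive instance higher than a randomly selected negative instance about 83% of the time. That reading assumes the metric was computed from continuous scores. The code above passes the hard class labels in y_pred to roc_curve and roc_auc_score, so the curve has a single interior point and the AUC reduces to the average of the true positive rate and the true negative rate, (0.874 + 0.787) / 2 ≈ 0.830. Passing the model's scores instead, for example clf.decision_function(X_test), would trace the full curve across thresholds.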
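The ranking interpretation of AUC can be sketched directly: compare every positive instance with every negative instance and count how often the positive one receives the higher score, with ties counted as half. The helper `pairwise_auc` and the toy scores are hypothetical. The second example rebuilds hard 0/1 predictions from the confusion matrix counts shown earlier (97 TP, 14 FN, 19 FP, 70 TN) to show where a value of 0.8302 comes from when labels are used in place of scores.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical continuous scores, chosen only for illustration
toy_true = [0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
toy_auc = pairwise_auc(toy_true, toy_scores)
print(f"AUC from scores: {toy_auc:.4f}")  # 8 of 9 pairs ranked correctly -> 0.8889

# Hard 0/1 predictions rebuilt from the confusion matrix counts
hard_true = [1] * 111 + [0] * 89
hard_pred = [1] * 97 + [0] * 14 + [1] * 19 + [0] * 70
hard_auc = pairwise_auc(hard_true, hard_pred)
print(f"AUC from hard labels: {hard_auc:.4f}")  # 0.8302
```

With hard labels, most pairs are ties or clear-cut, which is why the result collapses to the average of the true positive rate and the true negative rate.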
So these were all the classification metrics in Machine Learning you should know.
Summary
Accuracy, precision, recall, the F1 score, the confusion matrix, and AUC & ROC each capture a different aspect of how well a classification model predicts categorical labels, so it helps to read them together. I hope you liked this article on Classification metrics in Machine Learning and their implementation using Python. You can learn many more concepts of Machine Learning from my book on Machine Learning Algorithms. Feel free to ask valuable questions in the comments section below.





