When building machine learning models, it is essential to have a baseline against which you can compare the performance of the model you are building. A baseline tells you whether you are on track and whether your model is actually improving.
The type of baseline you set usually depends on the type of machine learning problem you are solving.
For a classification problem, in which you predict the class or category that a set of variables belongs to, you need a different baseline method than for a regression problem, in which you predict a continuous value.
By the end of this tutorial, you will know:
- What we mean by a baseline for an ML model
- The various methods you can use to set a baseline
- How to set a baseline for your model
Methods For Setting A Baseline
There are several methods we can use. These include:
- Using cross_val_score()
- Using a DummyClassifier()
When setting up a baseline for a regression model, you can use a measure of central tendency of the target values, such as the mean, median or mode.
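As a minimal sketch of the central-tendency idea, the snippet below uses scikit-learn's DummyRegressor with the "mean" and "median" strategies. The toy data is made up purely for illustration. DummyRegressor has no built-in mode strategy, so a mode baseline would need something like strategy="constant" with a value you compute yourself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data, invented for illustration only (note the outlier in y)
X = np.arange(10).reshape(-1, 1)
y = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])

# Baseline that always predicts the mean of the training targets
mean_baseline = DummyRegressor(strategy="mean").fit(X, y)

# Baseline that always predicts the median of the training targets
median_baseline = DummyRegressor(strategy="median").fit(X, y)

# The features are ignored, so any input row gives the same prediction
mean_prediction = mean_baseline.predict(X[:1])[0]
median_prediction = median_baseline.predict(X[:1])[0]
print("Mean baseline predicts:  ", mean_prediction)
print("Median baseline predicts:", median_prediction)
```

Because of the outlier, the two baselines differ noticeably here, which is one reason to try more than one strategy before settling on a benchmark.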
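For a classification task, you can use a stratified strategy, which predicts classes at random in proportion to their frequency in the training data, or another strategy such as always predicting the most frequent class.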
Setting A Baseline For A Classifier ML Model
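The first method involves the use of a dummy classifier, which ignores the input features and makes predictions using simple rules. There is also a dummy regressor for regression problems.
Fortunately, we can use the DummyClassifier estimator from scikit-learn. This estimator takes a strategy that determines how its predictions are generated. The strategies include:
- most_frequent: always predict the most frequent class in the training data
- prior: predict the most frequent class, with predicted probabilities equal to the class distribution
- stratified: predict classes at random, following the class distribution of the training data
- uniform: predict classes uniformly at random
- constant: always predict a class label that you supply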
Let us check the code below.
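# Load Package
from sklearn.dummy import DummyClassifier
# Initialize Estimator
dummy_clf = DummyClassifier(strategy='stratified')
# Fit on the training data (X_train and y_train are assumed to be defined already)
dummy_clf.fit(X_train, y_train)
# Check for Model Accuracy on held-out data (X_test and y_test are also assumed)
baseline_accuracy = dummy_clf.score(X_test, y_test)
print(baseline_accuracy)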
You can change the strategy to suit your task. The result of the dummy classifier becomes a baseline against which we can compare the various ML estimators we try when building the main model.
A model that scores clearly above this baseline is learning something useful from the data. A score close to or below the baseline suggests that the model is doing no better than a simple rule and needs more work.
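To make the comparison concrete, here is a small end-to-end sketch. The dataset (scikit-learn's bundled breast cancer data), the train/test split settings and the choice of LogisticRegression are assumptions made for illustration, not part of the original walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset bundled with scikit-learn (assumed here for illustration)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: random predictions that follow the training class distribution
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Candidate model: scaled features followed by logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
model_accuracy = model.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {model_accuracy:.3f}")
```

The gap between the two printed numbers is what tells you how much the real model adds over a simple rule.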
The other method uses cross_val_score: you pick an estimator and a dataset (X, y), and the data is split into N folds for cross-validation. The mean and standard deviation of the cross-validation scores can then be used as a baseline for your final model.
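from sklearn.model_selection import cross_val_score
import numpy as np
# Estimator is a placeholder for the scikit-learn estimator of your choice
cv_results = cross_val_score(Estimator(), X, y, cv=5, scoring='accuracy')
# Find the mean & std
print(np.mean(cv_results), np.std(cv_results))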
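The two methods can also be combined by passing a DummyClassifier to cross_val_score. The sketch below assumes the bundled iris dataset and a DecisionTreeClassifier as the candidate model. Both are illustrative choices rather than anything prescribed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Example dataset bundled with scikit-learn (assumed here for illustration)
X, y = load_iris(return_X_y=True)

# Cross-validated accuracy of a baseline that always predicts one class
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5, scoring="accuracy"
)

# Cross-validated accuracy of the candidate model on the same folds
model_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring="accuracy"
)

# Summarize each set of fold scores with its mean and standard deviation
print(f"Baseline: {np.mean(baseline_scores):.3f} +/- {np.std(baseline_scores):.3f}")
print(f"Model:    {np.mean(model_scores):.3f} +/- {np.std(model_scores):.3f}")
```

Comparing the means together with the standard deviations gives a sense of whether the improvement over the baseline is larger than the fold-to-fold variation.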
These are some of the ways we can set a baseline for a machine learning classifier model. Of course, there are other methods, but these are a good starting point. You can check out our materials for more interesting aspects of Machine Learning in Python.