Generating Machine Learning Pipelines using TPOT and Genetic Programming in Python

Automated Machine Learning (AutoML) is a fascinating field of machine learning and data science in which we automate some of the stages of the ML life cycle, such as feature engineering, model building, etc.

It works best after the data has been cleaned and pre-processed properly.

There are several libraries and tools that can be used for AutoML, among them:

  • TPOT
  • AutoML
  • Auto-Sklearn
  • Auto-Keras
  • Featuretools
  • MLBox

In this tutorial, we will be exploring TPOT.

So what is TPOT?

TPOT is a Python library that uses genetic programming behind the scenes to generate an optimized ML pipeline. It uses the concepts of natural selection, survival of the fittest, and mutation to find the best machine learning model and the parameters required to produce the best result.

TPOT is an abbreviation that stands for Tree-based Pipeline Optimization Tool. It depends on powerful libraries like scikit-learn. Like the other AutoML libraries, it makes our work easier and saves us time to focus on other aspects of the ML life cycle.

Let us see how to work with TPOT.

Installation

pip install tpot

TPOT has two main classes, one for classification problems and one for regression problems. In our case we will be using the TPOTClassifier to generate our ML pipeline. One thing to note is that you will need a powerful system for it to run quickly, hence in our case we will be using Google Colab, but you can try it locally too.
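Both classes are exposed at the top level of the tpot package, so after installation they can be imported directly:

from tpot import TPOTClassifier, TPOTRegressor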

TPOT utilizes these 3 concepts (a toy sketch follows this list):

  • Selection: selecting the best individuals
  • Crossover: cross-breeding the selected best over and over to get the best of the best
  • Mutation: mutating the best to become even better
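To make the idea concrete, here is a toy sketch of such an evolutionary loop in plain Python. It is illustrative only, not TPOT's actual internals: the "individuals" stand in for candidate pipelines and the fitness function for a cross-validation score.

import random

# Toy evolutionary loop (illustrative only, not TPOT's actual code)
def evolve(population, fitness, mutate, crossover, generations=5):
    for _ in range(generations):
        # Selection: keep the better-scoring half of the population
        population.sort(key=fitness, reverse=True)
        parents = population[:len(population) // 2]
        # Crossover: breed new candidates from random pairs of parents
        children = [crossover(*random.sample(parents, 2))
                    for _ in range(len(population) - len(parents))]
        # Mutation: randomly perturb the offspring
        population = parents + [mutate(child) for child in children]
    return max(population, key=fitness)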

We will be using the famous iris dataset and then try different individual algorithms such as LogisticRegression and RandomForestClassifier. In most cases, when building a model you will have to do some hyper-parameter tuning as well as try different algorithms to find the best and optimal model. But with TPOT, that task is done for you.

Let us see what I mean. In the first part we will use individual algorithms from scikit-learn, compute cross_val_score with cv=10 folds, and then find the average. This means that we are evaluating each model on ten different train/test splits and averaging the scores.

We will then try the same with another algorithm, first with its default values and then with modified values, and finally with TPOT.

 

# Iris Data Sources
data_url = "https://raw.githubusercontent.com/Jcharis/Machine-Learning-Web-Apps/master/Iris-Species-Predictor-ML-Flask-App-With-Materialize.css/data/iris.csv"
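As an aside, if the remote CSV is ever unavailable, the same iris data ships with scikit-learn; a minimal alternative, assuming you are happy with numeric labels from the start:

# Alternative data source (illustrative)
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target  # features and numeric species labels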
In [0]:
# Load ML Pkgs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
In [0]:
# Load EDA pkg
import pandas as pd 
import numpy as np
In [0]:
# Load Our dataset
df = pd.read_csv(data_url)
In [23]:
df.head()
Out[23]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [24]:
# Checking for missing
df.isnull().sum()
Out[24]:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
In [25]:
df.columns
Out[25]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
In [0]:
# Species to Numerical
d = {value:index for index,value in enumerate(df['species'].unique())}
In [28]:
d
Out[28]:
{'setosa': 0, 'versicolor': 1, 'virginica': 2}
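As an aside, the same species-to-number encoding could be produced with scikit-learn's LabelEncoder; a minimal sketch:

# Equivalent encoding with LabelEncoder (illustrative)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['new_label'] = le.fit_transform(df['species'])  # setosa=0, versicolor=1, virginica=2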
In [0]:
df['new_label'] = df['species'].map(d)
In [30]:
df.head()
Out[30]:
sepal_length sepal_width petal_length petal_width species new_label
0 5.1 3.5 1.4 0.2 setosa 0
1 4.9 3.0 1.4 0.2 setosa 0
2 4.7 3.2 1.3 0.2 setosa 0
3 4.6 3.1 1.5 0.2 setosa 0
4 5.0 3.6 1.4 0.2 setosa 0
In [0]:
xfeatures = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
ylabels = df['new_label']
In [0]:
from sklearn.model_selection import cross_val_score
In [33]:
# Individual Algorithm
cv_scores = cross_val_score(LogisticRegression(),xfeatures,ylabels,cv=10)
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
  ... (the same ConvergenceWarning is repeated for several of the folds)
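The ConvergenceWarning above is harmless for our purposes, but if you want to silence it you can give the lbfgs solver more iterations (or scale the features first); a minimal sketch:

# Give the solver more iterations so it converges (illustrative)
cv_scores_converged = cross_val_score(LogisticRegression(max_iter=1000),
                                      xfeatures, ylabels, cv=10)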
In [34]:
cv_scores
Out[34]:
array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])
In [35]:
print(np.mean(cv_scores))
0.9733333333333334
In [0]:
# Individual Algorithm
rf_cv_scores = cross_val_score(RandomForestClassifier(),xfeatures,ylabels,cv=10)
In [40]:
rf_cv_scores
Out[40]:
array([1.        , 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])
In [37]:
print(np.mean(rf_cv_scores))
0.9666666666666666
Using RandomForestClassifier with the default params gave us 0.966. Let us modify the number of estimators and the maximum depth of the trees and see what we get.
In [0]:
# Individual Algorithm
rf_cv_scores2 = cross_val_score(RandomForestClassifier(n_estimators=100,max_depth=2),xfeatures,ylabels,cv=10)
In [39]:
rf_cv_scores2
Out[39]:
array([0.93333333, 0.93333333, 1.        , 0.93333333, 0.93333333,
       0.93333333, 0.86666667, 1.        , 1.        , 1.        ])
In [41]:
print(np.mean(rf_cv_scores2))
0.9533333333333334
Based on our results, the best scores for LogisticRegression, RandomForestClassifier (default), and RandomForestClassifier (tuned) were 0.97, 0.96, and 0.95 respectively.
You can observe that even tuning the parameters of the RandomForestClassifier gave us a different result compared to the one with the default values. This is where TPOT comes in: it makes it easier to find not just the best algorithm but also the right parameters to use.
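For contrast, here is a minimal sketch of what tuning just one algorithm by hand looks like with scikit-learn's GridSearchCV; TPOT effectively automates this kind of search across many algorithms and preprocessing steps at once (the parameter grid below is illustrative):

# Manual hyper-parameter search for a single algorithm (illustrative)
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [2, 4, None]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=10)
grid.fit(xfeatures, ylabels)
print(grid.best_score_, grid.best_params_)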
Working with TPOT
In [0]:
### AutoML with TPOT
In [42]:
!pip install tpot
Collecting tpot
  Downloading https://files.pythonhosted.org/packages/37/d8/719024ea20497eb6566ed5cc070e66e8c1e125e0e5d9966837cd00a3a83d/TPOT-0.11.2-py3-none-any.whl (76kB)
     |████████████████████████████████| 81kB 3.4MB/s 
Collecting deap>=1.2
  Downloading https://files.pythonhosted.org/packages/0a/eb/2bd0a32e3ce757fb26264765abbaedd6d4d3640d90219a513aeabd08ee2b/deap-1.3.1-cp36-cp36m-manylinux2010_x86_64.whl (157kB)
     |████████████████████████████████| 163kB 15.1MB/s 
Collecting stopit>=1.1.1
  Downloading https://files.pythonhosted.org/packages/35/58/e8bb0b0fb05baf07bbac1450c447d753da65f9701f551dca79823ce15d50/stopit-1.1.2.tar.gz
Requirement already satisfied: joblib>=0.13.2 in /usr/local/lib/python3.6/dist-packages (from tpot) (0.14.1)
Requirement already satisfied: pandas>=0.24.2 in /usr/local/lib/python3.6/dist-packages (from tpot) (1.0.3)
Requirement already satisfied: tqdm>=4.36.1 in /usr/local/lib/python3.6/dist-packages (from tpot) (4.41.1)
Requirement already satisfied: scikit-learn>=0.22.0 in /usr/local/lib/python3.6/dist-packages (from tpot) (0.22.2.post1)
Requirement already satisfied: scipy>=1.3.1 in /usr/local/lib/python3.6/dist-packages (from tpot) (1.4.1)
Collecting update-checker>=0.16
  Downloading https://files.pythonhosted.org/packages/d6/c3/aaf8a162df8e8f9d321237c7c0e63aff95b42d19f1758f96606e3cabb245/update_checker-0.17-py2.py3-none-any.whl
Requirement already satisfied: numpy>=1.16.3 in /usr/local/lib/python3.6/dist-packages (from tpot) (1.18.4)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.2->tpot) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.2->tpot) (2.8.1)
Requirement already satisfied: requests>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from update-checker>=0.16->tpot) (2.23.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas>=0.24.2->tpot) (1.12.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2020.4.5.1)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2.9)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-cp36-none-any.whl size=11956 sha256=0f4ef274cedf5bacc6a4962ec3557d79ab6bae70783af582a93b404e650ccb32
  Stored in directory: /root/.cache/pip/wheels/3c/85/2b/2580190404636bfc63e8de3dff629c03bb795021e1983a6cc7
Successfully built stopit
Installing collected packages: deap, stopit, update-checker, tpot
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.11.2 update-checker-0.17
In [0]:
import tpot
In [44]:
# Methods and Attributes
dir(tpot)
Out[44]:
['TPOTClassifier',
 'TPOTRegressor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_version',
 'base',
 'builtins',
 'config',
 'decorators',
 'driver',
 'export_utils',
 'gp_deap',
 'gp_types',
 'main',
 'metrics',
 'operator_utils',
 'tpot']
In [0]:
# Split in train and test
x_train,x_test,y_train,y_test = train_test_split(xfeatures,ylabels,test_size=0.3,random_state=42)
In [0]:
# Init
from tpot import TPOTClassifier  # the class itself must be imported; "import tpot" alone is not enough
# note: this rebinds the name "tpot" from the module to our classifier instance
tpot = TPOTClassifier(generations=5, verbosity=2)
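Beyond generations and verbosity, TPOTClassifier accepts a number of other useful settings (you can see the full list in the repr printed after fitting below); the values here are illustrative:

# A more controlled configuration (illustrative values, not used below)
TPOTClassifier(generations=5, population_size=50,  # smaller search
               max_time_mins=10,                   # hard time budget
               n_jobs=-1,                          # use all CPU cores
               random_state=42,                    # reproducible runs
               verbosity=2)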
In [51]:
# Fit data
tpot.fit(x_train,y_train)
[progress bar: Optimization Progress, max=600 pipeline evaluations (the initial population of 100 plus 100 offspring in each of the 5 generations)]
Generation 1 - Current best internal CV score: 0.9714285714285713
Generation 2 - Current best internal CV score: 0.9714285714285715
Generation 3 - Current best internal CV score: 0.9714285714285715
Generation 4 - Current best internal CV score: 0.9714285714285715
Generation 5 - Current best internal CV score: 0.9714285714285715

Best pipeline: KNeighborsClassifier(Nystroem(input_matrix, gamma=0.25, kernel=cosine, n_components=4), n_neighbors=20, p=1, weights=distance)
Out[51]:
TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=5,
               log_file=<ipykernel.iostream.OutStream object at 0x7f05559644a8>,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=100,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)
In [52]:
tpot.score(x_test,y_test)
Out[52]:
1.0
In [0]:
# Export the result
tpot.export('tpot_ml_pipeline.py')
TPOT saves the pipeline, together with the entire code, in a .py file which you can utilize for your task. You can even copy just the code for the ML algorithm and use it on its own if you don't want the entire pipeline.
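For reference, the exported script looks roughly like the sketch below; the exact contents depend on the TPOT version, and the placeholder path is written by TPOT itself and has to be filled in by hand:

import numpy as np
import pandas as pd
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# The best pipeline found above: Nystroem kernel approximation + KNN
exported_pipeline = make_pipeline(
    Nystroem(gamma=0.25, kernel="cosine", n_components=4),
    KNeighborsClassifier(n_neighbors=20, p=1, weights="distance")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)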
Let us try using our pipeline to make some predictions.
In [0]:
# Predictions
ex = np.array([6.2,3.4,5.4,2.3]).reshape(1,-1)
In [57]:
tpot.predict(ex)
Out[57]:
array([2])
In [58]:
d
Out[58]:
{'setosa': 0, 'versicolor': 1, 'virginica': 2}
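The prediction comes back as a numeric label; inverting the dictionary d maps it back to the species name:

# Map the numeric prediction back to the species name (illustrative)
inv_d = {index: species for species, index in d.items()}
print(inv_d[tpot.predict(ex)[0]])  # virginica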

 

To conclude, TPOT is quite useful for generating ML pipelines, and it can save you a lot of time to concentrate on other aspects of the Data Science/ML life cycle.

You can also check out the video tutorial below

 

Thanks For Your Time

Jesus Saves

By Jesse E.Agbe(JCharis)
