Automated Machine Learning (AutoML) is a fascinating field of machine learning and data science in which we automate some of the stages of the ML life cycle, such as feature engineering, model building, etc.
It works best after the data has been properly cleaned and pre-processed.
There are several libraries and tools that can be used for AutoML, among them:
TPOT
AutoML
Auto-Sklearn
Auto-Keras
Featuretools
MLBox
In this tutorial, we will be exploring TPOT.
So what is TPOT?
TPOT is a Python library that uses genetic programming behind the scenes to generate an optimized ML pipeline. It applies the concepts of natural selection, survival of the fittest, and mutation to find the best machine learning model and the parameters that produce the best result.
TPOT is an abbreviation that stands for Tree-based Pipeline Optimization Tool. It is built on top of powerful libraries like scikit-learn. Like other AutoML libraries, it makes our work easier and saves us time to focus on other aspects of the ML life cycle.
Let us see how to work with TPOT.
Installation
pip install tpot
TPOT has two main classes, TPOTClassifier for classification problems and TPOTRegressor for regression problems. In our case we will be using TPOTClassifier to generate our ML pipeline. One thing to take note of is that you will need a powerful system for it to run quickly, hence in our case we will be using Google Colab, but you can try it locally too.
TPOT utilizes these three concepts (see the sketch after this list for how they map to its parameters):
Selection: selecting the best pipelines
Crossover: cross-breeding the selected best over and over to get the best of the best
Mutation: mutating the best to become even better
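These concepts surface directly as constructor parameters of TPOTClassifier. A minimal sketch (the generations and population_size values here are illustrative, not from the original run; mutation_rate=0.9 and crossover_rate=0.1 are TPOT's documented defaults):

from tpot import TPOTClassifier

# generations: how many rounds of selection -> crossover -> mutation to run
# population_size: how many candidate pipelines compete in each generation
# mutation_rate / crossover_rate: how often pipelines are mutated vs. cross-bred
tpot = TPOTClassifier(generations=5,
                      population_size=50,
                      mutation_rate=0.9,
                      crossover_rate=0.1,
                      verbosity=2)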
We will be using the famous iris dataset and then try different individual algorithms such as LogisticRegression and RandomForestClassifier. In most cases, when building a model you will have to do some hyper-parameter tuning as well as try different algorithms to find the best and most optimal model. But with TPOT, that task is done for you.
Let us see what I mean. In the first part we will use individual algorithms from scikit-learn, evaluate each with cross_val_score over 10 folds (cv=10), and take the average. This gives us a more robust estimate of each model's score than a single train/test split would.
We will then try the same with another algorithm with its default values, then one with modified values, and finally with TPOT.
# Iris Data Sources
data_url="https://raw.githubusercontent.com/Jcharis/Machine-Learning-Web-Apps/master/Iris-Species-Predictor-ML-Flask-App-With-Materialize.css/data/iris.csv"
# Load ML Pkgs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
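The cell that loaded the data and ran the cross-validation is not shown above, so here is a minimal sketch of that step. It assumes the iris CSV at data_url has the usual four feature columns plus a 'species' label column:

import pandas as pd
from sklearn.model_selection import cross_val_score

# Load the iris dataset from the URL defined earlier
df = pd.read_csv(data_url)
Xfeatures = df.drop('species', axis=1)
ylabels = df['species']

# Evaluate LogisticRegression with 10-fold cross-validation and average the scores
lr_scores = cross_val_score(LogisticRegression(), Xfeatures, ylabels, cv=10)
print(lr_scores.mean())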
Running the LogisticRegression evaluation emitted a ConvergenceWarning for each fold (shown once here; it repeats per fold):
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Using RandomForestClassifier with the default params gave us 0.966. Let us modify the params, namely the depth of the trees (max_depth) and the number of estimators (n_estimators), and see what we get.
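The tuned run itself is not shown; a minimal sketch under the assumption that max_depth and n_estimators were the modified parameters (the specific values here are illustrative):

# RandomForest with manually tuned hyper-parameters
rf_tuned = RandomForestClassifier(n_estimators=20, max_depth=3)
rf_scores = cross_val_score(rf_tuned, Xfeatures, ylabels, cv=10)
print(rf_scores.mean())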
Based on our results, the average scores for LogisticRegression, RandomForestClassifier (default) and RandomForestClassifier (tuned) were 0.97, 0.96 and 0.95 respectively.
You can observe that tuning the parameters of the RandomForestClassifier gave us a different (here, slightly worse) result compared to the default values. This is where TPOT comes in: it makes it easier to find not just the best algorithm but also the right parameters to use.
Working with TPOT
### AutoML with TPOT
!pip install tpot
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.11.2 update-checker-0.17
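The cell that actually ran TPOT is not shown above; a minimal sketch, assuming a standard train/test split of the same iris features and labels (generations=5 matches the output below; test_size, random_state and population_size are illustrative assumptions):

from tpot import TPOTClassifier
from sklearn.preprocessing import LabelEncoder

# Encode the string species labels as integers for TPOT
y_encoded = LabelEncoder().fit_transform(ylabels)

# Hold out a test set for final evaluation
X_train, X_test, y_train, y_test = train_test_split(Xfeatures, y_encoded, test_size=0.3, random_state=42)

# Search for the best pipeline over 5 generations; verbosity=2 prints the progress below
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)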
Generation 1 - Current best internal CV score: 0.9714285714285713
Generation 2 - Current best internal CV score: 0.9714285714285715
Generation 3 - Current best internal CV score: 0.9714285714285715
Generation 4 - Current best internal CV score: 0.9714285714285715
Generation 5 - Current best internal CV score: 0.9714285714285715
Best pipeline: KNeighborsClassifier(Nystroem(input_matrix, gamma=0.25, kernel=cosine, n_components=4), n_neighbors=20, p=1, weights=distance)
# Export the result
tpot.export('tpot_ml_pipeline.py')
TPOT will save the pipeline together with the entire code in a .py file which you can utilize for your task. You can even copy just the code for the ML algorithm if you don't want to use the entire pipeline. You can check out the generated pipeline below.
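Reconstructed from TPOT's standard export template and the best pipeline printed above, tpot_ml_pipeline.py should look roughly like this (the data path and separator placeholders are written by TPOT itself, for you to fill in):

import numpy as np
import pandas as pd
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

# Best pipeline found by TPOT: Nystroem kernel approximation feeding a KNN classifier
exported_pipeline = make_pipeline(
    Nystroem(gamma=0.25, kernel="cosine", n_components=4),
    KNeighborsClassifier(n_neighbors=20, p=1, weights="distance")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)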
Let us try to use our pipeline to make some predictions.
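A minimal sketch, using the fitted tpot object and the held-out test split assumed above:

# Score the discovered pipeline on the held-out test set
print(tpot.score(X_test, y_test))

# Make predictions with the fitted pipeline
print(tpot.predict(X_test[:5]))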