Machine Learning with PyCaret in Python

In this fourth industrial age – the Age of AI – knowledge of machine learning and its applications is essential for every business and industry. Machine learning offers a lot of potential and benefit when incorporated into your business. But not everyone has the time and energy to learn the intricacies of Data Science and the various ML algorithms required to benefit from them.

This is where PyCaret comes into play. PyCaret is an open-source, easy-to-use Python library for doing machine learning. It is simple yet powerful, in the sense that you don't need to know all the ML algorithms and their nitty-gritty before creating a production-ready model for your business.

PyCaret makes it easier for you. It acts as a wrapper around the most popular machine learning libraries such as Scikit-learn, XGBoost, LightGBM, etc. It also offers a simple API of functions that you can use to build and evaluate several models without much stress.

In this tutorial we will explore PyCaret and see how to use it to predict mortality from heart failure among patients.

Installation

To install PyCaret you can use pip as below:

pip install pycaret
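
Once installed, you can quickly confirm which version you have, since PyCaret's API has changed a bit across releases. A minimal sanity check using the version() helper from pycaret.utils:

# Quick sanity check of the installation
from pycaret.utils import version
version()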

So what can one do with PyCaret? With PyCaret you can easily do the following:

  • Compare Models
  • Create Models
  • Tune Models
  • Evaluate Models
  • Interpret Models
  • Make predictions with the Model
  • Save and Load Models (Model Serialization)
  • Deploy Models

PyCaret is low-code but powerful – you need just a little code to do all the required activities. For example, to compare different models you just call the compare_models() function and boom – you have a dataframe of several trained ML models from which you can select as you wish.

Let us start with the basic workflow; a compact code sketch follows the list below.

Workflow

  • Prepare Data
  • Initialize Setup
    • Define the data and the target class
  • Compare Model
  • Create Model
    • Select the one you want
  • Check the accuracy of a selected model (predict)
  • Tune model
  • Evaluate model
  • Interpret Model
  • Save model
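
Before we walk through each step in detail, here is a compact sketch of that workflow in code (the same functions we use throughout this tutorial; df is the dataframe we load below):

import pycaret.classification as pc

# 1. Initialize the environment and transformation pipeline
clf = pc.setup(data=df, target='class')

# 2. Compare several algorithms on the data
pc.compare_models()

# 3. Create and tune a chosen model (10-fold CV by default)
model = pc.create_model('xgboost')
tuned = pc.tune_model('xgboost')

# 4. Evaluate, then persist the model for later use
pc.evaluate_model(tuned)
pc.save_model(tuned, 'my_model')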

We will be using our dataset from UCI and will be working inside Google's Colab – but you can also try it locally on your system.

Let us start

# Load EDA Pkgs
import pandas as pd
# Datasource
data_url = "https://raw.githubusercontent.com/Jcharis/data-science-projects/master/data-science-projects/notebooks/data/heart_failure_clinical_records_dataset.csv"
In [5]:
# Load Dataset
df = pd.read_csv(data_url)
In [6]:
df.head()
Out[6]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1
2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1
3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1
4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1
In [7]:
# Shape
df.shape
Out[7]:
(299, 13)
In [8]:
# Check for missing values
df.isnull().sum()
Out[8]:
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
In [9]:
# Columns
df.columns
Out[9]:
Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')
In [10]:
# Rename A Column
df.rename(columns={'DEATH_EVENT':'class'},inplace=True)
In [11]:
df.columns
Out[11]:
Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'class'],
      dtype='object')
In [12]:
# Descriptive Stats
df.describe()
Out[12]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time class
count 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.00000 299.000000 299.000000 299.00000 299.000000 299.00000
mean 60.833893 0.431438 581.839465 0.418060 38.083612 0.351171 263358.029264 1.39388 136.625418 0.648829 0.32107 130.260870 0.32107
std 11.894809 0.496107 970.287881 0.494067 11.834841 0.478136 97804.236869 1.03451 4.412477 0.478136 0.46767 77.614208 0.46767
min 40.000000 0.000000 23.000000 0.000000 14.000000 0.000000 25100.000000 0.50000 113.000000 0.000000 0.00000 4.000000 0.00000
25% 51.000000 0.000000 116.500000 0.000000 30.000000 0.000000 212500.000000 0.90000 134.000000 0.000000 0.00000 73.000000 0.00000
50% 60.000000 0.000000 250.000000 0.000000 38.000000 0.000000 262000.000000 1.10000 137.000000 1.000000 0.00000 115.000000 0.00000
75% 70.000000 1.000000 582.000000 1.000000 45.000000 1.000000 303500.000000 1.40000 140.000000 1.000000 1.00000 203.000000 1.00000
max 95.000000 1.000000 7861.000000 1.000000 80.000000 1.000000 850000.000000 9.40000 148.000000 1.000000 1.00000 285.000000 1.00000
In [13]:
# Value Count Plot
df['class'].value_counts()
Out[13]:
0    203
1     96
Name: class, dtype: int64
In [14]:
df['class'].value_counts().plot(kind='bar')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2258bb7128>
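
If you want a more readable chart, you can label the axes explicitly with matplotlib. An optional sketch (the label text is illustrative):

import matplotlib.pyplot as plt

# Labeled bar plot of the class distribution
df['class'].value_counts().plot(kind='bar')
plt.xlabel('class (DEATH_EVENT)')
plt.ylabel('count')
plt.title('Distribution of the target class')
plt.show()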

Using PyCaret for ML

In [15]:
import pycaret.classification
In [ ]:
# Methods/Attributes available
dir(pycaret.classification)
Out[ ]:
['X', 'X_test', 'X_train', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'blend_models', 'calibrate_model', 'compare_models', 'create_model', 'create_stacknet', 'deploy_model', 'ensemble_model', 'evaluate_model', 'experiment__', 'finalize_model', 'interpret_model', 'load_experiment', 'load_model', 'optimize_threshold', 'plot_model', 'predict_model', 'prep_pipe', 'save_experiment', 'save_model', 'seed', 'setup', 'stack_models', 'tune_model', 'y', 'y_test', 'y_train']
In [16]:
# Simpler way: import with an alias
import pycaret.classification as pc
In [17]:
dir(pc)
Out[17]:
['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'blend_models',
 'calibrate_model',
 'compare_models',
 'create_model',
 'create_stacknet',
 'deploy_model',
 'ensemble_model',
 'evaluate_model',
 'finalize_model',
 'interpret_model',
 'load_experiment',
 'load_model',
 'optimize_threshold',
 'plot_model',
 'predict_model',
 'save_experiment',
 'save_model',
 'setup',
 'stack_models',
 'tune_model']

Initialize or Setup

  • setup() initializes the environment in PyCaret
  • it creates the transformation pipeline that prepares the data for ML
  • when working in Google Colab, first enable interactive displays with enable_colab() from pycaret.utils (as shown below)
In [18]:
from pycaret.utils import enable_colab
enable_colab()
Colab mode activated.
In [21]:
# Initialize the setup: define the data and the target column.
# setup() infers the variable types (confirm them when prompted)
# and splits the data into train/test sets (70/30 by default)
clf = pc.setup(data=df,target='class')
 
Setup Succesfully Completed!
Description Value
0 session_id 895
1 Target Type Binary
2 Label Encoded None
3 Original Data (299, 13)
4 Missing Values False
5 Numeric Features 6
6 Categorical Features 6
7 Ordinal Features False
8 High Cardinality Features False
9 High Cardinality Method None
10 Sampled Data (299, 13)
11 Transformed Train Set (209, 28)
12 Transformed Test Set (90, 28)
13 Numeric Imputer mean
14 Categorical Imputer constant
15 Normalize False
16 Normalize Method None
17 Transformation False
18 Transformation Method None
19 PCA False
20 PCA Method None
21 PCA Components None
22 Ignore Low Variance False
23 Combine Rare Levels False
24 Rare Level Threshold None
25 Numeric Binning False
26 Remove Outliers False
27 Outliers Threshold None
28 Remove Multicollinearity False
29 Multicollinearity Threshold None
30 Clustering False
31 Clustering Iteration None
32 Polynomial Features False
33 Polynomial Degree None
34 Trignometry Features False
35 Polynomial Threshold None
36 Group Features False
37 Feature Selection False
38 Features Selection Threshold None
39 Feature Interaction False
40 Feature Ratio False
41 Interaction Threshold None
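
Notice the randomly generated session_id in row 0 above; to make your experiment reproducible you can pass it explicitly, along with other common preprocessing options. A small sketch (session_id, train_size and normalize are standard setup() parameters; the values here are illustrative):

# Reproducible setup: fixed seed, explicit 70/30 split,
# and normalization of numeric features
clf = pc.setup(data=df, target='class',
               session_id=123, train_size=0.7, normalize=True)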
In [20]:
# Ignore selected columns during setup
pc.setup(data=df,target='class',ignore_features=['age','diabetes'])
 
Setup Succesfully Completed!
Description Value
0 session_id 3879
1 Target Type Binary
2 Label Encoded None
3 Original Data (299, 13)
4 Missing Values False
5 Numeric Features 6
6 Categorical Features 6
7 Ordinal Features False
8 High Cardinality Features False
9 High Cardinality Method None
10 Sampled Data (299, 13)
11 Transformed Train Set (209, 26)
12 Transformed Test Set (90, 26)
13 Numeric Imputer mean
14 Categorical Imputer constant
15 Normalize False
16 Normalize Method None
17 Transformation False
18 Transformation Method None
19 PCA False
20 PCA Method None
21 PCA Components None
22 Ignore Low Variance False
23 Combine Rare Levels False
24 Rare Level Threshold None
25 Numeric Binning False
26 Remove Outliers False
27 Outliers Threshold None
28 Remove Multicollinearity False
29 Multicollinearity Threshold None
30 Clustering False
31 Clustering Iteration None
32 Polynomial Features False
33 Polynomial Degree None
34 Trignometry Features False
35 Polynomial Threshold None
36 Group Features False
37 Feature Selection False
38 Features Selection Threshold None
39 Feature Interaction False
40 Feature Ratio False
41 Interaction Threshold None
Out[20]:
(     creatinine_phosphokinase  platelets  ...  sex_1  smoking_1
 0                       582.0  265000.00  ...    1.0        0.0
 1                      7861.0  263358.03  ...    1.0        0.0
 2                       146.0  162000.00  ...    1.0        1.0
 3                       111.0  210000.00  ...    1.0        0.0
 4                       160.0  327000.00  ...    0.0        0.0
 ..                        ...        ...  ...    ...        ...
 294                      61.0  155000.00  ...    1.0        1.0
 295                    1820.0  270000.00  ...    0.0        0.0
 296                    2060.0  742000.00  ...    0.0        0.0
 297                    2413.0  140000.00  ...    1.0        1.0
 298                     196.0  395000.00  ...    1.0        1.0
 
 [299 rows x 26 columns], 0      1
 1      1
 2      1
 3      1
 4      1
       ..
 294    0
 295    0
 296    0
 297    0
 298    0
 Name: class, Length: 299, dtype: int64,      creatinine_phosphokinase  platelets  ...  sex_1  smoking_1
 118                     113.0   203000.0  ...    0.0        0.0
 258                      66.0   233000.0  ...    1.0        0.0
 249                     207.0   223000.0  ...    0.0        0.0
 122                      96.0   228000.0  ...    0.0        0.0
 86                       47.0   173000.0  ...    1.0        0.0
 ..                        ...        ...  ...    ...        ...
 297                    2413.0   140000.0  ...    1.0        1.0
 80                       69.0   293000.0  ...    0.0        0.0
 261                     655.0   283000.0  ...    0.0        0.0
 141                     291.0   348000.0  ...    0.0        0.0
 77                      102.0   237000.0  ...    1.0        0.0
 
 [209 rows x 26 columns],      creatinine_phosphokinase  platelets  ...  sex_1  smoking_1
 246                    2017.0  314000.00  ...    1.0        0.0
 237                     232.0  173000.00  ...    1.0        0.0
 162                     582.0  448000.00  ...    1.0        1.0
 218                    1021.0  271000.00  ...    1.0        0.0
 154                     335.0  235000.00  ...    0.0        0.0
 ..                        ...        ...  ...    ...        ...
 107                    1876.0  226000.00  ...    1.0        0.0
 239                     180.0  263358.03  ...    1.0        1.0
 135                     582.0  263358.03  ...    1.0        0.0
 185                     104.0  389000.00  ...    1.0        0.0
 48                      553.0  140000.00  ...    1.0        0.0
 
 [90 rows x 26 columns], 118    0
 258    0
 249    0
 122    0
 86     0
       ..
 297    0
 80     0
 261    0
 141    0
 77     0
 Name: class, Length: 209, dtype: int64, 246    1
 237    0
 162    0
 218    0
 154    0
       ..
 107    0
 239    0
 135    0
 185    1
 48     1
 Name: class, Length: 90, dtype: int64, 3879, Pipeline(memory=None,
...

Compare Multiple Models and their Accuracy Metrics

  • Produces a report similar to a classification report
  • For classification problems:
    • Accuracy, AUC, Recall, Precision, F1 score, Kappa
  • For regression problems:
    • MAE, MSE, RMSE, R2, RMSLE and MAPE
In [22]:
# Compare models
pc.compare_models()
Out[22]:
Model Accuracy AUC Recall Prec. F1 Kappa
0 Light Gradient Boosting Machine 0.852100 0.874000 0.745200 0.796900 0.764400 0.657300
1 CatBoost Classifier 0.832900 0.909200 0.688100 0.794500 0.726700 0.608000
2 Ada Boost Classifier 0.819000 0.835400 0.631000 0.763900 0.680400 0.558800
3 Ridge Classifier 0.818600 0.000000 0.635700 0.770500 0.683600 0.559200
4 Linear Discriminant Analysis 0.818600 0.895200 0.650000 0.759500 0.685900 0.561300
5 Extra Trees Classifier 0.818600 0.885800 0.609500 0.780000 0.682200 0.558000
6 Extreme Gradient Boosting 0.818600 0.909800 0.659500 0.767100 0.699000 0.571800
7 Gradient Boosting Classifier 0.814000 0.862100 0.676200 0.750800 0.697800 0.566900
8 Random Forest Classifier 0.804300 0.884500 0.583300 0.775600 0.650100 0.520200
9 Logistic Regression 0.794500 0.843500 0.611900 0.740100 0.661000 0.516700
10 Naive Bayes 0.785200 0.827300 0.552400 0.731700 0.622000 0.477400
11 Decision Tree Classifier 0.737600 0.696000 0.581000 0.637800 0.593900 0.403200
12 Quadratic Discriminant Analysis 0.703300 0.717300 0.119000 0.308300 0.164000 0.114600
13 K Neighbors Classifier 0.598100 0.523900 0.135700 0.345000 0.179400 -0.045800
14 SVM – Linear Kernel 0.536700 0.000000 0.400000 0.128600 0.194400 0.000000
In [ ]:
# Compare models
# But ignore the models in the blacklist
pc.compare_models(blacklist=['svm'])
Out[ ]:
Model Accuracy AUC Recall Prec. F1 Kappa
0 Extreme Gradient Boosting 0.842400 0.885400 0.707100 0.789800 0.742300 0.629900
1 Light Gradient Boosting Machine 0.837600 0.880500 0.704800 0.785800 0.733300 0.618900
2 CatBoost Classifier 0.832600 0.912200 0.650000 0.794300 0.702100 0.592300
3 Extra Trees Classifier 0.818300 0.866600 0.600000 0.772900 0.663000 0.547300
4 Random Forest Classifier 0.814000 0.881900 0.531000 0.840700 0.627100 0.520100
5 Gradient Boosting Classifier 0.804000 0.878100 0.647600 0.739900 0.682200 0.542200
6 Linear Discriminant Analysis 0.804000 0.883100 0.628600 0.753100 0.670900 0.534900
7 Ridge Classifier 0.799300 0.000000 0.628600 0.733700 0.666000 0.525200
8 Logistic Regression 0.794800 0.826400 0.566700 0.769500 0.640800 0.500900
9 Ada Boost Classifier 0.789800 0.837700 0.659500 0.685200 0.666800 0.514700
10 Decision Tree Classifier 0.770000 0.741200 0.659500 0.660200 0.651300 0.481400
11 Quadratic Discriminant Analysis 0.751700 0.758600 0.271400 0.696700 0.378600 0.294500
12 Naive Bayes 0.746400 0.799900 0.373800 0.680000 0.477000 0.333500
13 K Neighbors Classifier 0.568800 0.401400 0.071400 0.190800 0.093200 -0.135500

Narrative

  • PyCaret builds models using several algorithms and compares them
  • It automatically sorts them from the best accuracy to the least
  • It highlights the best model for each of the reported metrics
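
One version note: this tutorial uses the PyCaret 1.x API. In PyCaret 2.x and later, compare_models() returns the best model object directly and the blacklist parameter was renamed to exclude, so the equivalent call would look roughly like this (a hedged sketch for newer releases):

# PyCaret 2.x style: capture the top-ranked model directly
best_model = pc.compare_models(exclude=['svm'], sort='Accuracy')
print(best_model)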

Creating the Model

  • Select the best model from the comparison
  • PyCaret performs cross-validation on it automatically
  • By default it runs 10-fold CV for the selected model
In [23]:
!pip install neatutils
Collecting neatutils
  Downloading https://files.pythonhosted.org/packages/b5/cd/e48c790539aaa5cd31302d6d5e298e3db250b144af355c871908e9ec800b/neatutils-0.0.1-py3-none-any.whl
Collecting click-help-colors<0.9,>=0.8
  Downloading https://files.pythonhosted.org/packages/b1/57/10d5b653c2fb9a529459163126623b0d47c29653c95e4b8f0ee4bbc0cb5d/click_help_colors-0.8-py3-none-any.whl
Requirement already satisfied: click<8.0.0,>=7.1.2 in /usr/local/lib/python3.6/dist-packages (from neatutils) (7.1.2)
Installing collected packages: click-help-colors, neatutils
Successfully installed click-help-colors-0.8 neatutils-0.0.1
In [24]:
# Simple tool to get the short abbreviation for an estimator/ML algorithm
#! pip install neatutils
import neatutils
neatutils.get_abbrev('Extreme Gradient Boosting')
Out[24]:
'xgboost'
In [25]:
# Create the model
xgboost_model = pc.create_model('xgboost')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8095 0.8556 0.6667 0.6667 0.6667 0.5333
1 0.8095 0.8444 0.8333 0.6250 0.7143 0.5758
2 0.8095 0.8673 0.5714 0.8000 0.6667 0.5385
3 0.7619 0.8776 0.7143 0.6250 0.6667 0.4828
4 0.7143 0.8469 0.5714 0.5714 0.5714 0.3571
5 0.9524 0.9490 0.8571 1.0000 0.9231 0.8889
6 0.7619 0.9286 0.4286 0.7500 0.5455 0.4000
7 0.8571 0.9796 0.7143 0.8333 0.7692 0.6667
8 0.8095 0.9490 0.5714 0.8000 0.6667 0.5385
9 0.9000 1.0000 0.6667 1.0000 0.8000 0.7368
Mean 0.8186 0.9098 0.6595 0.7671 0.6990 0.5718
SD 0.0661 0.0552 0.1233 0.1429 0.1047 0.1497
In [26]:
# Logistic Regression model
logreg_model = pc.create_model('lr')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8095 0.8222 0.6667 0.6667 0.6667 0.5333
1 0.6190 0.7556 0.5000 0.3750 0.4286 0.1515
2 0.7143 0.7551 0.5714 0.5714 0.5714 0.3571
3 0.7619 0.8367 0.5714 0.6667 0.6154 0.4444
4 0.7143 0.7755 0.5714 0.5714 0.5714 0.3571
5 0.9524 0.9898 0.8571 1.0000 0.9231 0.8889
6 0.9048 0.9592 0.7143 1.0000 0.8333 0.7692
7 0.8571 0.8673 0.5714 1.0000 0.7273 0.6400
8 0.7619 0.7449 0.4286 0.7500 0.5455 0.4000
9 0.8500 0.9286 0.6667 0.8000 0.7273 0.6250
Mean 0.7945 0.8435 0.6119 0.7401 0.6610 0.5167
SD 0.0949 0.0854 0.1137 0.2018 0.1388 0.2080
In [27]:
# Tune the Model
tuned_xgb = pc.tune_model('xgboost')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8571 0.8000 0.6667 0.8000 0.7273 0.6316
1 0.8095 0.8778 0.8333 0.6250 0.7143 0.5758
2 0.8095 0.8673 0.5714 0.8000 0.6667 0.5385
3 0.8571 0.8878 0.8571 0.7500 0.8000 0.6897
4 0.8095 0.8163 1.0000 0.6364 0.7778 0.6250
5 0.9524 0.9388 0.8571 1.0000 0.9231 0.8889
6 0.8095 0.8571 0.5714 0.8000 0.6667 0.5385
7 0.9048 1.0000 1.0000 0.7778 0.8750 0.8000
8 0.7619 0.8878 0.4286 0.7500 0.5455 0.4000
9 0.8500 0.9762 0.5000 1.0000 0.6667 0.5833
Mean 0.8421 0.8909 0.7286 0.7939 0.7363 0.6271
SD 0.0522 0.0610 0.1967 0.1194 0.1056 0.1320
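
tune_model() also exposes a few useful knobs: n_iter sets how many hyperparameter candidates the random grid search tries, fold the number of CV folds, and optimize the metric to maximize. A hedged sketch using the same 1.x-style call as above (the values are illustrative):

# Search more candidates and optimize for AUC instead of Accuracy
tuned_xgb_auc = pc.tune_model('xgboost', n_iter=50, fold=10, optimize='AUC')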
In [28]:
print(xgboost_model)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=895,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=0)
In [29]:
print(tuned_xgb)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0,
              learning_rate=0.11, max_delta_step=0, max_depth=20,
              min_child_weight=4, missing=None, n_estimators=900, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=895,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=0)
In [30]:
# Tune the model, explicitly optimizing for Accuracy (the default metric)
tuned_xgb_optimized = pc.tune_model('xgboost',optimize='Accuracy')
Accuracy AUC Recall Prec. F1 Kappa
0 0.8571 0.8000 0.6667 0.8000 0.7273 0.6316
1 0.8095 0.8778 0.8333 0.6250 0.7143 0.5758
2 0.8095 0.8673 0.5714 0.8000 0.6667 0.5385
3 0.8571 0.8878 0.8571 0.7500 0.8000 0.6897
4 0.8095 0.8163 1.0000 0.6364 0.7778 0.6250
5 0.9524 0.9388 0.8571 1.0000 0.9231 0.8889
6 0.8095 0.8571 0.5714 0.8000 0.6667 0.5385
7 0.9048 1.0000 1.0000 0.7778 0.8750 0.8000
8 0.7619 0.8878 0.4286 0.7500 0.5455 0.4000
9 0.8500 0.9762 0.5000 1.0000 0.6667 0.5833
Mean 0.8421 0.8909 0.7286 0.7939 0.7363 0.6271
SD 0.0522 0.0610 0.1967 0.1194 0.1056 0.1320
In [31]:
###  Evaluate the Model
pc.evaluate_model(tuned_xgb)
interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Hyperparameters', 'param…
In [32]:
# Plot model performance (the AUC curve by default)
pc.plot_model(tuned_xgb)
In [33]:
# Plot Prediction Error of Model
pc.plot_model(tuned_xgb,plot='error')
In [34]:
# Plot Confusion Matrix
pc.plot_model(tuned_xgb,plot='confusion_matrix')
In [35]:
# Feature Importance
pc.plot_model(tuned_xgb,plot='feature')
In [36]:
# Validation Curve
pc.plot_model(tuned_xgb,plot='vc')
In [37]:
# Optimize the probability threshold for the trained model
pc.optimize_threshold(tuned_xgb, true_negative = 1500, false_negative = -5000)
Optimized Probability Threshold: 0.11 | Optimized Cost Function: 45000
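
The idea is that you assign a gain or penalty to each confusion-matrix outcome and PyCaret searches for the probability cutoff that maximizes the resulting cost function. A hedged sketch specifying all four outcomes (true_positive and false_positive are the other two parameters this function accepts; the costs are illustrative):

# Illustrative costs: reward correct negatives, penalize misses
pc.optimize_threshold(tuned_xgb,
                      true_positive=0,
                      true_negative=1500,
                      false_positive=-2000,
                      false_negative=-5000)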
In [ ]:
# Save Models
pc.save_model(tuned_xgb,'xgb_saved_model_02072020')
Transformation Pipeline and Model Succesfully Saved
In [ ]:
# Loading the saved model
loaded_model = pc.load_model('xgb_saved_model_02072020')
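
The saved object bundles the whole transformation pipeline with the model, so the loaded model can score raw rows directly. A small sketch (unseen_df is a hypothetical dataframe with the same feature columns as the training data):

# Use the loaded pipeline for predictions right away
results = pc.predict_model(loaded_model, data=unseen_df)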
In [38]:
# Interpret Model
pc.interpret_model(tuned_xgb)
In [70]:
# Finalize Model For Prediction
final_xgb_model = pc.finalize_model(tuned_xgb)

Making a Single Prediction with PyCaret

You are required to use a dataframe when making a prediction. predict_model() takes the model you have built and a dataframe of unseen data, and returns that dataframe with two additional columns: Label for the predicted class and Score for the probability of that prediction. Hence, to make a prediction you need to supply your unseen data as a DataFrame.
In [ ]:
## Making a simple prediction with PyCaret
#### Method 1: create a DataFrame by slicing a row
#### Method 2: build a DataFrame from a dictionary (column_name: value)
In [71]:
# Method 1
df.iloc[1]
Out[71]:
age                             55.00
anaemia                          0.00
creatinine_phosphokinase      7861.00
diabetes                         0.00
ejection_fraction               38.00
high_blood_pressure              0.00
platelets                   263358.03
serum_creatinine                 1.10
serum_sodium                   136.00
sex                              1.00
smoking                          0.00
time                             6.00
class                            1.00
Name: 1, dtype: float64
In [72]:
df.iloc[[1]]
Out[72]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time class
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1
In [73]:
unseen_data = df.iloc[[1],:-1]
In [74]:
unseen_data
Out[74]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6
In [75]:
type(unseen_data)
Out[75]:
pandas.core.frame.DataFrame
In [76]:
# Predict with Model
prediction = pc.predict_model(final_xgb_model,data=unseen_data)
In [77]:
prediction
Out[77]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time Label Score
0 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1 0.9858
In [78]:
# Method 2 (dict => DataFrame)
df.columns.tolist()
Out[78]:
['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time', 'class']
In [79]:
col_names = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']
In [81]:
list(df.iloc[1])
Out[81]:
[55.0, 0.0, 7861.0, 0.0, 38.0, 0.0, 263358.03, 1.1, 136.0, 1.0, 0.0, 6.0, 1.0]
In [82]:
sample_values = [55.0, 0.0, 7861.0, 0.0, 38.0, 0.0, 263358.03, 1.1, 136.0, 1.0, 0.0, 6.0]
In [83]:
d = dict(zip(col_names,sample_values))
In [84]:
print(d)
Out[84]:
{'age': 55.0,
 'anaemia': 0.0,
 'creatinine_phosphokinase': 7861.0,
 'diabetes': 0.0,
 'ejection_fraction': 38.0,
 'high_blood_pressure': 0.0,
 'platelets': 263358.03,
 'serum_creatinine': 1.1,
 'serum_sodium': 136.0,
 'sex': 1.0,
 'smoking': 0.0,
 'time': 6.0}
In [85]:
unseen_data2 = pd.DataFrame([d])
In [86]:
unseen_data2
Out[86]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
0 55.0 0.0 7861.0 0.0 38.0 0.0 263358.03 1.1 136.0 1.0 0.0 6.0
In [87]:
# Predict with Model
prediction2 = pc.predict_model(final_xgb_model,data=unseen_data2)
In [88]:
prediction2
Out[88]:
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time Label Score
0 55.0 0.0 7861.0 0.0 38.0 0.0 263358.03 1.1 136.0 1.0 0.0 6.0 1 0.9558
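
From here the prediction is just two columns on the returned dataframe; a small sketch reading them off (Label and Score are the columns PyCaret appends, as seen above):

# Extract the predicted class and its probability score
label = prediction2['Label'].iloc[0]
score = prediction2['Score'].iloc[0]
print(f"Predicted class: {label} (probability: {score})")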

You can also check out the accompanying video tutorial.

To conclude, we have seen how simple yet powerful PyCaret is. It makes our work much easier. Thanks for your time.

Thanks For Your Attention

Jesus saves

By Jesse E. Agbe (JCharis)

 
