In this fourth industrial age, the Age of AI, knowledge of machine learning and its applications is essential for every business and industry. Incorporating it into your business holds a lot of potential and benefit. But not everyone has the time and energy to learn the intricacies of data science and the various ML algorithms required to benefit from machine learning.
This is where PyCaret comes into play. PyCaret is an open-source, simple-to-use Python library for doing machine learning. It is easy yet powerful in the sense that you don't need to know every ML algorithm and all the nitty-gritty before creating a production-ready model for your business.
PyCaret makes it easier for you. It acts as a wrapper around the most popular machine learning libraries such as scikit-learn, XGBoost, LightGBM, etc. It also offers a simple API of functions that you can use to build and evaluate several models without much stress.
In this tutorial we will explore PyCaret and see how to use it to predict mortality from heart failure among patients.
Installation
To install PyCaret, you can use pip as shown below:
pip install pycaret
So what can one do with PyCaret? With PyCaret you can easily do the following:
Compare Models
Create Models
Tune Models
Evaluate Models
Interpret Models
Make predictions with the Model
Save and Load Models (Model Serialization)
Deploy Models
PyCaret is low-code but powerful: you need only a little code to do all the required activities. For example, to compare different models you just call the compare_models() function and boom, you have a dataframe of several trained ML models from which you can select the one you wish.
Let us start with the basic workflow
Workflow
Prepare Data
Initialize Setup
Define the data and the target class
Compare Models
Create Model
Select the one you want
Check accuracy of a selected model (predict)
Tune Model
Evaluate Model
Interpret Model
Save Model
We will be using our heart failure dataset from UCI and will be working inside Google's Colab, but you can also try it locally on your system.
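The setup and model-comparison cells are not reproduced below, so here is a minimal sketch of those first steps. It assumes the heart failure CSV from UCI is available locally (the file name here is an assumption), that the target column is named class (as in the DataFrame shown later in this tutorial), and that PyCaret's classification module is imported as pc, the alias used by the rest of the code:
import pandas as pd
import pycaret.classification as pc

# Load the UCI heart failure dataset (file name is an assumption)
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Initialize the setup: define the data and the target class
pc.setup(data=df, target='class')

# Build and compare several ML models with cross-validated metrics
pc.compare_models()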
# Simple Tools to Get The Short/Abbrev for an Estimator/ML Algorithm
#!pip install neatutils
import neatutils
neatutils.get_abbrev('Extreme Gradient Boosting')
Out[24]:
'xgboost'
In [25]:
# Create the model
xgboost_model = pc.create_model('xgboost')
      Accuracy     AUC  Recall   Prec.      F1   Kappa
0       0.8095  0.8556  0.6667  0.6667  0.6667  0.5333
1       0.8095  0.8444  0.8333  0.6250  0.7143  0.5758
2       0.8095  0.8673  0.5714  0.8000  0.6667  0.5385
3       0.7619  0.8776  0.7143  0.6250  0.6667  0.4828
4       0.7143  0.8469  0.5714  0.5714  0.5714  0.3571
5       0.9524  0.9490  0.8571  1.0000  0.9231  0.8889
6       0.7619  0.9286  0.4286  0.7500  0.5455  0.4000
7       0.8571  0.9796  0.7143  0.8333  0.7692  0.6667
8       0.8095  0.9490  0.5714  0.8000  0.6667  0.5385
9       0.9000  1.0000  0.6667  1.0000  0.8000  0.7368
Mean    0.8186  0.9098  0.6595  0.7671  0.6990  0.5718
SD      0.0661  0.0552  0.1233  0.1429  0.1047  0.1497
In [26]:
# LogReg Model
logreg_model = pc.create_model('lr')
      Accuracy     AUC  Recall   Prec.      F1   Kappa
0       0.8095  0.8222  0.6667  0.6667  0.6667  0.5333
1       0.6190  0.7556  0.5000  0.3750  0.4286  0.1515
2       0.7143  0.7551  0.5714  0.5714  0.5714  0.3571
3       0.7619  0.8367  0.5714  0.6667  0.6154  0.4444
4       0.7143  0.7755  0.5714  0.5714  0.5714  0.3571
5       0.9524  0.9898  0.8571  1.0000  0.9231  0.8889
6       0.9048  0.9592  0.7143  1.0000  0.8333  0.7692
7       0.8571  0.8673  0.5714  1.0000  0.7273  0.6400
8       0.7619  0.7449  0.4286  0.7500  0.5455  0.4000
9       0.8500  0.9286  0.6667  0.8000  0.7273  0.6250
Mean    0.7945  0.8435  0.6119  0.7401  0.6610  0.5167
SD      0.0949  0.0854  0.1137  0.2018  0.1388  0.2080
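Both create_model() calls above print 10-fold cross-validation scores on the training data. To check a selected model's accuracy on the hold-out set that setup() reserved (the "check accuracy of a selected model" step in the workflow), you can call predict_model() on the trained model without passing any data, for example:
# Score the trained models on the hold-out set created by setup()
pc.predict_model(xgboost_model)
pc.predict_model(logreg_model)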
In [27]:
# Tune the Model
tuned_xgb = pc.tune_model('xgboost')

# Optimize threshold for trained model
pc.optimize_threshold(tuned_xgb, true_negative=1500, false_negative=-5000)
Optimized Probability Threshold: 0.11 | Optimized Cost Function: 45000
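The optimized threshold is not applied automatically. In later PyCaret releases (an assumption about the version you are running), predict_model() accepts a probability_threshold argument that lets you score data with the cost-optimized cut-off; a sketch:
# Sketch: apply the cost-optimized threshold at prediction time
# (assumes a PyCaret version whose predict_model() supports probability_threshold)
pc.predict_model(tuned_xgb, probability_threshold=0.11)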
In [ ]:
# Save Models
pc.save_model(tuned_xgb, 'xgb_saved_model_02072020')
Transformation Pipeline and Model Succesfully Saved
In [ ]:
# Loading the saved model
loaded_model = pc.load_model('xgb_saved_model_02072020')
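Because save_model() stores the whole transformation pipeline together with the estimator, the loaded object can be used for prediction directly; a quick sketch, assuming some unseen DataFrame new_data (a hypothetical name) with the same feature columns as the training data:
# The loaded pipeline applies the same preprocessing before predicting
pc.predict_model(loaded_model, data=new_data)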
In [38]:
# Interpret Model
pc.interpret_model(tuned_xgb)
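interpret_model() is based on SHAP and by default displays a summary plot of feature importance for the model (the plot itself is not reproduced here). Other SHAP plots can be requested through the plot argument, for example:
# SHAP correlation (dependence) plot instead of the default summary plot
pc.interpret_model(tuned_xgb, plot='correlation')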
In [70]:
# Finalize Model For Prediction
final_xgb_model = pc.finalize_model(tuned_xgb)
Making a Single Prediction with PyCaret
When making a prediction, you must supply your unseen data as a DataFrame. The predict_model() function takes the model you have built and a DataFrame of unseen data, and returns that DataFrame with two additional columns: one for the prediction label and the other for the probability score of that prediction.
In [ ]:
## Making A Simple Prediction with PyCaret
#### Create A Dataframe
#### Dictionary(columns_name:values)
In [71]:
# Method 1
df.iloc[1]
Out[71]:
age 55.00
anaemia 0.00
creatinine_phosphokinase 7861.00
diabetes 0.00
ejection_fraction 38.00
high_blood_pressure 0.00
platelets 263358.03
serum_creatinine 1.10
serum_sodium 136.00
sex 1.00
smoking 0.00
time 6.00
class 1.00
Name: 1, dtype: float64
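Note that df.iloc[1] with single brackets returns a pandas Series; predict_model() expects a DataFrame, so use double brackets to select the row as a one-row DataFrame instead: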
In [72]:
df.iloc[[1]]
Out[72]:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  \
1  55.0        0                      7861         0                 38                    0

   platelets  serum_creatinine  serum_sodium  sex  smoking  time  class
1  263358.03               1.1           136    1        0     6      1
In [73]:
unseen_data = df.iloc[[1], :-1]
In [74]:
unseen_data
Out[74]:
    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  high_blood_pressure  \
1  55.0        0                      7861         0                 38                    0

   platelets  serum_creatinine  serum_sodium  sex  smoking  time
1  263358.03               1.1           136    1        0     6
In [75]:
type(unseen_data)
Out[75]:
pandas.core.frame.DataFrame
In [76]:
# Predict with Model
prediction = pc.predict_model(final_xgb_model, data=unseen_data)
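A second way to build the unseen data, hinted at in the comment cell above ("Dictionary(columns_name:values)"), is to construct the one-row DataFrame from a dictionary. Below is a sketch with hypothetical patient values (every number here is made up for illustration); in the PyCaret versions of that era the appended columns are named Label and Score, so check your version's output if they differ:
import pandas as pd

# Method 2: build a single-row DataFrame from a dictionary (hypothetical values)
single_patient = pd.DataFrame({
    'age': [60.0],
    'anaemia': [0],
    'creatinine_phosphokinase': [250],
    'diabetes': [1],
    'ejection_fraction': [38],
    'high_blood_pressure': [0],
    'platelets': [262000.0],
    'serum_creatinine': [1.1],
    'serum_sodium': [137],
    'sex': [1],
    'smoking': [0],
    'time': [115],
})

prediction = pc.predict_model(final_xgb_model, data=single_patient)
print(prediction[['Label', 'Score']])  # prediction label and its probability score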