In this tutorial we will see how to classify text documents using machine learning, and then interpret our classification model with Eli5.

This is the workflow we will be using in this project.

Text Preprocessing
- We will use spaCy and basic Python to preprocess our documents and get a clean dataset.
- We will remove all stop words and build a tokenizer that reduces each word to its lemma.
- This improves the dataset that we will feed into our model.
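
As a rough, library-free illustration of the filtering steps above, the sketch below lowercases a sentence, strips punctuation, and drops stop words. The tiny `DEMO_STOP_WORDS` set and the `simple_preprocess` helper are hypothetical names made up for this example. The tutorial itself uses spaCy's much larger stop-word list, and this sketch does no lemmatization.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# spaCy's STOP_WORDS list is far larger.
DEMO_STOP_WORDS = {"the", "was", "with", "in", "and", "he", "a", "is", "it"}


def simple_preprocess(sentence):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in DEMO_STOP_WORDS:
            tokens.append(word)
    return tokens


print(simple_preprocess("The food was great, and the service was quick!"))
```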

Text Classification and Model Building
- Since we are working with text documents, we will use CountVectorizer or TF-IDF for vectorization.
- Vectorization here means converting our text into numbers so that our ML algorithm can work with it.
- We will then build our model using algorithms from the scikit-learn package.
- You can choose any algorithm you like.
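
To make "converting text into numbers" concrete, here is a minimal plain-Python sketch of the counting idea behind a count vectorizer. The `bag_of_words` helper and the three sample sentences are made up for this illustration. scikit-learn's CountVectorizer does its own tokenization and returns a sparse matrix instead of nested lists.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up sentences, for illustration only
docs = ["great pork sandwich", "poor audio quality", "great audio great price"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, which is the shape of input a classifier expects.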

Prediction
- We will then use our trained model to make some predictions.
- We will calculate the accuracy of our model.

Interpretation of Our Machine Learning Model
- The next step involves using Eli5 to help interpret and explain why our ML model gave a particular prediction.
- We will then check how trustworthy our interpretation is with Eli5’s metrics.

Let us look at the code.

In [1]:

# load EDA Pkgs
import pandas as pd
import numpy as np

In [2]:

# Load NLP pkgs
import spacy

In [3]:

from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')

In [4]:

# Use the punctuation characters from the string module
import string
punctuations = string.punctuation

In [5]:

# Creating a Spacy Parser
from spacy.lang.en import English
parser = English()

In [9]:

# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)

In [10]:

stopwords

In [11]:

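def spacy_tokenizer(sentence):
    # Tokenize the sentence with the spaCy parser
    mytokens = parser(sentence)
    # Lemmatize and lowercase each token; pronouns, whose lemma is the
    # placeholder "-PRON-", keep their lowercased surface form
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return mytokens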

In [12]:

ex1 = "He was walking with the walker in the Wall he may had sat and run with the runner"

In [13]:

spacy_tokenizer(ex1)

Out[13]:

['walk', 'walker', 'wall', 'sit', 'run', 'runner']

In [14]:

# Load ML Pkgs
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

In [15]:

# Load Interpretation Pkgs
import eli5

In [16]:

# Load dataset
df = pd.read_csv("sentimentdataset.csv")

In [17]:

df.head()

In [18]:

df.shape

Out[18]:

(2745, 4)

In [19]:

df.columns

Out[19]:

Index(['Unnamed: 0', 'Unnamed: 1', 'Message', 'Target'], dtype='object')

In [20]:

#Custom transformer using spaCy 
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

In [21]:

# Basic function to clean the text 
def clean_text(text):     
    return text.strip().lower()

In [22]:

# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1)) 
# classifier = LinearSVC()
classifier = SVC(C=150, gamma=2e-2, probability=True)

In [23]:

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [24]:

# Splitting Data Set
from sklearn.model_selection import train_test_split

In [25]:

# Features and Labels
X = df['Message']
ylabels = df['Target']

In [26]:

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)

In [27]:

X_train

In [28]:

X_train.shape

Out[28]:

(2196,)

In [29]:

# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

In [30]:

# Fit our data
pipe.fit(X_train,y_train)

Out[30]:

Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x000002CEDE98C9B0>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
      ...',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [31]:

X_test.shape

Out[31]:

(549,)
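
The training and test sizes follow from test_size=0.2. As a quick sanity check, the sketch below assumes scikit-learn rounds the test share up and gives the remainder to the training set. That rounding rule is our understanding of its behaviour, not something shown in the notebook.

```python
import math

n_rows = 2745        # rows in df, from df.shape
test_size = 0.2      # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)   # assumed rounding rule
n_train = n_rows - n_test

print(n_train, n_test)
```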

In [32]:

X_test.values[1]

Out[32]:

'It is a true classic.  '

In [33]:

# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)

In [35]:

print("Accuracy Score:",pipe.score(X_test, y_test))

Accuracy Score: 0.7540983606557377
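
Accuracy is the fraction of test messages whose predicted label matches the true label. The sketch below uses made-up labels to show the calculation, and the `accuracy` helper is a hypothetical name. With 549 test messages, the score reported above is consistent with 414 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels, for illustration only
y_true = [1, 1, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
print(accuracy(y_true, y_pred))   # 3 of 5 match

# The reported score is consistent with 414 correct out of 549
print(414 / 549)
```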

In [36]:

# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)

Great pork sandwich. Prediction=> 1
It is a true classic.   Prediction=> 0
It's close to my house, it's low-key, non-fancy, affordable prices, good food. Prediction=> 0
Audio Quality is poor, very poor. Prediction=> 0
We loved the biscuits!!! Prediction=> 1

Interpreting Our Model

To interpret a prediction, Eli5 needs the following:
- Data
- Model
- Target Names
- Function (here, pipe.predict_proba)

In [37]:

from eli5.lime import TextExplainer

In [38]:

pipe.predict_proba

Out[38]:

<function sklearn.pipeline.Pipeline.predict_proba(self, X)>

In [39]:

exp = TextExplainer(random_state=42)

In [42]:

X_test.values[0]

Out[42]:

'Great pork sandwich.'

In [43]:

exp.fit(X_test.values[0], pipe.predict_proba)

Out[43]:

TextExplainer(char_based=False,
       clf=SGDClassifier(alpha=0.001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
       power_t=0.5,
       random_state=<mtrand.RandomState object at 0x000002CED2CC19D8>,
       shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
       warm_start=False),
       expand_factor=10, n_samples=5000, position_dependent=False,
       random_state=42, rbf_sigma=None,
       sampler=MaskingTextSamplers(random_state=<mtrand.RandomState object at 0x000002CED2CC19D8>,
          sampler_params=None, token_pattern='(?u)\\b\\w+\\b',
          weights=array([0.7, 0.3])),
       token_pattern='(?u)\\b\\w+\\b',
       vec=CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
        vocabulary=None))
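
TextExplainer is built on the LIME idea: create many perturbed copies of the text, ask the black-box model for probabilities on each copy, and fit a simple, interpretable classifier to those answers (the SGDClassifier in the output above). The sketch below illustrates only the perturbation step in plain Python. The `mask_tokens` helper is hypothetical and much cruder than Eli5's own sampler.

```python
import random


def mask_tokens(text, n_samples=5, seed=42):
    """Return perturbed copies of text, each with a random subset of words removed."""
    rng = random.Random(seed)
    tokens = text.split()
    samples = []
    for _ in range(n_samples):
        # Keep each word with probability 0.5 (an arbitrary choice for this sketch)
        kept = [tok for tok in tokens if rng.random() < 0.5]
        samples.append(" ".join(kept))
    return samples


for sample in mask_tokens("Great pork sandwich."):
    print(repr(sample))
```

In the real explainer, each perturbed text would be passed to pipe.predict_proba, and the resulting probabilities become the training targets for the interpretable classifier.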

In [44]:

ylabels.unique()

Out[44]:

array([0, 1], dtype=int64)

In [45]:

target_names = ['Negative','Positive']

In [53]:

exp.show_prediction()

Out[53]:

y=1 (probability 0.857, score 1.789) top features

Contribution	Feature
+1.859	Highlighted in text (sum)
-0.070	<BIAS>

great pork sandwich.

In [46]:

exp.show_prediction(target_names=target_names)

Out[46]:

y=Positive (probability 0.857, score 1.789) top features

Contribution	Feature
+1.859	Highlighted in text (sum)
-0.070	<BIAS>

great pork sandwich.
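
The numbers in this output fit together. The score is the sum of the listed contributions, and because the explainer's classifier uses a logistic loss (loss='log' in the fit output), the probability should be the logistic sigmoid of that score. A quick check with the standard library, using values copied from the output:

```python
import math

highlighted = 1.859   # "Highlighted in text (sum)" from the output
bias = -0.070         # <BIAS> from the output

score = highlighted + bias
probability = 1.0 / (1.0 + math.exp(-score))

print(round(score, 3), round(probability, 3))
```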

In [47]:

exp.metrics_

Out[47]:

{'mean_KL_divergence': 0.0006893282060298322, 'score': 1.0}
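
Roughly speaking, mean_KL_divergence measures how far the explainer's probabilities are from the pipeline's probabilities on the generated samples, where 0.0 means a perfect match, and score is an accuracy-like measure of how well the explainer mimics the pipeline. The sketch below shows the KL divergence calculation for a single pair of distributions. The `kl_divergence` helper and all probabilities are made up for illustration, and Eli5 additionally weights samples by their distance from the original text.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


black_box = [0.14, 0.86]   # made-up pipeline probabilities
close_fit = [0.15, 0.85]   # made-up explainer probabilities, close match
poor_fit = [0.60, 0.40]    # made-up explainer probabilities, poor match

print(kl_divergence(black_box, close_fit))
print(kl_divergence(black_box, poor_fit))
```

A value near zero, like the one reported above, suggests the explanation tracks the pipeline closely on texts similar to this one.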

In [48]:

exp.show_weights()

Out[48]:

y=1 top features

Weight	Feature
+1.918	great
+0.144	great pork
+0.133	pork sandwich
-0.070	<BIAS>
-0.118	sandwich
-0.218	pork
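
These weights tie back to the prediction output. Each of the five features occurs once in "great pork sandwich", so adding their weights should give the "Highlighted in text (sum)" value of +1.859. A quick check, with the weights copied from the output:

```python
# Weights copied from the show_weights() output (bias excluded)
weights = {
    "great": 1.918,
    "great pork": 0.144,
    "pork sandwich": 0.133,
    "sandwich": -0.118,
    "pork": -0.218,
}

highlighted_sum = sum(weights.values())
print(round(highlighted_sum, 3))

# The feature with the largest absolute weight drives the prediction
top_feature = max(weights, key=lambda name: abs(weights[name]))
print(top_feature)
```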

In [50]:

# Check For Vectorizer Used
exp.vec_

Out[50]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
        vocabulary=None)

In [51]:

# Check For Classifier Used
exp.clf_

Out[51]:

SGDClassifier(alpha=0.001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
       power_t=0.5,
       random_state=<mtrand.RandomState object at 0x000002CECAE63240>,
       shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
       warm_start=False)

Thanks For Reading

Jesus Saves

Text Classification and ML Model Interpretation with ELi5,Sklearn and SpaCy
