In this tutorial we will see how to classify text documents using machine learning, and then interpret our classification model with Eli5.
This is the workflow we will use in this project.
- Text Preprocessing
  - We will use spaCy and basic Python to preprocess our documents and get a clean dataset.
  - We will remove stop words and punctuation, and build a tokenizer that reduces each word to its lemma.
  - This cleans up the dataset that we feed into our model.
- Text Classification and Model Building
  - Since we are working with text documents, we will use CountVectorizer or TfidfVectorizer (TF-IDF) for vectorization.
  - Vectorization here means converting our text into numbers so that our ML model can work with it.
  - We will then build our model using an algorithm from the scikit-learn package.
  - You can choose any classification algorithm you like.
- Prediction
  - We will then use our trained model to make some predictions.
  - We calculate the accuracy of our model.
- Interpretation of Our Machine Learning Model
  - The next step involves using Eli5 to interpret and explain why our ML model gave a particular prediction.
  - We will then check how trustworthy our interpretation is with Eli5's metrics.
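To make the vectorization step in the workflow above concrete, here is a minimal plain-Python sketch of a bag-of-words count matrix. It is only an illustration of the idea, not scikit-learn's implementation, and the three toy documents are invented for this example.

```python
from collections import Counter

# Toy documents, invented for illustration only
docs = ["good movie", "bad movie", "good good plot"]

# Build a sorted vocabulary from every whitespace-separated token
vocabulary = sorted({token for doc in docs for token in doc.split()})


def to_count_vector(doc, vocabulary):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


# One row per document, one column per vocabulary word
matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Each document becomes a row of numbers, which is the form a classifier can actually consume.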
Let us check the code
In [1]:
# load EDA Pkgs
import pandas as pd
import numpy as np
In [2]:
# Load NLP pkgs
import spacy
In [3]:
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('en_core_web_sm')
In [4]:
# Use the punctuation characters from the string module
import string
punctuations = string.punctuation
In [5]:
# Create a spaCy parser
from spacy.lang.en import English
parser = English()
In [9]:
# Build a list of stopwords to use to filter
stopwords = list(STOP_WORDS)
In [10]:
stopwords
In [11]:
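def spacy_tokenizer(sentence):
    # Tokenize the sentence with the spaCy parser
    mytokens = parser(sentence)
    # Lemmatize and lowercase each token; "-PRON-" is the pronoun placeholder lemma in spaCy 2.x, so keep the original word there
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return mytokens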
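To see what the filtering steps of the tokenizer do without needing spaCy installed, here is a simplified stand-in written in plain Python. It is a sketch under assumptions: `simple_tokenizer` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-picked sample rather than spaCy's list, and no lemmatization is performed.

```python
import string

# Tiny, hypothetical stop-word set; spaCy's STOP_WORDS is much larger
TOY_STOP_WORDS = {"he", "was", "with", "the", "in", "and"}


def simple_tokenizer(sentence):
    """Lowercase, strip surrounding punctuation, and drop stop words.

    Unlike spacy_tokenizer, this does NOT lemmatize the words.
    """
    tokens = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    # Skip empty strings (tokens that were pure punctuation) and stop words
    return [tok for tok in tokens if tok and tok not in TOY_STOP_WORDS]


print(simple_tokenizer("He was walking with the walker, and running in the park!"))
```

Only the content-bearing words survive, which is the kind of input we want to hand to the vectorizer.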
In [12]:
ex1 = "He was walking with the walker in the Wall he may had sat and run with the runner"
In [13]:
spacy_tokenizer(ex1)
In [14]:
# Load ML packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
In [15]:
# Load Interpretation Pkgs
import eli5
In [16]:
# Load dataset
df = pd.read_csv("sentimentdataset.csv")
In [17]:
df.head()
In [18]:
df.shape
In [19]:
df.columns
In [20]:
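# Custom transformer that cleans the raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Apply clean_text (defined in the next cell) to every document
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn here; fit exists so the class works inside a Pipeline
        return self

    def get_params(self, deep=True):
        return {}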
In [21]:
# Basic function to clean the text
def clean_text(text):
    # Remove surrounding whitespace and lowercase the text
    return text.strip().lower()
In [22]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
# classifier = LinearSVC()
classifier = SVC(C=150, gamma=2e-2, probability=True)
In [23]:
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)
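For readers wondering how TF-IDF differs from plain counts, here is a small plain-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which to our understanding matches the TfidfVectorizer defaults; treat it as an illustration rather than the library's code. The toy documents are invented.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (invented for illustration)
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {word: sum(1 for doc in docs if word in doc) for word in vocabulary}

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_vector(doc):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


for doc in docs:
    print([round(v, 3) for v in tfidf_vector(doc)], "<-", " ".join(doc))
```

Words that appear in fewer documents (here "bad" and "plot") get a larger idf than words that appear in many (here "movie"), so they weigh more in the final vector.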
In [24]:
# Splitting Data Set
from sklearn.model_selection import train_test_split
In [25]:
# Features and Labels
X = df['Message']
ylabels = df['Target']
In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)
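Conceptually, the split above shuffles the rows with a fixed seed and holds out 20% of them for testing. The sketch below shows that idea in plain Python; `simple_train_test_split` is a hypothetical helper written for this illustration, it will not reproduce the exact rows scikit-learn selects, and the toy messages and labels are invented.

```python
import random


def simple_train_test_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices reproducibly and hold out roughly `test_size` of them."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as in the cell above: X_train, X_test, y_train, y_test
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data, invented for illustration
messages = [f"message {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(messages, labels)
print(len(X_tr), "training rows,", len(X_te), "test rows")
```

Fixing the seed (the `random_state=42` argument in the real call) means the same rows land in the test set on every run, so accuracy numbers stay comparable between experiments.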
In [27]:
X_train
In [28]:
X_train.shape
In [29]:
# Create the pipeline to clean, tokenize, vectorize, and classify
pipe = Pipeline([('cleaner', predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])
In [30]:
# Fit our data
pipe.fit(X_train,y_train)
In [31]:
X_test.shape
In [32]:
X_test.values[1]
In [33]:
# Predicting with a test dataset
sample_prediction = pipe.predict(X_test)
In [35]:
print("Accuracy Score:",pipe.score(X_test, y_test))
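For a classifier, the score reported above is the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal plain-Python sketch of that calculation, using invented labels and a hypothetical `accuracy` helper, looks like this:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Invented labels: 1 = positive review, 0 = negative review
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

print("Accuracy Score:", accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```

Accuracy is easy to read, but it can look deceptively high when one class dominates the dataset, so it is worth checking the class balance in `df['Target']` as well.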
In [36]:
# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)
Below is a video tutorial of the entire process
Thanks For Reading
Jesus Saves