In this tutorial, which is part of the End-To-End Data Science Project using the Udemy Dataset, we will perform text classification using the course title and the subject category. Our aim is to predict the subject category given the course title.
By the end of this tutorial you will learn:
- What we mean by text classification
- Building Features From Textual Data
- Building ML Models using two approaches
- More
Text Classification
Text classification is the process of assigning text to a predefined category or class. It is a supervised machine learning technique, so it needs labelled examples to learn from. It is related to topic clustering, which groups similar texts with an unsupervised ML approach and therefore needs no labels.
There are several types of text classification:
- Binary Text Classification: classifying text into two target groups
- Multi Class Text Classification: classifying text into more than two target groups
- Multi Label Text Classification: classifying text where each item can be assigned more than one label at the same time
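The difference between the three types is easiest to see in the shape of the target values. The sketch below uses made-up course titles and labels (they are assumptions for illustration, not rows from the Udemy dataset). Our task in this tutorial is the multi-class case.

```python
# Hypothetical course titles and labels, made up for illustration only.

# Binary: every text gets one of exactly two labels.
binary_targets = {
    "Learn HTML in One Hour": "technical",
    "Watercolour Painting Basics": "non-technical",
}

# Multi-class: every text gets exactly one label out of more than two.
multiclass_targets = {
    "Learn HTML in One Hour": "Web Development",
    "Piano for Beginners": "Musical Instruments",
    "Financial Modeling 101": "Business Finance",
}

# Multi-label: a single text may carry several labels at the same time.
multilabel_targets = {
    "Build a Finance Dashboard with JavaScript": {"Web Development", "Business Finance"},
    "Piano for Beginners": {"Musical Instruments"},
}

for title, labels in multilabel_targets.items():
    print(title, "->", sorted(labels))
```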
We will be using the Udemy dataset, which is available on Kaggle or here. The dataset has a course_title column, which we will use as our input feature, and a subject column, which we will use as the target label.
We will use only these two columns: course_title and subject.
Let us start.
The basic workflow is to build our model first with the normal approach, where the vectorizer and the classifier are used as separate steps, and then with the alternative pipeline approach, where they are combined.
Building ML Model Using Normal Approach
First of all, we have to convert our text into numerical word vectors that the ML model can work with. Since ML algorithms require numerical input, we will perform some feature engineering using CountVectorizer or TfidfVectorizer. The main idea is to transform each title into a vector with one entry per vocabulary word, which our ML algorithm can then process.
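To make the idea of a word vector concrete, here is a minimal hand-rolled sketch of count vectorization on two cleaned titles. It is a simplified stand-in for what CountVectorizer does, not its actual implementation; the helper name `count_vector` is an assumption for this example.

```python
from collections import Counter

# Two cleaned course titles
docs = [
    "ultimate investment banking course",
    "complete gst course certification grow practice",
]

# Vocabulary: every distinct word, in sorted order, gets one column
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return one count per vocabulary word for a single title."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc) for doc in docs]
print(vocab)
for vec in vectors:
    print(vec)
```

Each title becomes a row of the same length, and most entries are zero because a title only uses a few words of the vocabulary.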
After this we will split our dataset into two parts, one for training and one for testing our model. Finally, we will fit our ML algorithm, which can be either LogisticRegression or Naive Bayes, on the vectorized training data.
We can also interpret our model using ELI5 or LIME.
In [29]:
# Load EDA Pkgs
import pandas as pd
import neattext.functions as nfx
In [30]:
# Load Data Viz
import seaborn as sns
In [31]:
# Load ML Pkgs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
In [32]:
# Load Dataset
df = pd.read_csv("data/udemy_courses.csv")
df['course_title'].apply(nfx.remove_stopwords)
Out[39]:
0       Ultimate Investment Banking Course
1       Complete GST Course & Certification - Grow Pra...
2       Financial Modeling Business Analysts Consultants
3       Beginner Pro - Financial Analysis Excel 2017
4       Maximize Profits Trading Options
        ...
3678    Learn jQuery Scratch - Master JavaScript library
3679    Design WordPress Website Coding
3680    Learn Build Polymer
3681    CSS Animations: Create Amazing Effects Website
3682    MODX CMS Build Websites: Beginner's Guide
Name: course_title, Length: 3683, dtype: object
In [44]:
# Remove stopwords
df['clean_course_title'] = df['course_title'].apply(nfx.remove_stopwords)
In [45]:
df[['clean_course_title','course_title']]
Out[45]:
 | clean_course_title | course_title |
---|---|---|
0 | Ultimate Investment Banking Course | Ultimate Investment Banking Course |
1 | Complete GST Course & Certification – Grow Pra… | Complete GST Course & Certification – Grow You… |
2 | Financial Modeling Business Analysts Consultants | Financial Modeling for Business Analysts and C… |
3 | Beginner Pro – Financial Analysis Excel 2017 | Beginner to Pro – Financial Analysis in Excel … |
4 | Maximize Profits Trading Options | How To Maximize Your Profits Trading Options |
… | … | … |
3678 | Learn jQuery Scratch – Master JavaScript library | Learn jQuery from Scratch – Master of JavaScri… |
3679 | Design WordPress Website Coding | How To Design A WordPress Website With No Codi… |
3680 | Learn Build Polymer | Learn and Build using Polymer |
3681 | CSS Animations: Create Amazing Effects Website | CSS Animations: Create Amazing Effects on Your… |
3682 | MODX CMS Build Websites: Beginner’s Guide | Using MODX CMS to Build Websites: A Beginner’s… |
3683 rows × 2 columns
In [46]:
# Remove special characters
df['clean_course_title'] = df['clean_course_title'].apply(nfx.remove_special_characters)
In [49]:
# Reduce to lowercase
df['clean_course_title'] = df['clean_course_title'].str.lower()
In [50]:
df[['clean_course_title','course_title']]
Out[50]:
 | clean_course_title | course_title |
---|---|---|
0 | ultimate investment banking course | Ultimate Investment Banking Course |
1 | complete gst course certification grow practice | Complete GST Course & Certification – Grow You… |
2 | financial modeling business analysts consultants | Financial Modeling for Business Analysts and C… |
3 | beginner pro financial analysis excel 2017 | Beginner to Pro – Financial Analysis in Excel … |
4 | maximize profits trading options | How To Maximize Your Profits Trading Options |
… | … | … |
3678 | learn jquery scratch master javascript library | Learn jQuery from Scratch – Master of JavaScri… |
3679 | design wordpress website coding | How To Design A WordPress Website With No Codi… |
3680 | learn build polymer | Learn and Build using Polymer |
3681 | css animations create amazing effects website | CSS Animations: Create Amazing Effects on Your… |
3682 | modx cms build websites beginners guide | Using MODX CMS to Build Websites: A Beginner’s… |
3683 rows × 2 columns
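The three cleaning steps (stopword removal, special character removal, lowercasing) can be sketched in plain Python as follows. This is not how neattext implements them: the stopword list here is a tiny hypothetical one and `clean_title` is a helper name made up for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# neattext ships its own, much larger list.
STOPWORDS = {"a", "an", "and", "for", "how", "in", "of", "on",
             "the", "to", "using", "with", "your"}

def clean_title(title: str) -> str:
    # 1. Drop stopwords (compared case-insensitively)
    kept = [w for w in title.split() if w.lower() not in STOPWORDS]
    # 2. Remove anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", "", " ".join(kept))
    # 3. Collapse repeated spaces and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_title("How To Maximize Your Profits Trading Options"))
print(clean_title("Using MODX CMS to Build Websites: A Beginner's Guide"))
```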
Building Features From the Text
- Convert words to vectors of numbers
- Tfidf (TfidfVectorizer)
- Count (CountVectorizer)
- Hashvec (HashingVectorizer)
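For intuition about what the TF-IDF option computes, here is a hand-rolled sketch on three cleaned titles. It follows the smoothed idf formula documented as scikit-learn's default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The variable and function names are assumptions for this example, and TfidfVectorizer's own tokenisation is more involved than `split()`.

```python
import math

docs = [
    "learn build polymer",
    "learn jquery scratch master javascript library",
    "design wordpress website coding",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many titles does each word appear?
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """TF-IDF weights for one title, L2-normalised to unit length."""
    raw = [tokens.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row = dict(zip(vocab, tfidf_row(tokenized[0])))
print({w: round(v, 3) for w, v in row.items() if v > 0})
```

In the first title, "learn" also appears in another title, so it receives a lower weight than "build" and "polymer", which are unique to it.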
In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [52]:
Xfeatures = df['clean_course_title']
ylabels = df['subject']
In [53]:
Xfeatures
Out[53]:
0       ultimate investment banking course
1       complete gst course certification grow practice
2       financial modeling business analysts consultants
3       beginner pro financial analysis excel 2017
4       maximize profits trading options
        ...
3678    learn jquery scratch master javascript library
3679    design wordpress website coding
3680    learn build polymer
3681    css animations create amazing effects website
3682    modx cms build websites beginners guide
Name: clean_course_title, Length: 3683, dtype: object
In [54]:
tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(Xfeatures)
In [55]:
X
Out[55]:
<3683x3564 sparse matrix of type '<class 'numpy.float64'>' with 18364 stored elements in Compressed Sparse Row format>
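The numbers in this output explain why the result is stored as a sparse matrix. A quick calculation from the reported shape and stored element count (the variable names are chosen for this sketch):

```python
n_rows, n_cols = 3683, 3564   # shape reported in the output above
stored = 18364                # stored (non-zero) elements reported above

density = stored / (n_rows * n_cols)
avg_terms_per_title = stored / n_rows
dense_megabytes = n_rows * n_cols * 8 / 1e6   # float64 uses 8 bytes per entry

print(f"density: {density:.4%}")
print(f"average vocabulary terms per title: {avg_terms_per_title:.2f}")
print(f"size if stored densely: {dense_megabytes:.0f} MB")
```

Only a tiny fraction of the entries are non-zero, so converting to a dense matrix is fine for inspecting a dataset of this size but wastes memory on larger ones.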
In [56]:
X.todense()
Out[56]:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
In [57]:
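# Convert to DF
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0)
df_vec = pd.DataFrame(X.todense(),columns=tfidf_vec.get_feature_names_out())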
In [59]:
df_vec.T
Out[59]:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 3673 | 3674 | 3675 | 3676 | 3677 | 3678 | 3679 | 3680 | 3681 | 3682 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000005 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
zoho | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zombie | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zu | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zuhause | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
zur | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | … | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3564 rows × 3683 columns
Building Models
- Single approach: the vectorizer and the model are used separately
- Pipeline approach: the vectorizer and the model are combined
In [60]:
# Split our dataset
from sklearn.model_selection import train_test_split
In [61]:
x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.3,random_state=42)
In [62]:
x_train.shape
Out[62]:
(2578, 3564)
In [63]:
# Build Model
lr_model = LogisticRegression()
lr_model.fit(x_train,y_train)
Out[63]:
LogisticRegression()
In [64]:
# Accuracy
lr_model.score(x_test,y_test)
Out[64]:
0.9547511312217195
Evaluate Our Model
In [65]:
from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay
In [66]:
y_pred = lr_model.predict(x_test)
In [68]:
# Confusion Matrix : true pos,false pos,etc
# Note: scikit-learn expects (y_true, y_pred); with the arguments in this order,
# rows are predicted labels and columns are true labels
confusion_matrix(y_pred,y_test)
Out[68]:
array([[382,  20,   8,   5],
       [  1, 142,   0,   2],
       [  1,   1, 183,   0],
       [  2,   9,   1, 348]])
In [69]:
df['subject'].unique()
Out[69]:
array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)
In [71]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(lr_model,x_test,y_test,xticks_rotation=40)
Out[71]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd9c164ac10>
In [72]:
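# Classification Report
# Note: scikit-learn expects (y_true, y_pred); with the arguments in this order,
# the precision and recall columns are swapped and support counts the predictions per class
print(classification_report(y_pred,y_test))
                     precision    recall  f1-score   support

   Business Finance       0.99      0.92      0.95       415
     Graphic Design       0.83      0.98      0.90       145
Musical Instruments       0.95      0.99      0.97       185
    Web Development       0.98      0.97      0.97       360

           accuracy                           0.95      1105
          macro avg       0.94      0.96      0.95      1105
       weighted avg       0.96      0.95      0.96      1105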
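The accuracy, precision, and recall can be recomputed by hand from the confusion matrix printed earlier, which is a useful check on how to read it. Because `y_pred` was passed as the first argument, the rows of that matrix are predicted labels and the columns are true labels. The sketch below uses the conventional definitions (precision = correct / predicted as that class, recall = correct / truly that class); the variable names are chosen for this example.

```python
labels = ["Business Finance", "Graphic Design",
          "Musical Instruments", "Web Development"]

# Confusion matrix from the tutorial output:
# rows are predicted labels, columns are true labels.
cm = [
    [382, 20, 8, 5],
    [1, 142, 0, 2],
    [1, 1, 183, 0],
    [2, 9, 1, 348],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

metrics = {}
for i, label in enumerate(labels):
    predicted_as_i = sum(cm[i])             # row sum
    truly_i = sum(row[i] for row in cm)     # column sum
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / truly_i
    metrics[label] = (round(precision, 2), round(recall, 2))
    print(f"{label:20s} precision={precision:.2f} recall={recall:.2f}")

print(f"accuracy={accuracy:.4f}")
```

The accuracy matches the score reported by `lr_model.score`. The per-class values are the same numbers as in the classification report, with precision and recall exchanged, which is the effect of the argument order. For example, Graphic Design has a recall of about 0.83 under the conventional definition: the model misses some true Graphic Design courses, mostly predicting them as Business Finance.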
In [73]:
### Making A Single Prediction
ex = "Building A Simple ML Web App"
In [76]:
def vectorize_text(text):
    my_vec = tfidf_vec.transform([text])
    return my_vec.toarray()
In [77]:
vectorize_text(ex)
Out[77]:
array([[0., 0., 0., ..., 0., 0., 0.]])
In [78]:
sample1 = vectorize_text(ex)
In [79]:
lr_model.predict(sample1)
Out[79]:
array(['Web Development'], dtype=object)
In [80]:
# Prediction Prob
lr_model.predict_proba(sample1)
Out[80]:
array([[0.0452693 , 0.03089783, 0.03488388, 0.88894899]])
In [84]:
lr_model.classes_
Out[84]:
array(['Business Finance', 'Graphic Design', 'Musical Instruments', 'Web Development'], dtype=object)
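The columns of `predict_proba` follow the order of `lr_model.classes_`, so the two outputs can be paired to get a readable ranking. A small sketch using the values printed above (the variable names are chosen for this example):

```python
classes = ["Business Finance", "Graphic Design",
           "Musical Instruments", "Web Development"]
probabilities = [0.0452693, 0.03089783, 0.03488388, 0.88894899]

# Pair each class with its probability and sort from most to least likely
ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
for label, prob in ranked:
    print(f"{label:20s} {prob:.3f}")

best_label, best_prob = ranked[0]
```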
Building ML App Using Pipeline Approach
The above steps can be placed into a pipeline, which is the concept of chaining a sequence of steps into a single workflow. With scikit-learn we can use the Pipeline constructor or the make_pipeline() function for this approach.
A pipeline consists of transformers (objects that take data and change it into another form of data) and estimators (objects that take in data and produce a model).
In our case the transformer is TfidfVectorizer(), used to build features from text, while the estimator is LogisticRegression() or the Naive Bayes estimator MultinomialNB().
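The tutorial code uses make_pipeline(), which names the steps automatically. For comparison, here is a minimal sketch of the explicit Pipeline constructor, where we choose the step names ourselves. The toy titles and the step names "tfidf" and "clf" are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, not the Udemy dataset
titles = [
    "learn html css website",
    "javascript web development",
    "guitar chords beginners",
    "piano lessons music",
]
subjects = ["Web Development", "Web Development",
            "Musical Instruments", "Musical Instruments"]

# Step names ("tfidf", "clf") are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),     # transformer: text -> TF-IDF vectors
    ("clf", LogisticRegression()),    # estimator: vectors -> subject
])
pipe.fit(titles, subjects)

# Raw text goes straight in; the pipeline vectorizes it before predicting
prediction = pipe.predict(["css website design"])
print(prediction)

# Individual steps stay accessible by name
vocab_size = len(pipe.named_steps["tfidf"].vocabulary_)
print(vocab_size)
```

One practical benefit is that the vectorizer is fitted only on the data passed to `fit`, so when the pipeline is trained on the training split, the vocabulary and idf weights are learned without looking at the test titles.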
We can now make a pipeline and use it to perform our text classification.
Below is the full code for the pipeline approach.
In [100]:
### Method 2: Pipeline Approach
# Transformers
tf_vec = TfidfVectorizer()
# Estimators
lr_clf = LogisticRegression()
nv_clf = MultinomialNB()
In [101]:
from sklearn.pipeline import make_pipeline,Pipeline
In [102]:
pipe_lr = make_pipeline(tf_vec,lr_clf)
In [103]:
pipe_nv = make_pipeline(tf_vec,nv_clf)
In [104]:
# Steps
pipe_lr.steps
Out[104]:
[('tfidfvectorizer', TfidfVectorizer()), ('logisticregression', LogisticRegression())]
In [105]:
x_train2,x_test2,y_train2,y_test2 = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)
In [106]:
x_train2
Out[106]:
3068    getting started html
2889    web security common vulnerabilities mitigation
3338    introduction qgis python programming
168     accounting basics 66 minutes absolutely beginners
3414    complete login registration system php mysql
        ...
1130    complete forex trader
1294    santa claus photoshop manipulation
860     cfa level foundation introduction financial re...
3507    professional css flexbox
3174    supercharging development atom text editor
Name: clean_course_title, Length: 2578, dtype: object
In [107]:
# Fit Our dataset
pipe_lr = pipe_lr.fit(x_train2,y_train2)
In [108]:
pipe_lr.score(x_test2,y_test2)
Out[108]:
0.9601809954751132
In [109]:
# Fit Our dataset
pipe_nv = pipe_nv.fit(x_train2,y_train2)
pipe_nv.score(x_test2,y_test2)
Out[109]:
0.9420814479638009
In [111]:
pipe_nv.predict([ex])
Out[111]:
array(['Web Development'], dtype='<U19')
You can check out the entire video tutorial below and the code here.
Thanks For Your Time
Jesus Saves
By Jesse E.Agbe(JCharis)