Text Classification with Machine Learning Using Udemy Dataset and Python

In this tutorial, which is part of the End-To-End Data Science Project using the Udemy Dataset, we will perform text classification using the course title and the subject category. The aim of this project is to predict the subject category given the course title.

By the end of this tutorial you will learn:

What we mean by text classification
Building Features From Textual Data
Building ML Models using two approaches
More

Text Classification

Text classification is the process of assigning text to a predefined category or class. It is a supervised machine learning technique, because the model learns from examples that already carry a label. It is related to topic clustering, which groups texts with an unsupervised ML approach and needs no labels.

There are several types of text classification:

Binary Text Classification: classifying text into two target groups
Multi Class Text Classification: classifying text into more than two target groups
Multi Label Text Classification: classifying text where each item can be assigned more than one label at the same time
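To make the difference between these three setups concrete, here is a minimal sketch of what the target labels look like in each case. The labels and tags below are made up for illustration and are not taken from the Udemy dataset. `MultiLabelBinarizer` is scikit-learn's helper for encoding multi-label targets.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Binary: every text gets exactly one of two labels
binary_targets = ["paid", "free", "paid"]

# Multi-class: every text gets exactly one of several labels
multiclass_targets = ["Web Development", "Graphic Design", "Business Finance"]

# Multi-label: every text may carry several labels at the same time
multilabel_targets = [
    ["python", "web"],
    ["design"],
    ["finance", "excel", "python"],
]

mlb = MultiLabelBinarizer()
y_multilabel = mlb.fit_transform(multilabel_targets)

print(mlb.classes_)   # label names, sorted alphabetically
print(y_multilabel)   # one row per text, one 0/1 column per label
```

The project in this tutorial is the multi-class case: each course title belongs to exactly one subject.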

We will be using the Udemy dataset, which is available on Kaggle or here. The dataset has a course_title column, which we will use as the input text, and a subject column, which we will use as the target label.

We will use only these two columns: course_title and subject.

Let us start.

The basic workflow is to first build our model with the normal, step-by-step approach and then rebuild it with the alternative pipeline approach.

Building ML Model Using Normal Approach

First we have to convert our text into numerical word vectors that the ML model can work with. Since ML algorithms require numerical input, we will perform some feature engineering using CountVectorizer or TfidfVectorizer. The main idea is to transform each title into a vector of word counts or word weights that our ML algorithm can process.
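As a rough illustration of what the two vectorizers produce, the sketch below runs both on a three-title toy corpus. The titles are hypothetical and are not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical titles, not rows from the dataset)
titles = [
    "learn python programming",
    "learn web design",
    "python web development",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(titles)

tfidf_vec_demo = TfidfVectorizer()
weights = tfidf_vec_demo.fit_transform(titles)

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(count_vec.vocabulary_)
print(vocab)
print(counts.toarray())             # raw term counts per title
print(weights.toarray().round(2))   # TF-IDF weights, each row L2-normalised
```

In the first title, "programming" occurs in only one document while "learn" occurs in two, so TF-IDF gives "programming" the larger weight even though both have a raw count of 1.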

After this we will split our dataset into a training set and a test set. Finally we will fit our ML algorithm, which can be either LogisticRegression or Naive Bayes, on the vectorized training data.

We can also interpret our model using ELI5 or LIME.
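ELI5 and LIME are separate packages with their own APIs, which this tutorial does not walk through. As a rough, dependency-free stand-in (a simpler and different technique, not what ELI5 or LIME compute), you can look at the coefficients a linear model has learned for each term. The toy titles below are hypothetical and exist only to make the sketch runnable.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, only to make the sketch runnable
titles = [
    "learn html css website", "javascript web app", "build website php",
    "stock trading options", "financial analysis excel", "forex trading course",
]
subjects = ["Web Development"] * 3 + ["Business Finance"] * 3

vec = TfidfVectorizer()
X_toy = vec.fit_transform(titles)
clf = LogisticRegression().fit(X_toy, subjects)

# Map column index -> term using the fitted vocabulary
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# With two classes, coef_ has a single row: positive weights push towards
# clf.classes_[1], negative weights push towards clf.classes_[0]
order = np.argsort(clf.coef_[0])
print(clf.classes_[0], "<-", terms[order[:3]])
print(clf.classes_[1], "<-", terms[order[-3:]])
```

With more than two classes, `coef_` has one row per class, and the same sorting can be applied row by row.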

In [29]:

# Load EDA Pkgs
import pandas as pd
import neattext.functions as nfx

In [30]:

# Load Data Viz
import seaborn as sns

In [31]:

# Load ML Pkgs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [32]:

# Load Dataset
df = pd.read_csv("data/udemy_courses.csv")

df['course_title'].apply(nfx.remove_stopwords)

Out[39]:

0                      Ultimate Investment Banking Course
1       Complete GST Course & Certification - Grow Pra...
2        Financial Modeling Business Analysts Consultants
3            Beginner Pro - Financial Analysis Excel 2017
4                        Maximize Profits Trading Options
                              ...                        
3678     Learn jQuery Scratch - Master JavaScript library
3679                      Design WordPress Website Coding
3680                                  Learn Build Polymer
3681       CSS Animations: Create Amazing Effects Website
3682            MODX CMS Build Websites: Beginner's Guide
Name: course_title, Length: 3683, dtype: object

In [44]:

# Remove stopwords
df['clean_course_title'] = df['course_title'].apply(nfx.remove_stopwords)

In [45]:

df[['clean_course_title','course_title']]

Out[45]:

	clean_course_title	course_title
0	Ultimate Investment Banking Course	Ultimate Investment Banking Course
1	Complete GST Course & Certification – Grow Pra…	Complete GST Course & Certification – Grow You…
2	Financial Modeling Business Analysts Consultants	Financial Modeling for Business Analysts and C…
3	Beginner Pro – Financial Analysis Excel 2017	Beginner to Pro – Financial Analysis in Excel …
4	Maximize Profits Trading Options	How To Maximize Your Profits Trading Options
…	…	…
3678	Learn jQuery Scratch – Master JavaScript library	Learn jQuery from Scratch – Master of JavaScri…
3679	Design WordPress Website Coding	How To Design A WordPress Website With No Codi…
3680	Learn Build Polymer	Learn and Build using Polymer
3681	CSS Animations: Create Amazing Effects Website	CSS Animations: Create Amazing Effects on Your…
3682	MODX CMS Build Websites: Beginner’s Guide	Using MODX CMS to Build Websites: A Beginner’s…

3683 rows × 2 columns

In [46]:

# Remove special characters
df['clean_course_title'] = df['clean_course_title'].apply(nfx.remove_special_characters)

In [49]:

# Reduce to lowercase
df['clean_course_title'] = df['clean_course_title'].str.lower()

In [50]:

df[['clean_course_title','course_title']]

Out[50]:

	clean_course_title	course_title
0	ultimate investment banking course	Ultimate Investment Banking Course
1	complete gst course certification grow practice	Complete GST Course & Certification – Grow You…
2	financial modeling business analysts consultants	Financial Modeling for Business Analysts and C…
3	beginner pro financial analysis excel 2017	Beginner to Pro – Financial Analysis in Excel …
4	maximize profits trading options	How To Maximize Your Profits Trading Options
…	…	…
3678	learn jquery scratch master javascript library	Learn jQuery from Scratch – Master of JavaScri…
3679	design wordpress website coding	How To Design A WordPress Website With No Codi…
3680	learn build polymer	Learn and Build using Polymer
3681	css animations create amazing effects website	CSS Animations: Create Amazing Effects on Your…
3682	modx cms build websites beginners guide	Using MODX CMS to Build Websites: A Beginner’s…

3683 rows × 2 columns
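The three cleaning steps used so far (remove stopwords, remove special characters, lowercase) can be sketched with the standard library alone. This is an approximation for illustration, not neattext's implementation: the stopword set is a tiny hand-picked list and the regular expression is an assumption about what counts as a special character, so results on other titles may differ from what neattext returns.

```python
import re

# Tiny illustrative stopword list (an assumption; neattext ships a much larger one)
STOPWORDS = {"a", "an", "and", "for", "from", "how", "in", "of", "on",
             "the", "to", "using", "with", "your"}

def clean_title(title: str) -> str:
    """Remove stopwords, strip special characters, lowercase, squeeze spaces."""
    # 1. Drop stopwords (case-insensitive match on whole tokens)
    tokens = [t for t in title.split() if t.lower() not in STOPWORDS]
    text = " ".join(tokens)
    # 2. Keep only ASCII letters, digits and whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # 3. Collapse repeated whitespace and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_title("How To Maximize Your Profits Trading Options"))
print(clean_title("Learn and Build using Polymer"))
```

Collapsing whitespace in step 3 also avoids the double spaces that are left behind when a special character between two words is removed.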

Building Features From the Text

Convert words to vectors of numbers, using one of:
TfidfVectorizer
CountVectorizer
HashingVectorizer
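The cells that follow use TfidfVectorizer. For comparison, here is a minimal sketch of the hashing option on two hypothetical titles. HashingVectorizer maps each term to a column with a hash function, so it stores no vocabulary; the trade-off is that columns cannot be mapped back to words and different terms may collide in the same column.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical titles, only for illustration
titles = ["learn python programming", "python web development"]

# n_features fixes the output width up front; 16 is far too small for real
# use and is chosen here only to keep the printed matrix readable
hash_vec = HashingVectorizer(n_features=16)
X_hashed = hash_vec.transform(titles)   # stateless: no fit step is needed

print(X_hashed.shape)
print(X_hashed.toarray().round(2))
```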

In [51]:

from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:

Xfeatures = df['clean_course_title']
ylabels = df['subject']

In [53]:

Xfeatures

Out[53]:

0                      ultimate investment banking course
1       complete gst course  certification  grow practice
2        financial modeling business analysts consultants
3             beginner pro  financial analysis excel 2017
4                        maximize profits trading options
                              ...                        
3678      learn jquery scratch  master javascript library
3679                      design wordpress website coding
3680                                  learn build polymer
3681        css animations create amazing effects website
3682              modx cms build websites beginners guide
Name: clean_course_title, Length: 3683, dtype: object

In [54]:

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(Xfeatures)

In [55]:

X

Out[55]:

<3683x3564 sparse matrix of type '<class 'numpy.float64'>'
	with 18364 stored elements in Compressed Sparse Row format>

In [56]:

X.todense()

Out[56]:

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
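The dense view above shows almost nothing but zeros, which is why scikit-learn returns a sparse matrix in the first place. A quick back-of-the-envelope check, using only the numbers printed in Out[55], shows how sparse it is:

```python
# Figures copied from the sparse matrix summary in Out[55]
n_rows, n_cols, stored = 3683, 3564, 18364

total_cells = n_rows * n_cols
density = stored / total_cells

print(total_cells)               # cells a dense copy would have to hold
print(f"{density:.4%}")          # share of cells that are non-zero
print(round(stored / n_rows, 2)) # average number of distinct terms per title
```

Roughly 0.14% of the cells are non-zero, about five terms per title. Calling todense() is fine at this size, but on a much larger corpus the dense copy may not fit in memory.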

In [57]:

# Convert to DF
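# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df_vec = pd.DataFrame(X.todense(),columns=tfidf_vec.get_feature_names_out())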

In [59]:

df_vec.T

Out[59]:

	0	1	2	3	4	5	6	7	8	9	…	3673	3674	3675	3676	3677	3678	3679	3680	3681	3682
000005	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
001	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
01	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
02	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
10	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
zoho	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
zombie	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
zu	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
zuhause	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
zur	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	…	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

3564 rows × 3683 columns

Building Models

Single approach: vectorize the text and fit the model as separate steps
Pipeline approach: combine both steps into a single object

In [60]:

# Split our dataset
from sklearn.model_selection import train_test_split

In [61]:

x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.3,random_state=42)

In [62]:

x_train.shape

Out[62]:

(2578, 3564)

In [63]:

# Build Model
lr_model = LogisticRegression()
lr_model.fit(x_train,y_train)

Out[63]:

LogisticRegression()

In [64]:

# Accuracy
lr_model.score(x_test,y_test)

Out[64]:

0.9547511312217195

Evaluate our model

In [65]:

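from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay

In [66]:

y_pred = lr_model.predict(x_test)

In [68]:

# Confusion Matrix : true pos,false pos,etc
# Note: with the arguments in this order (y_pred first), rows are predicted
# labels and columns are true labels, the transpose of scikit-learn's usual layout
confusion_matrix(y_pred,y_test)

Out[68]:

array([[382,  20,   8,   5],
       [  1, 142,   0,   2],
       [  1,   1, 183,   0],
       [  2,   9,   1, 348]])

In [69]:

df['subject'].unique()

Out[69]:

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)

In [71]:

# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (1.0+) replaces it
ConfusionMatrixDisplay.from_estimator(lr_model,x_test,y_test,xticks_rotation=40)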

Out[71]:

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd9c164ac10>
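If the plot is not available, for example when running outside a notebook, a labelled table is a reasonable substitute for the raw array. The sketch below is one possible way to build it, using hypothetical labels and scikit-learn's documented argument order of true labels first, predictions second.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only to make the sketch runnable
y_true = ["Web Development", "Web Development", "Graphic Design", "Business Finance"]
y_hat = ["Web Development", "Graphic Design", "Graphic Design", "Business Finance"]

labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_hat, labels=labels)   # rows: true, columns: predicted

cm_df = pd.DataFrame(
    cm,
    index=[f"true: {name}" for name in labels],
    columns=[f"pred: {name}" for name in labels],
)
print(cm_df)
```

Passing `labels` explicitly pins the row and column order, so the table does not silently change if a class is missing from one of the two lists.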

In [72]:

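# Classification Report
# Note: y_pred is passed first here, so the "precision" and "recall" columns
# are swapped relative to scikit-learn's (y_true, y_pred) convention
print(classification_report(y_pred,y_test))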

                     precision    recall  f1-score   support

   Business Finance       0.99      0.92      0.95       415
     Graphic Design       0.83      0.98      0.90       145
Musical Instruments       0.95      0.99      0.97       185
    Web Development       0.98      0.97      0.97       360

           accuracy                           0.95      1105
          macro avg       0.94      0.96      0.95      1105
       weighted avg       0.96      0.95      0.96      1105
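The figures in this report can be recomputed by hand from the confusion matrix in Out[68], which is a useful sanity check. The sketch below copies that matrix and derives the accuracy and the per-class ratios from it.

```python
import numpy as np

# Confusion matrix copied from Out[68]
cm = np.array([
    [382,  20,   8,   5],
    [  1, 142,   0,   2],
    [  1,   1, 183,   0],
    [  2,   9,   1, 348],
])
labels = ["Business Finance", "Graphic Design", "Musical Instruments", "Web Development"]

diagonal = np.diag(cm)                 # correctly classified titles per class
accuracy = diagonal.sum() / cm.sum()   # 1055 / 1105
per_row = diagonal / cm.sum(axis=1)    # correct / row total
per_col = diagonal / cm.sum(axis=0)    # correct / column total

print(round(accuracy, 4))
for name, r, c in zip(labels, per_row, per_col):
    print(f"{name:<20} row ratio={r:.2f}  column ratio={c:.2f}")
```

The accuracy matches the 0.9547 returned by lr_model.score. The row ratios reproduce the recall column of the report and the column ratios reproduce its precision column. Because both calls above pass y_pred as the first argument, the rows here are predictions, so the report's "recall" is computed over predicted counts; the row totals (415, 145, 185, 360) are the same numbers shown under support.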

In [73]:

### Making A Single Prediction
ex = "Building A Simple ML Web App"

In [76]:

def vectorize_text(text):
    my_vec = tfidf_vec.transform([text])
    return my_vec.toarray()

In [77]:

vectorize_text(ex)

Out[77]:

array([[0., 0., 0., ..., 0., 0., 0.]])

In [78]:

sample1 = vectorize_text(ex)

In [79]:

lr_model.predict(sample1)

Out[79]:

array(['Web Development'], dtype=object)

In [80]:

# Prediction Prob
lr_model.predict_proba(sample1)

Out[80]:

array([[0.0452693 , 0.03089783, 0.03488388, 0.88894899]])

In [84]:

lr_model.classes_

Out[84]:

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)
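The columns returned by predict_proba follow the order of classes_, which is why the 0.889 in the last position of Out[80] belongs to 'Web Development'. Pairing the two makes the output easier to read. The sketch below shows the idea on hypothetical toy data so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data
titles = ["build web app html", "javascript website course",
          "guitar lessons beginners", "piano chords course"]
subjects = ["Web Development", "Web Development",
            "Musical Instruments", "Musical Instruments"]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(titles), subjects)

sample = vec.transform(["simple web app"])
probs = clf.predict_proba(sample)[0]

# classes_ and the probability columns share the same order
by_class = dict(zip(clf.classes_, probs))
for name, p in sorted(by_class.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {p:.3f}")
```

In the notebook itself the equivalent would be dict(zip(lr_model.classes_, lr_model.predict_proba(sample1)[0])).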

Building ML App Using Pipeline Approach

The above steps can be placed into a pipeline, which chains a sequence of steps into a single workflow. With scikit-learn we can use the Pipeline constructor or the make_pipeline() function for this approach.

A pipeline consists of transformers (objects that take data and change it into another form) and an estimator (an object that takes in data and produces a fitted model).

In our case the transformer is the TfidfVectorizer() used to build features from text, while the estimator is LogisticRegression() or the Naive Bayes estimator MultinomialNB().
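The two ways of building a pipeline differ mainly in how the steps are named. The sketch below puts them side by side, using the TfidfVectorizer and LogisticRegression that the notebook cells use. The step names "tfidf" and "clf" and the toy titles are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

# Explicit constructor: you choose the step names ("tfidf" and "clf" are arbitrary)
named_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Helper function: step names are derived from the lowercased class names
auto_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

print([name for name, _ in named_pipe.steps])
print([name for name, _ in auto_pipe.steps])

# Hypothetical toy data to show that raw strings go straight into the pipeline
titles = ["learn html css", "javascript web app", "guitar for beginners", "piano chords"]
subjects = ["Web Development", "Web Development", "Musical Instruments", "Musical Instruments"]
named_pipe.fit(titles, subjects)
print(named_pipe.predict(["css web design"]))
```

Explicit names are handy later if you want to reach into a step, for example named_pipe.named_steps["tfidf"], or tune its parameters.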

We can now make a pipeline and use it to perform our text classification. Below is the code for the pipeline approach.

In [100]:

### Method 2: Pipeline Approach
# Transformers
tf_vec = TfidfVectorizer()
# Estimators
lr_clf = LogisticRegression()
nv_clf = MultinomialNB()

In [101]:

from sklearn.pipeline import make_pipeline,Pipeline

In [102]:

pipe_lr = make_pipeline(tf_vec,lr_clf)

In [103]:

pipe_nv = make_pipeline(tf_vec,nv_clf)

In [104]:

# Steps
pipe_lr.steps

Out[104]:

[('tfidfvectorizer', TfidfVectorizer()),
 ('logisticregression', LogisticRegression())]

In [105]:

x_train2,x_test2,y_train2,y_test2 = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)

In [106]:

x_train2

Out[106]:

3068                                 getting started html
2889       web security common vulnerabilities mitigation
3338                 introduction qgis python programming
168     accounting basics 66 minutes absolutely beginners
3414         complete login registration system php mysql
                              ...                        
1130                                complete forex trader
1294                   santa claus photoshop manipulation
860     cfa level foundation introduction financial re...
3507                             professional css flexbox
3174           supercharging development atom text editor
Name: clean_course_title, Length: 2578, dtype: object

In [107]:

# Fit Our dataset
pipe_lr = pipe_lr.fit(x_train2,y_train2)

In [108]:

pipe_lr.score(x_test2,y_test2)

Out[108]:

0.9601809954751132

In [109]:

# Fit Our dataset
pipe_nv = pipe_nv.fit(x_train2,y_train2)
pipe_nv.score(x_test2,y_test2)

Out[109]:

0.9420814479638009

In [111]:

pipe_nv.predict([ex])

Out[111]:

array(['Web Development'], dtype='<U19')

You can check out the entire video tutorial below and the code here.

Thanks For Your Time

Jesus Saves

By Jesse E. Agbe (JCharis)
