Text Classification with Machine Learning Using Udemy Dataset and Python

In this tutorial, which is part of the End-To-End Data Science Project using the Udemy Dataset, we will perform text classification using the course title and the subject category. Our aim is to predict the subject category given the course title.

By the end of this tutorial you will learn:

  • What text classification means
  • How to build features from textual data
  • How to build ML models using two approaches
  • More

Text Classification

Text classification is the process of assigning text to one of a set of predefined categories or classes. It is a supervised machine learning task: the model learns from examples that already carry a label. It is related to topic clustering, which groups text using an unsupervised ML approach instead.

There are several types of text classification:

  • Binary Text Classification: classifying text into one of two target groups
  • Multi Class Text Classification: classifying text into one of more than two target groups
  • Multi Label Text Classification: assigning each text any number of labels at once, so a single text can belong to several target groups
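To make the three types concrete, here is a small sketch of what the target labels look like in each case. The labels below are hypothetical examples, not values from the Udemy dataset.

```python
# Hypothetical labels for three short texts; none of these come from the Udemy dataset.

# Binary: every text gets exactly one of two classes.
binary_labels = ["spam", "ham", "spam"]

# Multi class: every text gets exactly one of more than two classes.
multiclass_labels = ["Business Finance", "Web Development", "Graphic Design"]

# Multi label: every text gets a set of labels, which may hold several or none.
multilabel_labels = [{"python", "web"}, {"design"}, set()]

print(len(set(binary_labels)))                        # distinct binary classes
print(len(set(multiclass_labels)))                    # distinct classes
print(max(len(tags) for tags in multilabel_labels))   # most labels on one text
```

Our Udemy task is the multi class case: each course title has exactly one subject.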

We will be using the Udemy dataset, which is available on Kaggle or here. The dataset has a course_title column, which we will use as the input text, and a subject column, which we will use as the target label.

We will use only these two columns: course_title and subject.

Let us start.

We will first build our model with the normal, step-by-step approach and then rebuild it with the alternative pipeline approach.

Building ML Model Using Normal Approach

First of all, we have to convert our text into numerical word vectors that the ML model can work with. Since ML algorithms operate on numerical data, we perform some feature engineering using CountVectorizer or TfidfVectorizer. The main idea is to transform each title into a vector of word weights that our ML algorithm can process.
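As a minimal sketch of what this conversion does, the example below runs CountVectorizer on three made-up titles (the titles are assumptions for illustration, not rows from the dataset). Each title becomes one row, and each column counts one word of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned titles, used only for illustration.
titles = [
    "learn python programming",
    "learn web design",
    "python web development",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(titles)  # sparse matrix: one row per title

# vocabulary_ maps each word to its column index; columns are in alphabetical order.
vocab = sorted(count_vec.vocabulary_)
print(vocab)
print(counts.toarray())
```

A word that does not appear in a title gets a 0 in that title's row, which is why these matrices are mostly zeros on real data.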

After this we split our dataset into two parts, one for training and one for testing our model. Finally we fit our ML algorithm, which can be either LogisticRegression or Naive Bayes, on the vectorized training data.
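The notebook below walks through this workflow with LogisticRegression. As a rough sketch of the same steps with the Naive Bayes alternative, here is a version on a tiny made-up dataset (the titles and labels are hypothetical, and the accuracy on such a small sample means very little).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the cleaned course titles and subjects.
titles = [
    "learn html css basics", "build websites javascript", "php mysql web apps",
    "web development bootcamp", "javascript beginners guide", "css flexbox layout",
    "guitar lessons beginners", "learn piano chords", "play guitar songs",
    "piano beginners course", "drum lessons basics", "learn violin fast",
]
subjects = ["Web Development"] * 6 + ["Musical Instruments"] * 6

# Step 1: turn text into word-count vectors.
X = CountVectorizer().fit_transform(titles)

# Step 2: split into training and test sets (stratify keeps both classes in each part).
x_train, x_test, y_train, y_test = train_test_split(
    X, subjects, test_size=0.25, random_state=42, stratify=subjects
)

# Step 3: fit the estimator on the training part and score it on the held-out part.
nb_model = MultinomialNB().fit(x_train, y_train)
print(nb_model.score(x_test, y_test))
```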

We can also interpret our model using Eli5 or Lime.
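Eli5 and Lime are separate packages and are not covered in this notebook. As a simpler stand-in (not the Eli5 or Lime API), we can look at the coefficients of a fitted LogisticRegression to see which words push a title towards each class. The sketch below uses hypothetical toy titles, and top_words is a helper name made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with three subjects.
titles = [
    "learn web development with html", "build websites with javascript",
    "web apps with php and mysql", "guitar lessons for beginners",
    "learn piano chords fast", "play guitar songs by ear",
    "photoshop design for beginners", "logo design in illustrator",
    "graphic design basics",
]
subjects = (["Web Development"] * 3 + ["Musical Instruments"] * 3
            + ["Graphic Design"] * 3)

vec = TfidfVectorizer()
X = vec.fit_transform(titles)
clf = LogisticRegression().fit(X, subjects)

# Reverse lookup: column index -> word.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

def top_words(class_index, n=3):
    """Return the n words with the largest coefficients for one class."""
    order = np.argsort(clf.coef_[class_index])[::-1][:n]
    return [index_to_word[i] for i in order]

for i, label in enumerate(clf.classes_):
    print(label, top_words(i))
```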

In [29]:

# Load EDA Pkgs
import pandas as pd
import neattext.functions as nfx

In [30]:

# Load Data Viz
import seaborn as sns

In [31]:

# Load ML Pkgs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [32]:

# Load Dataset
df = pd.read_csv("data/udemy_courses.csv")


0                      Ultimate Investment Banking Course
1       Complete GST Course & Certification - Grow Pra...
2        Financial Modeling Business Analysts Consultants
3            Beginner Pro - Financial Analysis Excel 2017
4                        Maximize Profits Trading Options
3678     Learn jQuery Scratch - Master JavaScript library
3679                      Design WordPress Website Coding
3680                                  Learn Build Polymer
3681       CSS Animations: Create Amazing Effects Website
3682            MODX CMS Build Websites: Beginner's Guide
Name: course_title, Length: 3683, dtype: object

In [44]:

# Remove stopwords
df['clean_course_title'] = df['course_title'].apply(nfx.remove_stopwords)

In [45]:

df[['clean_course_title','course_title']]

      clean_course_title                                course_title
0     Ultimate Investment Banking Course                Ultimate Investment Banking Course
1     Complete GST Course & Certification – Grow Pra…   Complete GST Course & Certification – Grow You…
2     Financial Modeling Business Analysts Consultants  Financial Modeling for Business Analysts and C…
3     Beginner Pro – Financial Analysis Excel 2017      Beginner to Pro – Financial Analysis in Excel …
4     Maximize Profits Trading Options                  How To Maximize Your Profits Trading Options
3678  Learn jQuery Scratch – Master JavaScript library  Learn jQuery from Scratch – Master of JavaScri…
3679  Design WordPress Website Coding                   How To Design A WordPress Website With No Codi…
3680  Learn Build Polymer                               Learn and Build using Polymer
3681  CSS Animations: Create Amazing Effects Website    CSS Animations: Create Amazing Effects on Your…
3682  MODX CMS Build Websites: Beginner’s Guide         Using MODX CMS to Build Websites: A Beginner’s…

3683 rows × 2 columns

In [46]:

# Remove special characters
df['clean_course_title'] = df['clean_course_title'].apply(nfx.remove_special_characters)

In [49]:

# Reduce to lowercase
df['clean_course_title'] = df['clean_course_title'].str.lower()

In [50]:

df[['clean_course_title','course_title']]

      clean_course_title                                course_title
0     ultimate investment banking course                Ultimate Investment Banking Course
1     complete gst course certification grow practice   Complete GST Course & Certification – Grow You…
2     financial modeling business analysts consultants  Financial Modeling for Business Analysts and C…
3     beginner pro financial analysis excel 2017        Beginner to Pro – Financial Analysis in Excel …
4     maximize profits trading options                  How To Maximize Your Profits Trading Options
3678  learn jquery scratch master javascript library    Learn jQuery from Scratch – Master of JavaScri…
3679  design wordpress website coding                   How To Design A WordPress Website With No Codi…
3680  learn build polymer                               Learn and Build using Polymer
3681  css animations create amazing effects website     CSS Animations: Create Amazing Effects on Your…
3682  modx cms build websites beginners guide           Using MODX CMS to Build Websites: A Beginner’s…

3683 rows × 2 columns
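To show what the three cleaning steps do to a title, here is a plain-Python stand-in for them. It does not use neattext: the stopword list is a tiny hypothetical one, clean_title is a made-up helper name, and the real nfx.remove_stopwords and nfx.remove_special_characters functions may behave differently on other inputs.

```python
import re

# Tiny hypothetical stopword list; neattext ships its own, much longer list.
STOPWORDS = {"a", "an", "and", "for", "from", "how", "in", "of", "on",
             "the", "to", "using", "with", "your"}

def clean_title(title):
    """Drop stopwords, strip special characters, then lowercase."""
    kept = [word for word in title.split() if word.lower() not in STOPWORDS]
    no_special = re.sub(r"[^A-Za-z0-9 ]+", "", " ".join(kept))
    return no_special.lower()

print(clean_title("Learn and Build using Polymer"))
print(clean_title("Beginner to Pro - Financial Analysis in Excel 2017"))
```

Removing a special character that stood between two spaces leaves a double space behind. This does not affect the vectorizers used later, which split on word boundaries.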

Building Features From the Text

  • Convert words to vectors of numbers
  • TF-IDF weights (TfidfVectorizer)
  • Word counts (CountVectorizer)
  • Hashed features (HashingVectorizer)
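The notebook continues with TfidfVectorizer only. As a quick, hedged comparison of the three options, the sketch below applies each one to the same three made-up titles; n_features=16 is an arbitrary small value chosen for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Hypothetical cleaned titles, used only for illustration.
titles = [
    "learn python programming",
    "learn web design",
    "python web development",
]

# Count: raw word counts, one column per vocabulary word.
count_X = CountVectorizer().fit_transform(titles)

# TF-IDF: counts reweighted so words found in many titles weigh less.
tfidf_X = TfidfVectorizer().fit_transform(titles)

# Hashing: no vocabulary is stored, so no fit is needed; the width is fixed up front.
hash_X = HashingVectorizer(n_features=16).transform(titles)

print(count_X.shape, tfidf_X.shape, hash_X.shape)

# By default each TF-IDF row is scaled to unit length (norm="l2").
row_norms = np.asarray(tfidf_X.multiply(tfidf_X).sum(axis=1)).ravel()
print(row_norms)
```

The hashing option trades away the word-to-column lookup, so it suits very large vocabularies better than a small dataset like this one where we may want to inspect the features.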

In [51]:

from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:

Xfeatures = df['clean_course_title']
ylabels = df['subject']

In [53]:

Xfeatures

0                      ultimate investment banking course
1       complete gst course  certification  grow practice
2        financial modeling business analysts consultants
3             beginner pro  financial analysis excel 2017
4                        maximize profits trading options
3678      learn jquery scratch  master javascript library
3679                      design wordpress website coding
3680                                  learn build polymer
3681        css animations create amazing effects website
3682              modx cms build websites beginners guide
Name: clean_course_title, Length: 3683, dtype: object

In [54]:

tfidf_vec = TfidfVectorizer()
X = tfidf_vec.fit_transform(Xfeatures)

In [55]:

X

<3683x3564 sparse matrix of type '<class 'numpy.float64'>'
	with 18364 stored elements in Compressed Sparse Row format>

In [56]:

X.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [57]:

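# Convert to DF
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
df_vec = pd.DataFrame(X.todense(),columns=tfidf_vec.get_feature_names_out())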

In [59]:




3564 rows × 3683 columns

In [ ]:

### Building Models
+ Single Approach*
    - Separately
+ Pipeline

In [60]:

# Split our dataset
from sklearn.model_selection import train_test_split

In [61]:

x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.3,random_state=42)

In [62]:

x_train.shape

(2578, 3564)

In [63]:

# Build Model
lr_model = LogisticRegression()
lr_model.fit(x_train,y_train)

In [64]:

# Accuracy
lr_model.score(x_test,y_test)

In [ ]:

### Evaluate our model

In [65]:

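# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay

In [66]:

y_pred = lr_model.predict(x_test)

In [68]:

# Confusion Matrix : rows are the true classes, columns are the predicted classes
confusion_matrix(y_test,y_pred)

array([[382,  20,   8,   5],
       [  1, 142,   0,   2],
       [  1,   1, 183,   0],
       [  2,   9,   1, 348]])

In [69]:

lr_model.classes_

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)

In [71]:

ConfusionMatrixDisplay.from_estimator(lr_model,x_test,y_test)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd9c164ac10>

In [72]:

# Classification Report
print(classification_report(y_test,y_pred))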
                     precision    recall  f1-score   support

   Business Finance       0.99      0.92      0.95       415
     Graphic Design       0.83      0.98      0.90       145
Musical Instruments       0.95      0.99      0.97       185
    Web Development       0.98      0.97      0.97       360

           accuracy                           0.95      1105
          macro avg       0.94      0.96      0.95      1105
       weighted avg       0.96      0.95      0.96      1105
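The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading both. The sketch below copies the matrix from the output above (rows are true classes, columns are predicted classes) and uses plain Python; the helper names recall and precision are ours.

```python
# Confusion matrix copied from the notebook output: rows = true, columns = predicted.
cm = [
    [382, 20, 8, 5],
    [1, 142, 0, 2],
    [1, 1, 183, 0],
    [2, 9, 1, 348],
]
classes = ["Business Finance", "Graphic Design",
           "Musical Instruments", "Web Development"]

total = sum(sum(row) for row in cm)                 # all test samples
correct = sum(cm[i][i] for i in range(len(cm)))     # the diagonal
accuracy = correct / total

def recall(i):
    """Correct predictions for class i divided by all true samples of class i."""
    return cm[i][i] / sum(cm[i])

def precision(i):
    """Correct predictions for class i divided by all predictions of class i."""
    return cm[i][i] / sum(row[i] for row in cm)

print(round(accuracy, 2))
for i, label in enumerate(classes):
    print(label, round(precision(i), 2), round(recall(i), 2))
```

For example, Graphic Design has high recall (142 of its 145 titles are found) but lower precision, because 20 Business Finance titles and 9 Web Development titles are also predicted as Graphic Design.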

In [73]:

### Making A Single Prediction
ex = "Building A Simple ML Web App"

In [76]:

def vectorize_text(text):
    my_vec = tfidf_vec.transform([text])
    return my_vec.toarray()

In [77]:

vectorize_text(ex)

array([[0., 0., 0., ..., 0., 0., 0.]])

In [78]:

sample1 = vectorize_text(ex)

In [79]:

lr_model.predict(sample1)

array(['Web Development'], dtype=object)

In [80]:

# Prediction Prob
lr_model.predict_proba(sample1)

array([[0.0452693 , 0.03089783, 0.03488388, 0.88894899]])

In [84]:

lr_model.classes_

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)
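The probabilities come back in the same order as the class names, so the two arrays can be paired up to read the result more easily. A small sketch using the values printed above:

```python
# Values copied from the two notebook outputs above.
classes = ["Business Finance", "Graphic Design",
           "Musical Instruments", "Web Development"]
probs = [0.0452693, 0.03089783, 0.03488388, 0.88894899]

# Pair each class name with its probability.
prob_by_class = dict(zip(classes, probs))

# The predicted class is the one with the highest probability.
best_class = max(prob_by_class, key=prob_by_class.get)
print(prob_by_class)
print(best_class)
```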

Building ML App Using Pipeline Approach

The above steps can be placed into a pipeline, which is the concept of chaining a sequence of steps into a single workflow. With scikit-learn we can use the Pipeline constructor or the make_pipeline() function for this approach.

A pipeline consists of transformers (objects that take data and change it into another form of data) and estimators (objects that take in data and produce a model).

In our case our transformer is the TfidfVectorizer(), used to build features from text, while our estimator is LogisticRegression() or the Naive Bayes estimator MultinomialNB().

We can now make a pipeline and use it to perform our text classification.

Below is the code for the pipeline approach.

In [100]:

### Method 2: Pipeline Approach
# Transformers
tf_vec = TfidfVectorizer()
# Estimators
lr_clf = LogisticRegression()
nv_clf = MultinomialNB()

In [101]:

from sklearn.pipeline import make_pipeline,Pipeline

In [102]:

pipe_lr = make_pipeline(tf_vec,lr_clf)

In [103]:

pipe_nv = make_pipeline(tf_vec,nv_clf)

In [104]:

# Steps
pipe_lr.steps

[('tfidfvectorizer', TfidfVectorizer()),
 ('logisticregression', LogisticRegression())]
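make_pipeline() names each step automatically from its class name, as the output above shows. With the Pipeline constructor we choose the names ourselves. The sketch below uses hypothetical toy titles, and the step names "tfidf" and "clf" are our own choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the cleaned titles and subjects.
titles = [
    "learn html css basics", "build websites javascript",
    "php mysql web apps", "guitar lessons beginners",
    "learn piano chords", "play guitar songs",
]
subjects = ["Web Development"] * 3 + ["Musical Instruments"] * 3

# Each step is a (name, object) pair; the last step is the estimator.
pipe = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# The pipeline takes raw text: it vectorizes and fits in one call.
pipe.fit(titles, subjects)

step_names = [name for name, _ in pipe.steps]
prediction = pipe.predict(["build a web app"])
print(step_names)
print(prediction)
```

Because the vectorizer lives inside the pipeline, new text can be passed to predict() directly, without a separate helper such as vectorize_text().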

In [105]:

x_train2,x_test2,y_train2,y_test2 = train_test_split(Xfeatures,ylabels,test_size=0.3,random_state=42)

In [106]:

x_train2

3068                                 getting started html
2889       web security common vulnerabilities mitigation
3338                 introduction qgis python programming
168     accounting basics 66 minutes absolutely beginners
3414         complete login registration system php mysql
1130                                complete forex trader
1294                   santa claus photoshop manipulation
860     cfa level foundation introduction financial re...
3507                             professional css flexbox
3174           supercharging development atom text editor
Name: clean_course_title, Length: 2578, dtype: object

In [107]:

# Fit Our dataset
pipe_lr = pipe_lr.fit(x_train2,y_train2)

In [108]:




In [109]:

# Fit Our dataset
pipe_nv = pipe_nv.fit(x_train2,y_train2)



In [111]:

pipe_nv.predict([ex])

array(['Web Development'], dtype='<U19')


You can check out the entire video tutorial below and the code here.

Thanks For Your Time

Jesus Saves

By Jesse E. Agbe (JCharis)

