In today’s tutorial we will be exploring another powerful natural language processing library called Flair. We will be seeing how to use Flair for text classification. Specifically we will learn how to build a model to predict or classify text as either offensive or non-offensive.
What is Flair?
Flair is a state of the art(SOTA) NLP Library built on top of Pytorch which useful for performing several natural language processing task such as
- Sequence Labeling
- Text/Linguistic Annotation
- Named Entity Recognition
- Tagging
- Text Classification and Sentiment Analysis
- Semantic Frame Detection
- etc
It is also known in the Biomedical field as a BioMedical NER Library via using SciSpacy and several pretrained models.
Basic Overview of Flair
The basic overview of Flair includes the following
Let us see how to work with Flair
Installation
To work with flair , you can install it using pip as below
pip install flair
You can also try Flair inside Google’s Colab for simplicity.
Let us move on to our main task – Text Classification using Flair.
To explore more on NLP with Flair you can check out this course.
Text Classification with Flair
Text classification is one of the most useful and common applications of Natural Language Processing. It involves the process of identifying or grouping text into their specific class or categories. There are several ways we can achieve this process but in our case we will be training our own ML model to classify our text as either offensive or non-offensive.
Let us check the simple workflow for performing text classification with Flair
It is essential to understand this in order to make it easier for us in this task.
There are basically 6 steps
- Step1: Prepare Dataset ( as either csv, or fastText format)
- Step2: Split the dataset into 3 (train,test,dev)
- Step3: Create Corpus and Label Dictionary
- Step4: Add Word Embeddings
- Step5: Instantiate Model and Train using the data
- Step6: Use Model to Make Prediction
Preparing Dataset
As we always do we will have to clean our textual dataset to remove noisy,punctuations and special characters. You can simplify this process using the NeatText text cleaning package. Noticed that we have already cleaned our dataset to make it easier for us. Hence we will be using the cleaned dataset (easy)
Now in preparing the dataset, we can use the normal csv format or the FastText format. The FastText format follows the pattern of using
`__label__<class> <text>
`
But you can also use the normal dataset in csv format, the only difference is that you will have to use the CSVClassificationCorpus when building your corpus for the training instead of the ClassificationCorpus for the FastText format.
Splitting the Dataset into 3
One of the requirement for working with Flair for text classification and model building is to have 3 dataset named as train.csv,test.csv,dev.csv (.txt if you are using fasttext format). These are to ensure that we have data for training,testing and validating when we are building the ML model.
We will be using numpy to help us do the splitting but you can also use train_test_split from sklearn if you want. Numpy has a simple function for splitting data/arrays into proportions as you want – the np.split() function.
Creating our Corpus and Label Dictionary
In creating the corpus you can use the CSVClassificationCorpus (for the CSV) and the ClassificationCorpus(for the fastText format) alongside the 3 splitted datasets
Word Embeddings
Flair offers the option of using several word embedding as you want. You can even use the word embeddings from Flair – FlairEmbedding. You can also stack different word embeddings together.
We will ten proceed to build and train our model respectively.
You can check out the code below
In [4]:
# Load EDA Pkgs import pandas as pd import numpy as np
In [5]:
df = pd.read_csv("offensive_vs_non_offensive_mini_dataset.csv")
In [6]:
df.head()
Out[6]:
Unnamed: 0 | clean_tweet | class | labels | |
---|---|---|---|---|
0 | 0 | look at what you just said lls new era girl … | 1 | offensive |
1 | 1 | driving the fucktardmobile tranny slips and a… | 1 | offensive |
2 | 2 | if i ever put ma trust ina bitch i will alwa… | 1 | offensive |
3 | 3 | stop twatching me bitch | 1 | offensive |
4 | 4 | you know bitches be mad when they be lik… | 1 | offensive |
In [7]:
# Check for value count df['class'].value_counts()
Out[7]:
1 3850 0 821 Name: class, dtype: int64
In [8]:
import seaborn as sns
In [9]:
sns.countplot(df['class'])
/usr/local/lib/python3.6/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f52ed97eeb8>
In [10]:
# Preparing Dataset # Using CSV df.head()
Out[10]:
Unnamed: 0 | clean_tweet | class | labels | |
---|---|---|---|---|
0 | 0 | look at what you just said lls new era girl … | 1 | offensive |
1 | 1 | driving the fucktardmobile tranny slips and a… | 1 | offensive |
2 | 2 | if i ever put ma trust ina bitch i will alwa… | 1 | offensive |
3 | 3 | stop twatching me bitch | 1 | offensive |
4 | 4 | you know bitches be mad when they be lik… | 1 | offensive |
In [11]:
df.columns
Out[11]:
Index(['Unnamed: 0', 'clean_tweet', 'class', 'labels'], dtype='object')
In [12]:
df1 = df[['clean_tweet','labels']]
In [14]:
# Rename Columns df1.columns = ['text','labels']
In [15]:
df1
Out[15]:
text | labels | |
---|---|---|
0 | look at what you just said lls new era girl … | offensive |
1 | driving the fucktardmobile tranny slips and a… | offensive |
2 | if i ever put ma trust ina bitch i will alwa… | offensive |
3 | stop twatching me bitch | offensive |
4 | you know bitches be mad when they be lik… | offensive |
… | … | … |
4666 | this bitch gonna steal a police uniform and th… | offensive |
4667 | if california chrome does not go off at even m… | offensive |
4668 | i do not love you hoes | offensive |
4669 | lmaoooo white people lmaoo filth … | offensive |
4670 | if you really wanna please your man seeing… | offensive |
4671 rows × 2 columns In [16]:
# Prepare for FastText Format #__label__ <class> <text> df1.head()
Out[16]:
text | labels | |
---|---|---|
0 | look at what you just said lls new era girl … | offensive |
1 | driving the fucktardmobile tranny slips and a… | offensive |
2 | if i ever put ma trust ina bitch i will alwa… | offensive |
3 | stop twatching me bitch | offensive |
4 | you know bitches be mad when they be lik… | offensive |
In [17]:
# For FastText df_fst = df1.copy()
In [18]:
df_fst.head()
Out[18]:
text | labels | |
---|---|---|
0 | look at what you just said lls new era girl … | offensive |
1 | driving the fucktardmobile tranny slips and a… | offensive |
2 | if i ever put ma trust ina bitch i will alwa… | offensive |
3 | stop twatching me bitch | offensive |
4 | you know bitches be mad when they be lik… | offensive |
In [19]:
'__label__' + df_fst['labels'].astype(str)
Out[19]:
0 __label__offensive 1 __label__offensive 2 __label__offensive 3 __label__offensive 4 __label__offensive ... 4666 __label__offensive 4667 __label__offensive 4668 __label__offensive 4669 __label__offensive 4670 __label__offensive Name: labels, Length: 4671, dtype: object
In [20]:
df_fst['labels'] = '__label__' + df_fst['labels'].astype(str)
In [22]:
df_fst = df_fst[['labels','text']]
In [ ]:
### Spliting Dataset into 3 ### train,test,dev.csv #### 60,20,20
In [24]:
# Using Numpy train,test,dev = np.split(df1,[int(.6*len(df1)),int(.8*len(df1))])
In [26]:
print(df1.shape) print(train.shape) print(test.shape) print(dev.shape)
(4671, 2) (2802, 2) (934, 2) (935, 2)
In [27]:
# Create A Folder for the csv !mkdir -p data
In [28]:
train.to_csv("data/train.csv") test.to_csv("data/test.csv") dev.to_csv("data/dev.csv")
In [29]:
!ls data
dev.csv test.csv train.csv
In [30]:
df_fst
Out[30]:
labels | text | |
---|---|---|
0 | __label__offensive | look at what you just said lls new era girl … |
1 | __label__offensive | driving the fucktardmobile tranny slips and a… |
2 | __label__offensive | if i ever put ma trust ina bitch i will alwa… |
3 | __label__offensive | stop twatching me bitch |
4 | __label__offensive | you know bitches be mad when they be lik… |
… | … | … |
4666 | __label__offensive | this bitch gonna steal a police uniform and th… |
4667 | __label__offensive | if california chrome does not go off at even m… |
4668 | __label__offensive | i do not love you hoes |
4669 | __label__offensive | lmaoooo white people lmaoo filth … |
4670 | __label__offensive | if you really wanna please your man seeing… |
4671 rows × 2 columns In [31]:
# Spliting FastText Format Dataset into 3 # Using Numpy train_fst,test_fst,dev_fst = np.split(df_fst,[int(.6*len(df_fst)),int(.8*len(df_fst))])
In [32]:
# Store in a folder !mkdir -p data_fst
In [33]:
train_fst.to_csv("data_fst/train.csv",sep='\t',index=False,header=False) test_fst.to_csv("data_fst/test.csv",sep='\t',index=False,header=False) dev_fst.to_csv("data_fst/dev.csv",sep='\t',index=False,header=False)
In [34]:
!ls data_fst
dev.csv test.csv train.csv
In [ ]:
### Building our Corpus # CSVClassificationCorpus # ClassificationCorpus
In [35]:
from flair.datasets import ClassificationCorpus,CSVClassificationCorpus from flair.data import Corpus
In [36]:
# For CSV df1.columns
Out[36]:
Index(['text', 'labels'], dtype='object')
In [55]:
# Create Column Mapping to show which column is for label and text column_name_map = {2:"label_topic",1:"text"}
In [39]:
# Location for CSV data_folder = 'data/'
In [56]:
# Create Corpus For CSV corpus_csv: Corpus = CSVClassificationCorpus(data_folder,column_name_map=column_name_map,skip_header=True,delimiter=',')
2020-10-04 13:59:26,410 Reading data from data 2020-10-04 13:59:26,414 Train: data/train.csv 2020-10-04 13:59:26,416 Dev: data/dev.csv 2020-10-04 13:59:26,417 Test: data/test.csv
In [41]:
# Method 2 Using FastText Format data_folder_fst = 'data_fst/'
In [42]:
corpus_fst: Corpus = ClassificationCorpus(data_folder_fst)
2020-10-04 13:32:54,135 Reading data from data_fst 2020-10-04 13:32:54,137 Train: data_fst/train.csv 2020-10-04 13:32:54,138 Dev: data_fst/dev.csv 2020-10-04 13:32:54,139 Test: data_fst/test.csv
In [57]:
# Creating the Label Diction For CSV label_dict_csv = corpus_csv.make_label_dictionary()
2020-10-04 13:59:43,582 Computing label dictionary. Progress:
100%|██████████| 3736/3736 [00:02<00:00, 1383.38it/s]
2020-10-04 13:59:46,550 [b'offensive', b'non_offensive']
In [44]:
# Creating the Label Diction For FastText label_dict_fst = corpus_fst.make_label_dictionary()
2020-10-04 13:34:57,415 Computing label dictionary. Progress:
100%|██████████| 3733/3733 [00:02<00:00, 1827.83it/s]
2020-10-04 13:34:59,588 [b'offensive', b'non_offensive']
In [45]:
# Working with the Word Embeddings from flair.embeddings import FlairEmbeddings,WordEmbeddings,StackedEmbeddings,DocumentLSTMEmbeddings,DocumentRNNEmbeddings
In [46]:
# Create our WEmbeddings word_embeddings = [FlairEmbeddings('news-forward-fast'),FlairEmbeddings('news-backward-fast')]
2020-10-04 13:40:09,332 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-news-english-forward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmpoq0qzh98
100%|██████████| 19689779/19689779 [00:00<00:00, 37035937.62B/s]
2020-10-04 13:40:09,930 copying /tmp/tmpoq0qzh98 to cache at /root/.flair/embeddings/lm-news-english-forward-1024-v0.2rc.pt 2020-10-04 13:40:09,977 removing temp file /tmp/tmpoq0qzh98
2020-10-04 13:40:10,619 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/lm-news-english-backward-1024-v0.2rc.pt not found in cache, downloading to /tmp/tmpr4dnpuah
100%|██████████| 19689779/19689779 [00:00<00:00, 36642750.83B/s]
2020-10-04 13:40:11,225 copying /tmp/tmpr4dnpuah to cache at /root/.flair/embeddings/lm-news-english-backward-1024-v0.2rc.pt 2020-10-04 13:40:11,255 removing temp file /tmp/tmpr4dnpuah
In [62]:
# Document Embeddings document_embeddings = DocumentLSTMEmbeddings(word_embeddings,hidden_size=512,reproject_words=True,reproject_words_dimension=256)
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated method __init__. (The functionality of this class is moved to 'DocumentRNNEmbeddings') -- Deprecated since version 0.4.
Building and Training
In [63]:
# Load NLP Pkgs from flair.models import TextClassifier from flair.trainers import ModelTrainer
In [64]:
# Classifier with CSV dataset clf = TextClassifier(document_embeddings,label_dictionary=label_dict_csv)
In [68]:
# Classifier with FastText Format clf2 = TextClassifier(document_embeddings,label_dictionary=label_dict_fst)
In [69]:
# Training # Init trainer = ModelTrainer(clf2,corpus_fst)
In [70]:
# Fit/Training with Dataset trainer.train('data_fst/',max_epochs=2)
2020-10-04 14:07:16,724 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,727 Model: "TextClassifier( (document_embeddings): DocumentLSTMEmbeddings( (embeddings): StackedEmbeddings( (list_embedding_0): FlairEmbeddings( (lm): LanguageModel( (drop): Dropout(p=0.25, inplace=False) (encoder): Embedding(275, 100) (rnn): LSTM(100, 1024) (decoder): Linear(in_features=1024, out_features=275, bias=True) ) ) (list_embedding_1): FlairEmbeddings( (lm): LanguageModel( (drop): Dropout(p=0.25, inplace=False) (encoder): Embedding(275, 100) (rnn): LSTM(100, 1024) (decoder): Linear(in_features=1024, out_features=275, bias=True) ) ) ) (word_reprojection_map): Linear(in_features=2048, out_features=256, bias=True) (rnn): GRU(256, 512) (dropout): Dropout(p=0.5, inplace=False) ) (decoder): Linear(in_features=512, out_features=2, bias=True) (loss_function): CrossEntropyLoss() (beta): 1.0 (weights): None (weight_tensor) None )" 2020-10-04 14:07:16,729 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,731 Corpus: "Corpus: 2801 train + 935 dev + 932 test sentences" 2020-10-04 14:07:16,733 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,735 Parameters: 2020-10-04 14:07:16,737 - learning_rate: "0.1" 2020-10-04 14:07:16,741 - mini_batch_size: "32" 2020-10-04 14:07:16,743 - patience: "3" 2020-10-04 14:07:16,744 - anneal_factor: "0.5" 2020-10-04 14:07:16,745 - max_epochs: "2" 2020-10-04 14:07:16,746 - shuffle: "True" 2020-10-04 14:07:16,748 - train_with_dev: "False" 2020-10-04 14:07:16,751 - batch_growth_annealing: "False" 2020-10-04 14:07:16,752 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,756 Model training base path: "data_fst" 2020-10-04 14:07:16,758 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,761 Device: cpu 2020-10-04 14:07:16,763 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:16,765 Embeddings storage mode: cpu 2020-10-04 14:07:16,767 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:07:35,472 epoch 1 - iter 8/88 - loss 0.51544183 - samples/sec: 13.93 - lr: 0.100000 2020-10-04 14:07:53,353 epoch 1 - iter 16/88 - loss 0.48173103 - samples/sec: 14.33 - lr: 0.100000 2020-10-04 14:08:10,173 epoch 1 - iter 24/88 - loss 0.48178329 - samples/sec: 15.59 - lr: 0.100000 2020-10-04 14:08:26,789 epoch 1 - iter 32/88 - loss 0.48624196 - samples/sec: 15.42 - lr: 0.100000 2020-10-04 14:08:43,764 epoch 1 - iter 40/88 - loss 0.48142542 - samples/sec: 15.11 - lr: 0.100000 2020-10-04 14:09:00,269 epoch 1 - iter 48/88 - loss 0.46682192 - samples/sec: 15.52 - lr: 0.100000 2020-10-04 14:09:17,383 epoch 1 - iter 56/88 - loss 0.46152312 - samples/sec: 14.98 - lr: 0.100000 2020-10-04 14:09:34,412 epoch 1 - iter 64/88 - loss 0.46239550 - samples/sec: 15.27 - lr: 0.100000 2020-10-04 14:09:50,761 epoch 1 - iter 72/88 - loss 0.45929928 - samples/sec: 15.68 - lr: 0.100000 2020-10-04 14:10:07,515 epoch 1 - iter 80/88 - loss 0.45689389 - samples/sec: 15.30 - lr: 0.100000 2020-10-04 14:10:24,207 epoch 1 - iter 88/88 - loss 0.45517061 - samples/sec: 15.34 - lr: 0.100000 2020-10-04 14:10:24,341 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:10:24,343 EPOCH 1 done: loss 0.4552 - lr 0.1000000 2020-10-04 14:11:24,636 DEV : loss 0.4188002943992615 - score 0.8182 2020-10-04 14:11:25,024 BAD EPOCHS (no improvement): 0 saving best model 2020-10-04 14:11:25,103 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:11:42,894 epoch 2 - iter 8/88 - loss 0.37963553 - samples/sec: 14.60 - lr: 0.100000 2020-10-04 14:12:00,383 epoch 2 - iter 16/88 - loss 0.39962170 - samples/sec: 14.65 - lr: 0.100000 2020-10-04 14:12:17,873 epoch 2 - iter 24/88 - loss 0.39758716 - samples/sec: 14.66 - lr: 0.100000 2020-10-04 14:12:34,445 epoch 2 - iter 32/88 - loss 0.41368278 - samples/sec: 15.48 - lr: 0.100000 2020-10-04 14:12:52,102 epoch 2 - iter 40/88 - loss 0.41964307 - samples/sec: 14.77 - lr: 0.100000 2020-10-04 14:13:08,885 epoch 2 - iter 48/88 - loss 0.41549904 - samples/sec: 15.28 - lr: 0.100000 2020-10-04 14:13:25,909 epoch 2 - iter 56/88 - loss 0.40940068 - samples/sec: 15.05 - lr: 0.100000 2020-10-04 14:13:42,876 epoch 2 - iter 64/88 - loss 0.40874149 - samples/sec: 15.11 - lr: 0.100000 2020-10-04 14:13:59,836 epoch 2 - iter 72/88 - loss 0.41267412 - samples/sec: 15.10 - lr: 0.100000 2020-10-04 14:14:16,749 epoch 2 - iter 80/88 - loss 0.41082591 - samples/sec: 15.16 - lr: 0.100000 2020-10-04 14:14:33,315 epoch 2 - iter 88/88 - loss 0.41240300 - samples/sec: 15.46 - lr: 0.100000 2020-10-04 14:14:33,468 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:14:33,470 EPOCH 2 done: loss 0.4124 - lr 0.1000000 2020-10-04 14:15:32,670 DEV : loss 0.4321114122867584 - score 0.8214 2020-10-04 14:15:33,049 BAD EPOCHS (no improvement): 0 saving best model 2020-10-04 14:15:33,232 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:15:33,235 Testing using best model ... 2020-10-04 14:15:33,237 loading file data_fst/best-model.pt 2020-10-04 14:16:30,423 0.8541 2020-10-04 14:16:30,425 Results: - F-score (micro) 0.8541 - F-score (macro) 0.7223 - Accuracy 0.8541 By class: precision recall f1-score support offensive 0.8965 0.9313 0.9136 772 non_offensive 0.5923 0.4813 0.5310 160 micro avg 0.8541 0.8541 0.8541 932 macro avg 0.7444 0.7063 0.7223 932 weighted avg 0.8443 0.8541 0.8479 932 samples avg 0.8541 0.8541 0.8541 932 2020-10-04 14:16:30,427 ----------------------------------------------------------------------------------------------------
Out[70]:
{'dev_loss_history': [0.4188002943992615, 0.4321114122867584], 'dev_score_history': [0.8182, 0.8214], 'test_score': 0.8541, 'train_loss_history': [0.4551706066863103, 0.4124030032279817]}
In [71]:
# Making Prediciton # Load Saved Model and Predict new_clf = TextClassifier.load('data_fst/best-model.pt')
2020-10-04 14:19:05,861 loading file data_fst/best-model.pt
In [72]:
from flair.data import Sentence
In [73]:
# Sample Sentence ex1 = Sentence("That girl is a bitch") ex2 = Sentence("This is a good material")
In [74]:
# Apply our model new_clf.predict(ex1)
In [75]:
ex1.labels
Out[75]:
[offensive (0.7436)]
In [76]:
new_clf.predict(ex2)
In [77]:
ex2.labels
Out[77]:
[non_offensive (0.7178)]
In [78]:
# Plot Loss Curve from flair.visual.training_curves import Plotter
In [79]:
plotter = Plotter() plotter.plot_training_curves('data_fst/loss.tsv') plotter.plot_weights('data_fst/weights.txt')
2020-10-04 14:23:07,751 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:23:07,753 WARNING: No LOSS found for test split in this data. 2020-10-04 14:23:07,754 Are you sure you want to plot LOSS and not another value? 2020-10-04 14:23:07,755 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:23:07,791 ---------------------------------------------------------------------------------------------------- 2020-10-04 14:23:07,792 WARNING: No F1 found for test split in this data. 2020-10-04 14:23:07,793 Are you sure you want to plot F1 and not another value? 2020-10-04 14:23:07,794 ----------------------------------------------------------------------------------------------------
No handles with labels found to put in legend.
Loss and F1 plots are saved in data_fst/training.png
Weights plots are saved in data_fst/weights.png
There is also a video tutorial that you can also check out for how it was done
Thanks For Your Time
Jesus Saves
By Jesse E.Agbe(JCharis)