Natural Language Processing with spaCy and Python
In this lesson, we will be looking at spaCy, an industrial-strength Natural Language Processing library. spaCy was developed by Explosion AI (Matthew Honnibal and his team).
spaCy is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), POS tagging, dependency parsing, word vectors, and more.
Let us see how to install spaCy on our system.
Installing the Library
- sudo pip install spacy
- sudo python -m spacy download en
- sudo python -m spacy download fr
Installing using Conda
- conda install -c conda-forge spacy
- sudo python -m spacy download en
- sudo python -m spacy download fr
Installing On Windows using Conda
- conda config --add channels conda-forge
- conda update anaconda
- conda install tqdm
- conda install -c conda-forge spacy
- sudo python -m spacy download en
Downloading Models for Other Languages
- sudo python -m spacy download de # German
- sudo python -m spacy download es # Spanish
- sudo python -m spacy download xx # Multilanguage
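To verify the installation, spaCy ships a validate command that checks the downloaded models against your spaCy version; a quick check:
- python -m spacy validate
- python -c "import spacy; print(spacy.__version__)"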
In [2]:
# Loading the package and the English model
import spacy
nlp = spacy.load("en")
# Alternative: import en_core_web_sm; nlp = en_core_web_sm.load()
Reading a Document or Text
In [3]:
ex1 = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [4]:
# Reading the text of the document
ex1.text
Out[4]:
In [5]:
# Reading a file
myfile = open("samplefile.txt").read()
In [6]:
doc_file = nlp(myfile)
In [7]:
doc_file.text
Out[7]:
Sentence Tokens
- Tokenization == splitting or segmenting the text into sentences or tokens
- Sentence tokenization segments text into sentences; word tokenization segments it into words, punctuation marks, etc.
- .sents
In [8]:
# List of Sentences in File
list(doc_file.sents)
Out[8]:
In [9]:
# Sentence Tokens
for sentence in doc_file.sents:
    print(sentence)
Word Tokens
- Splitting or segmenting the text into words
- .text
In [10]:
docx = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [11]:
# Word Tokens
for token in docx:
    print(token.text)
In [12]:
# List of Word Tokens
[token.text for token in docx]
Out[12]:
In [14]:
# Compare with naive splitting on spaces (punctuation stays attached to words)
simpletext = "Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob"
simpletext.split(" ")
Out[14]:
More About Words
- .shape_ ==> the orthographic shape of the word, e.g. capitalized, lowercase, digits
- .is_alpha ==> returns a boolean (True or False): does the token consist of alphabetic characters?
- .is_stop ==> returns a boolean (True or False): is the token a stop word?
In [15]:
docx
Out[15]:
In [17]:
# Word Shape
for word in docx:
    print("Token =>",word.text)
    print("Shape of Token =>",word.shape_)
    print("Is it alphabetic =>",word.is_alpha)
    print("Is it a Stopword =>",word.is_stop)
Part of Speech Tagging
Part-of-speech (POS) tagging: assigning word types to tokens, like verb or noun.
- .pos ==> the integer ID of the coarse-grained part-of-speech tag
- .pos_ ==> returns the readable string representation of the attribute
- .tag ==> the integer ID of the fine-grained tag
- .tag_ ==> returns the readable string representation of the attribute
In [18]:
docx
Out[18]:
In [19]:
# Parts of Speech (coarse-grained), as string and integer ID
for word in docx:
    print((word,word.pos_,word.pos))
In [20]:
# Parts of Speech (fine-grained tag abbreviation)
for word in docx:
    print((word,word.tag_))
In [21]:
# Fine-grained tag abbreviation and its integer ID
for word in ex1:
    print((word,word.tag_,word.tag))
If you want to know the meaning of a POS abbreviation
- spacy.explain('DT')
In [22]:
spacy.explain('NN')
Out[22]:
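The same helper can describe every tag in a doc in one pass; a minimal sketch reusing the docx doc from above:
# Expand each fine-grained tag into its plain-English description
for word in docx:
    print((word.text, word.tag_, spacy.explain(word.tag_)))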
Syntactic Dependency
- It tells us the grammatical relation between tokens
In [28]:
docx2 = nlp(u"This tool was written by Matt Honnibal, a computational linguist, in Cython.")
In [29]:
for word in docx2:
    print((word,word.tag_,word.dep_))
Visualizing Dependencies Using displaCy
- from spacy import displacy
- displacy.serve(doc, style='dep') # for standalone scripts, see the sketch after this list
- displacy.render(doc, style='dep', jupyter=True) # for Jupyter notebooks
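Outside a notebook, displacy.serve starts a small local web server (on port 5000 by default) and renders the visualization in the browser; a minimal sketch using the docx2 doc defined above:
# Serve the dependency visualization in the browser (blocks until interrupted)
from spacy import displacy
displacy.serve(docx2, style='dep')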
In [30]:
docx2
Out[30]:
In [31]:
# To display dependencies and other visualizations
from spacy import displacy
In [32]:
# For Jupyter notebooks you can set jupyter=True to render it properly
displacy.render(ex1,style='dep',jupyter=True)
In [ ]:
# Visualizing Named Entity Recognition
#displacy.render(ex1,style='ent',jupyter=True,options={'distance':140})
displacy.render(ex1,style='ent',jupyter=True)
Named Entity Recognition or Detection
- Locating and classifying the named entities in a text into predefined categories of real-world objects, such as people and organizations.
- .ents
- .label_
In [34]:
wikitext = nlp(u"Bill Gates is an American business magnate, investor, author, humanitarian, and principal founder of Microsoft Corporation")
In [35]:
wikitext2 = nlp(u"Linus Benedict Torvalds is a Finnish-American software engineer who is the creator, and for a long time, principal developer of the Linux kernel, which became the kernel for operating systems such as the Linux operating systems, Android, and Chrome OS.")
In [36]:
for entity in wikitext.ents:
    print(entity.text,entity.label_)
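The entity label abbreviations can be expanded with spacy.explain, the same helper used for the POS tags above:
# Expand each entity label (e.g. PERSON, ORG, GPE) into a description
for entity in wikitext.ents:
    print((entity.text, entity.label_, spacy.explain(entity.label_)))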
In [37]:
# Visualize with displaCy
displacy.render(wikitext,style='ent',jupyter=True)
In [38]:
# Visualize with displaCy
displacy.render(wikitext2,style='ent',jupyter=True)
In [39]:
excercise1 = nlp(u"All the faith he had had had had no effect on the outcome of his life")
# In "had had had had", the first pair is part of a modifier while the second pair contains the main verb of the sentence
excercise2 = nlp("The man the professor the student has studies Rome.")
# Intended reading: the student has the professor, who knows the man, who studies Rome
In [41]:
# Parts of speech for the confusing sentence
for word in excercise1:
    print((word.text,word.pos_,word.tag_,word.dep_))
In [42]:
displacy.render(excercise1,style='dep',jupyter=True)
In [43]:
# Parts of speech for the confusing sentence
for word in excercise2:
    print((word.text,word.pos_,word.tag_,word.dep_))
In [45]:
displacy.render(excercise2,style='dep',jupyter=True)
In [46]:
displacy.render(excercise2,style='ent',jupyter=True)
Text Normalization and Word Inflection
- Word inflection == the syntactic differences between word forms
- Normalization == reducing a word to its base/root form
- Lemmatization
  - reduces a word to its base form according to its intended meaning, using vocabulary and morphology
- Stemming
  - cuts off prefixes/suffixes to reduce a word to a base form (see the sketch after the lemmatization examples below)
- Word Shape Analysis
In [47]:
## Lemmatization
docx_lemma = nlp("studying student study studies studio studious")
In [53]:
for token in docx_lemma:
    print(token.text,"=>",token.lemma_,token.pos_)
In [56]:
docx_lemma1 = nlp("better goods run running die dies dye dying dice")
In [57]:
for word in docx_lemma1:
    print(word.text,"=>",word.lemma_,word.pos_)
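spaCy itself does not ship a stemmer, since lemmatization generally gives better results. For comparison with the stemming approach mentioned above, here is a minimal sketch using NLTK's PorterStemmer (this assumes the nltk package is installed):
# Stemming the same words with NLTK's PorterStemmer for contrast
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in "studying student study studies studio studious".split():
    print(word, "=>", stemmer.stem(word))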
Word Vectors and Similarity
- object1.similarity(object2)
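Note that the small 'en' model ships without real word vectors, so similarity scores are approximated from context-sensitive tensors; for more meaningful scores, load a model with vectors such as en_core_web_md or en_core_web_lg. A quick way to check vector availability:
# Check whether the loaded model provides a true vector for a token
token = nlp(u"wolf")[0]
print(token.has_vector, token.vector_norm, token.is_oov)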
In [58]:
# Species
doc1 = nlp(u"wolf")
doc2 = nlp(u"dog")
In [59]:
# Similarity of object
doc1.similarity(doc2)
Out[59]:
In [60]:
# Synonyms
syn1 = nlp("smart")
syn2 = nlp("clever")
In [61]:
# Similarity of words
syn1.similarity(syn2)
Out[61]:
In [62]:
similarword = nlp("wolf dog fish birds")
In [63]:
for token in similarword:
    print(token.text)
In [64]:
# Similarity Between Tokens
for token1 in similarword:
    for token2 in similarword:
        print((token1.text,token2.text),"similarity=>",token1.similarity(token2))
In [65]:
# Nested list comprehension pattern: [x for b in a for x in b]
[token1.similarity(token2) for token2 in similarword for token1 in similarword]
Out[65]:
Using DataFrames
In [67]:
docx_similar = [(token1.text,token2.text,token1.similarity(token2)) for token2 in similarword for token1 in similarword]
In [68]:
import pandas as pd
In [69]:
df = pd.DataFrame(docx_similar)
In [70]:
df.head()
Out[70]:
In [71]:
df.columns = ["Token1","Token2","Similarity"]
In [72]:
df.head()
Out[72]:
In [73]:
# Types
df.dtypes
Out[73]:
In [74]:
# Visualization Package with Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [77]:
# Encoding the tokens as integers for plotting
df_viz = df.replace({'wolf':1,'dog':2,'fish':3,'birds':4})
In [78]:
df_viz.head()
Out[78]:
In [79]:
# Plotting with Correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz.corr(),annot=True)
plt.show()
In [80]:
# Plotting without correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz,annot=True)
plt.show()
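Integer-encoding the tokens and correlating them is a roundabout way to visualize similarity; pivoting the DataFrame into a square token-by-token matrix and heatmapping that directly is usually clearer. A sketch using the df built above:
# Pivot into a square similarity matrix and plot it directly
sim_matrix = df.pivot(index="Token1", columns="Token2", values="Similarity")
plt.figure(figsize=(20,10))
sns.heatmap(sim_matrix, annot=True)
plt.show()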
Word Analysis
- shape of word
- is_alpha
- is_stop
In [140]:
doc_word = nlp("Hello SpaCy this is an A.I company product created in 2014")
In [144]:
for token in doc_word:
    print(token.text,"=>",token.shape_,"=>",token.is_stop,"=>",token.pos_)
Noun Chunks
- a noun + the words describing the noun
In [131]:
excercise2
Out[131]:
In [137]:
# Noun Phrases
for noun in excercise2.noun_chunks:
    print(noun)
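Each chunk also exposes its root token and that token's dependency relation to its head; a minimal sketch:
# Inspect the root of each noun chunk and its relation to its head
for chunk in excercise2.noun_chunks:
    print((chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text))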