Natural Language Processing With SpaCy and Python

In this lesson, we will be looking at SpaCy, an industrial-strength natural language processing library. SpaCy was developed by Explosion.ai (Matthew Honnibal and his team).

SpaCy is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors and more.

So let us see how to install SpaCy on our system.

Installing the Library

  • sudo pip install spacy
  • sudo python -m spacy download en
  • sudo python -m spacy download fr

Installing using Conda

  • conda install -c conda-forge spacy
  • sudo python -m spacy download en
  • sudo python -m spacy download fr

Installing On Windows using Conda

  • conda config --add channels conda-forge
  • conda update anaconda
  • conda install tqdm
  • conda install -c conda-forge spacy
  • sudo python -m spacy download en

Downloading Models for Other Languages

  • sudo python -m spacy download de # German
  • sudo python -m spacy download es # Spanish
  • sudo python -m spacy download xx # Multilanguage
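
Once downloaded, a model is loaded via the same shortcut name. A minimal sketch (shortcut names per spaCy 2.x, as used above):

import spacy
# Hypothetical example: load the German model downloaded above
nlp_de = spacy.load('de')
print([token.text for token in nlp_de(u"Berlin ist eine Stadt in Deutschland")])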
In [2]:
# Loading the package
import spacy
nlp = spacy.load("en")
# Alternatively: import en_core_web_sm; nlp = en_core_web_sm.load()

Reading A Document or Text

In [3]:
ex1 = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [4]:
# Reading the text/tokens
ex1.text
Out[4]:
'Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob'
In [5]:
# Reading a file
myfile = open("samplefile.txt").read()
In [6]:
doc_file = nlp(myfile)
In [7]:
doc_file.text
Out[7]:
'The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it'

Sentence Tokens

  • Tokenization == splitting or segmenting the text into sentences or tokens
  • Tokenization: segmenting text into words, punctuation marks etc.
  • .sents
In [8]:
# List of Sentences in File
list(doc_file.sents)
Out[8]:
[The best error message is the one that never shows up.,
 You Learn More From Failure,
 Than From Success. ,
 The purpose of software engineering is to control complexity, not to create it]
In [9]:
# Sentence Tokens
for sentence in doc_file.sents:
    print(sentence)
The best error message is the one that never shows up.

You Learn More From Failure
Than From Success. 

The purpose of software engineering is to control complexity, not to create it

Word Tokens

  • Splitting or segmenting the text into words
  • .text
In [10]:
docx = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [11]:
# Word Tokens
for token in docx:
    print(token.text)
Hello
world
,
this
is
SpaCy
one
of
the
fastest
Natural
Language
Processing
tools
just
like
NLTK
and
TextBlob
In [12]:
# List of Word Tokens
[token.text for token in docx]
Out[12]:
['Hello',
 'world',
 ',',
 'this',
 'is',
 'SpaCy',
 'one',
 'of',
 'the',
 'fastest',
 'Natural',
 'Language',
 'Processing',
 'tools',
 'just',
 'like',
 'NLTK',
 'and',
 'TextBlob']
In [14]:
# Naive baseline: splitting on spaces keeps punctuation attached (note 'world,' below), unlike SpaCy's tokenizer
simpletext = "Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob"
simpletext.split(" ")
Out[14]:
['Hello',
 'world,',
 'this',
 'is',
 'SpaCy',
 'one',
 'of',
 'the',
 'fastest',
 'Natural',
 'Language',
 'Processing',
 'tools',
 'just',
 'like',
 'NLTK',
 'and',
 'TextBlob']

More about words

  • .shape_ ==> the orthographic shape of the word, e.g. capitalized, lowercase, digits
  • .is_alpha ==> returns a boolean (True or False) indicating whether the token consists of alphabetic characters
  • .is_stop ==> returns a boolean (True or False) indicating whether the token is a stop word
In [15]:
docx
Out[15]:
Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob
In [17]:
# Word Shape
for word in docx:
    print("Tokens =>",word.text)
    print("Shape of Token =>",word.shape_)
    print("Is is an alphabet =>",word.is_alpha)
    print("Is it a Stopword =>",word.is_stop)
Tokens => Hello
Shape of Token => Xxxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => world
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => ,
Shape of Token => ,
Is it alphabetic => False
Is it a Stopword => False
Tokens => this
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => True
Tokens => is
Shape of Token => xx
Is it alphabetic => True
Is it a Stopword => True
Tokens => SpaCy
Shape of Token => XxxXx
Is it alphabetic => True
Is it a Stopword => False
Tokens => one
Shape of Token => xxx
Is it alphabetic => True
Is it a Stopword => True
Tokens => of
Shape of Token => xx
Is it alphabetic => True
Is it a Stopword => True
Tokens => the
Shape of Token => xxx
Is it alphabetic => True
Is it a Stopword => True
Tokens => fastest
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => Natural
Shape of Token => Xxxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => Language
Shape of Token => Xxxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => Processing
Shape of Token => Xxxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => tools
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => just
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => True
Tokens => like
Shape of Token => xxxx
Is it alphabetic => True
Is it a Stopword => False
Tokens => NLTK
Shape of Token => XXXX
Is it alphabetic => True
Is it a Stopword => False
Tokens => and
Shape of Token => xxx
Is it alphabetic => True
Is it a Stopword => True
Tokens => TextBlob
Shape of Token => XxxxXxxx
Is it alphabetic => True
Is it a Stopword => False
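
SpaCy also exposes its full English stop-word list, which you can inspect or customize. A minimal sketch (module path as in spaCy 2.x):

from spacy.lang.en.stop_words import STOP_WORDS
print(len(STOP_WORDS))          # size of the built-in stop-word list
print(sorted(STOP_WORDS)[:10])  # first few stop words, alphabetically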

Part of Speech Tagging

Part-of-speech (POS) tagging: assigning word types to tokens, such as verb or noun.
  • .pos ==> integer ID of the coarse-grained part of speech
  • .pos_ ==> readable string representation of the attribute
  • .tag ==> integer ID of the fine-grained tag
  • .tag_ ==> readable string representation of the attribute
In [18]:
docx
Out[18]:
Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob
In [19]:
# Parts of Speech Simple Term
for word in docx:
    print((word,word.pos_,word.pos))
(Hello, 'INTJ', 90)
(world, 'NOUN', 91)
(,, 'PUNCT', 96)
(this, 'DET', 89)
(is, 'VERB', 99)
(SpaCy, 'ADJ', 83)
(one, 'NUM', 92)
(of, 'ADP', 84)
(the, 'DET', 89)
(fastest, 'ADJ', 83)
(Natural, 'PROPN', 95)
(Language, 'PROPN', 95)
(Processing, 'PROPN', 95)
(tools, 'NOUN', 91)
(just, 'ADV', 85)
(like, 'ADP', 84)
(NLTK, 'PROPN', 95)
(and, 'CCONJ', 88)
(TextBlob, 'PROPN', 95)
In [20]:
# Parts of Speech Abbreviation of Tag
for word in docx:
    print((word,word.tag_))
(Hello, 'UH')
(world, 'NN')
(,, ',')
(this, 'DT')
(is, 'VBZ')
(SpaCy, 'JJ')
(one, 'CD')
(of, 'IN')
(the, 'DT')
(fastest, 'JJS')
(Natural, 'NNP')
(Language, 'NNP')
(Processing, 'NNP')
(tools, 'NNS')
(just, 'RB')
(like, 'IN')
(NLTK, 'NNP')
(and, 'CC')
(TextBlob, 'NNP')
In [21]:
# Parts of Speech tag abbreviation and its integer ID
for word in ex1:
    print((word,word.tag_,word.tag))
(Hello, 'UH', 3252815442139690129)
(world, 'NN', 15308085513773655218)
(,, ',', 2593208677638477497)
(this, 'DT', 15267657372422890137)
(is, 'VBZ', 13927759927860985106)
(SpaCy, 'JJ', 10554686591937588953)
(one, 'CD', 8427216679587749980)
(of, 'IN', 1292078113972184607)
(the, 'DT', 15267657372422890137)
(fastest, 'JJS', 14753207560692742245)
(Natural, 'NNP', 15794550382381185553)
(Language, 'NNP', 15794550382381185553)
(Processing, 'NNP', 15794550382381185553)
(tools, 'NNS', 783433942507015291)
(just, 'RB', 164681854541413346)
(like, 'IN', 1292078113972184607)
(NLTK, 'NNP', 15794550382381185553)
(and, 'CC', 17571114184892886314)
(TextBlob, 'NNP', 15794550382381185553)
If you want to know the meaning of a POS abbreviation:
  • spacy.explain('DT')
In [22]:
spacy.explain('NN')
Out[22]:
'noun, singular or mass'
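
spacy.explain also works for coarse POS labels, dependency labels and entity types. A quick sketch:

print(spacy.explain('JJS'))    # adjective, superlative
print(spacy.explain('nsubj'))  # nominal subject
print(spacy.explain('ORG'))    # companies, agencies, institutions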

Syntactic Dependency

  • Dependency parsing tells us the grammatical relation between tokens (each token points to its syntactic head)
In [28]:
docx2 = nlp(u"This tool was written by Matt Honnibal, a computer linguist in Cython.")
In [29]:
for word in docx2:
    print((word,word.tag_,word.dep_))
(This, 'DT', 'det')
(tool, 'NN', 'nsubjpass')
(was, 'VBD', 'auxpass')
(written, 'VBN', 'ROOT')
(by, 'IN', 'agent')
(Matt, 'NNP', 'compound')
(Honnibal, 'NNP', 'pobj')
(,, ',', 'punct')
(a, 'DT', 'det')
(computer, 'NN', 'compound')
(linguist, 'NN', 'appos')
(in, 'IN', 'prep')
(Cython, 'NNP', 'pobj')
(., '.', 'punct')
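
Beyond the dependency label, each token exposes its syntactic head and its children, so you can walk the parse tree. A minimal sketch:

for token in docx2:
    print(token.text, "<--", token.dep_, "--", token.head.text, "| children:", [child.text for child in token.children])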

Visualizing Dependency using displaCy

  • from spacy import displacy
  • displacy.serve()
  • displacy.render(jupyter=True) # for jupyter notebook
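
Outside of a notebook, displacy.serve starts a small web server and renders the visualization in the browser. A minimal sketch (serves at http://localhost:5000 by default):

from spacy import displacy
# Blocks and serves the dependency visualization until interrupted
displacy.serve(docx2, style='dep')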
In [30]:
docx2
Out[30]:
This tool was written by Matt Honnibal, a computer linguist in Cython.
In [31]:
# To display the dependencies and other visualizations
from spacy import displacy
In [32]:
# For Jupyter Notebooks you can set jupyter=True to render it properly
displacy.render(ex1,style='dep',jupyter=True)
In [ ]:
# Visualizing Named Entity Recognition
#displacy.render(ex1,style='ent',jupyter=True,options={'distance':140})
displacy.render(ex1,style='ent',jupyter=True)

Named Entity Recognition or Detection

  • Locating and classifying entities in text into predefined categories of real-world objects (person, organization, location, etc.)
  • .ents
  • .label_
In [34]:
wikitext = nlp(u"Bill Gates is an American business magnate, investor, author, humanitarian, and principal founder of Microsoft Corporation")
In [35]:
wikitext2 = nlp(u"Linus Benedict Torvalds is a Finnish-American software engineer who is the creator, and for a long time, principal developer of the Linux kernel, which became the kernel for operating systems such as the Linux operating systems, Android, and Chrome OS.")
In [36]:
for entity in wikitext.ents:
    print(entity.text,entity.label_)
Bill Gates PERSON
American NORP
Microsoft Corporation ORG
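
Each entity is a Span, so it also carries character offsets into the original text, which is useful for highlighting. A minimal sketch:

for entity in wikitext.ents:
    print(entity.text, entity.label_, entity.start_char, entity.end_char)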
In [37]:
# Visualize With DiSplaCy
displacy.render(wikitext,style='ent',jupyter=True)
[displaCy renders the sentence with 'Bill Gates' (PERSON), 'American' (NORP) and 'Microsoft Corporation' (ORG) highlighted]
In [38]:
# Visualize With DiSplaCy
displacy.render(wikitext2,style='ent',jupyter=True)
[displaCy renders the sentence with entities such as 'Linus Benedict Torvalds', 'Finnish', 'Linux' and 'Android' highlighted]
In [39]:
exercise1 = nlp(u"All the faith he had had had had no effect on the outcome of his life")
# in each 'had had' pair, the first is an auxiliary while the second is the main verb
exercise2 = nlp("The man the professor the student has studies Rome.")
# i.e. the student has the professor who knows the man who studies Rome
In [41]:
# Parts of speech for Confusing words
for word in exercise1:
    print((word.text,word.pos_,word.tag_,word.dep_))
('All', 'ADJ', 'PDT', 'predet')
('the', 'DET', 'DT', 'det')
('faith', 'NOUN', 'NN', 'ROOT')
('he', 'PRON', 'PRP', 'nsubj')
('had', 'VERB', 'VBD', 'aux')
('had', 'VERB', 'VBN', 'aux')
('had', 'VERB', 'VBN', 'aux')
('had', 'VERB', 'VBN', 'relcl')
('no', 'DET', 'DT', 'det')
('effect', 'NOUN', 'NN', 'dobj')
('on', 'ADP', 'IN', 'prep')
('the', 'DET', 'DT', 'det')
('outcome', 'NOUN', 'NN', 'pobj')
('of', 'ADP', 'IN', 'prep')
('his', 'ADJ', 'PRP$', 'poss')
('life', 'NOUN', 'NN', 'pobj')
In [42]:
displacy.render(exercise1,style='dep',jupyter=True)
In [43]:
# Parts of speech for Confusing words
for word in exercise2:
    print((word.text,word.pos_,word.tag_,word.dep_))
('The', 'DET', 'DT', 'det')
('man', 'NOUN', 'NN', 'nsubj')
('the', 'DET', 'DT', 'det')
('professor', 'NOUN', 'NN', 'appos')
('the', 'DET', 'DT', 'det')
('student', 'NOUN', 'NN', 'appos')
('has', 'VERB', 'VBZ', 'ROOT')
('studies', 'NOUN', 'NNS', 'dobj')
('Rome', 'PROPN', 'NNP', 'dobj')
('.', 'PUNCT', '.', 'punct')
In [45]:
displacy.render(exercise2,style='dep',jupyter=True)
In [46]:
displacy.render(exercise2,style='ent',jupyter=True)
[displaCy renders the sentence with 'Rome' highlighted as an entity]

Text Normalization and Word Inflection

  • Word inflection ==> the syntactic differences between word forms (e.g. study, studies, studying)
  • Normalization == reducing a word to its base/root form
  • Lemmatization
    • reduces a word to its base form (lemma) based on its intended meaning
  • Stemming
    • cuts off prefixes/suffixes to reduce a word to a base form (see the NLTK sketch after the lemmatization examples below)
  • Word Shape Analysis
In [47]:
## Lemmatization  
docx_lemma = nlp("studying student study studies studio studious")
In [53]:
for token in docx_lemma:
    print(token.text ,"=>",token.lemma_,token.pos_)
studying => study VERB
student => student NOUN
study => study NOUN
studies => study NOUN
studio => studio NOUN
studious => studious ADJ
In [56]:
docx_lemma1 = nlp("better goods run running die dies dye dying dice")
In [57]:
for word in docx_lemma1:
    print(word.text,"=>",word.lemma_,word.pos_)
better => good ADJ
goods => good NOUN
run => run VERB
running => run VERB
die => die NOUN
dies => die VERB
dye => dye NOUN
dying => die VERB
dice => dice NOUN
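
SpaCy ships lemmatization but deliberately no stemmer. If you want stemming for comparison, a minimal sketch using NLTK's PorterStemmer (assumes nltk is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studying", "student", "studies", "studio", "studious"]:
    print(word, "=>", stemmer.stem(word))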

Word Vectors and Similarity

  • object1.similarity(object2) ==> a similarity score based on word vectors (a Doc's vector is the average of its token vectors)
In [58]:
# Species
doc1 = nlp(u"wolf")
doc2 = nlp(u"dog")
In [59]:
# Similarity of object
doc1.similarity(doc2)
Out[59]:
0.6759108958707175
In [60]:
# Synonyms
syn1 = nlp("smart")
syn2 = nlp("clever")
In [61]:
# Similarity of words
syn1.similarity(syn2)
Out[61]:
0.8051825859624082
In [62]:
similarword = nlp("wolf dog fish birds")
In [63]:
for token in similarword:
    print(token.text)
wolf
dog
fish
birds
In [64]:
# Similarity Between Tokens
for token1 in similarword:
    for token2 in similarword:
        print((token1.text,token2.text),"similarity=>",token1.similarity(token2))
('wolf', 'wolf') similarity=> 1.0
('wolf', 'dog') similarity=> 0.52425706
('wolf', 'fish') similarity=> 0.3446895
('wolf', 'birds') similarity=> -0.13539252
('dog', 'wolf') similarity=> 0.52425706
('dog', 'dog') similarity=> 1.0
('dog', 'fish') similarity=> 0.5711805
('dog', 'birds') similarity=> 0.061480716
('fish', 'wolf') similarity=> 0.3446895
('fish', 'dog') similarity=> 0.5711805
('fish', 'fish') similarity=> 1.0
('fish', 'birds') similarity=> 0.38496968
('birds', 'wolf') similarity=> -0.13539252
('birds', 'dog') similarity=> 0.061480716
('birds', 'fish') similarity=> 0.38496968
('birds', 'birds') similarity=> 1.0
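
These scores come from the tokens' word vectors, which you can inspect directly. A minimal sketch:

for token in similarword:
    # has_vector: whether the loaded model carries a vector for this token
    print(token.text, token.has_vector, token.vector_norm, token.vector.shape)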
In [65]:
# Nested list comprehension: all pairwise similarity scores
[token1.similarity(token2) for token2 in similarword for token1 in similarword]
Out[65]:
[1.0,
 0.52425706,
 0.3446895,
 -0.13539252,
 0.52425706,
 1.0,
 0.5711805,
 0.061480716,
 0.3446895,
 0.5711805,
 1.0,
 0.38496968,
 -0.13539252,
 0.061480716,
 0.38496968,
 1.0]

Using DataFrames

In [67]:
docx_similar = [(token1.text,token2.text,token1.similarity(token2)) for token2 in similarword for token1 in similarword]
In [68]:
import pandas as pd
In [69]:
df = pd.DataFrame(docx_similar)
In [70]:
df.head()
Out[70]:
0 1 2
0 wolf wolf 1.000000
1 dog wolf 0.524257
2 fish wolf 0.344689
3 birds wolf -0.135393
4 wolf dog 0.524257
In [71]:
df.columns = ["Token1","Token2","Similarity"]
In [72]:
df.head()
Out[72]:
Token1 Token2 Similarity
0 wolf wolf 1.000000
1 dog wolf 0.524257
2 fish wolf 0.344689
3 birds wolf -0.135393
4 wolf dog 0.524257
In [73]:
# Types
df.dtypes
Out[73]:
Token1         object
Token2         object
Similarity    float64
dtype: object
In [74]:
# Visualization Package with Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [77]:
# Encoding the tokens as integers for plotting
df_viz = df.replace({'wolf':1,'dog':2,'fish':3,'birds':4})
In [78]:
df_viz.head()
Out[78]:
Token1 Token2 Similarity
0 1 1 1.000000
1 2 1 0.524257
2 3 1 0.344689
3 4 1 -0.135393
4 1 2 0.524257
In [79]:
# Plotting with Correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz.corr(),annot=True)
plt.show()
In [80]:
# Plotting without correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz,annot=True)
plt.show()
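
Note that correlating the integer-encoded columns is not the same as plotting the similarity scores themselves. An alternative sketch that pivots the pairwise table into a token-by-token matrix and plots it directly:

# Pivot the long-format pairs into a square similarity matrix
sim_matrix = df.pivot(index="Token1", columns="Token2", values="Similarity")
plt.figure(figsize=(8,6))
sns.heatmap(sim_matrix, annot=True)
plt.show()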

Word Analysis

  • shape of word
  • is_alpha
  • is_stop
In [140]:
doc_word = nlp("Hello SpaCy this is an A.I company product created in 2014")
In [144]:
for token in doc_word:
    print(token.text,"=>",token.shape_,"=>",token.is_stop,"=>",token.pos_)
Hello => Xxxxx => False => INTJ
SpaCy => XxxXx => False => ADJ
this => xxxx => True => DET
is => xx => True => VERB
an => xx => True => DET
A.I => X.X => False => PROPN
company => xxxx => False => NOUN
product => xxxx => False => NOUN
created => xxxx => False => VERB
in => xx => True => ADP
2014 => dddd => False => NUM
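
Related boolean flags cover numbers, URLs and e-mail addresses. A minimal sketch (the example text is made up):

doc_flags = nlp("Email me at user@example.com or visit https://spacy.io in 2014")
for token in doc_flags:
    print(token.text, "=>", token.like_email, token.like_url, token.like_num)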

Noun Chunks

  • a noun plus the words describing the noun
In [131]:
exercise2
Out[131]:
The man the professor the student has studies Rome.
In [137]:
# Noun Phrase
for noun in exercise2.noun_chunks:
    print(noun)
The man
the professor
the student
studies
Rome
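
Each noun chunk also exposes its root token and that root's grammatical role. A minimal sketch:

for chunk in exercise2.noun_chunks:
    print(chunk.text, "=>", chunk.root.text, chunk.root.dep_, chunk.root.head.text)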
