Natural Language Processing with spaCy and Python
In this lesson, we will be looking at spaCy, an industrial-strength Natural Language Processing library. spaCy was developed by Explosion AI (Matthew Honnibal and his team).
spaCy is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), POS tagging, dependency parsing, word vectors, and more.
Let us see how to install spaCy on our system.
Installing the Library
- sudo pip install spacy
- sudo python -m spacy download en
- sudo python -m spacy download fr
Installing using Conda
- conda install -c conda-forge spacy
- sudo python -m spacy download en
- sudo python -m spacy download fr
Installing On Windows using Conda
- conda config --add channels conda-forge
- conda update anaconda
- conda install tqdm
- conda install -c conda-forge spacy
- sudo python -m spacy download en
Downloading Models for Other Languages
- sudo python -m spacy download de # German
- sudo python -m spacy download es # Spanish
- sudo python -m spacy download xx # Multilanguage
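To verify the installation, spaCy ships a validate command that checks the downloaded models against your spaCy version; a quick check:
- python -m spacy validate
- python -c "import spacy; print(spacy.__version__)"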
In [2]:
# Loading the package and the English model
import spacy
nlp = spacy.load("en")
# Alternative: import en_core_web_sm; nlp = en_core_web_sm.load()
Reading a Document or Text
In [3]:
ex1 = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [4]:
# Reading the text of the document
ex1.text
Out[4]:
In [5]:
# Reading a file
myfile = open("samplefile.txt").read()
In [6]:
doc_file = nlp(myfile)
In [7]:
doc_file.text
Out[7]:
Sentence Tokens
- Tokenization == splitting or segmenting the text into sentences or tokens
- Sentence tokenization segments text into sentences; word tokenization segments it into words, punctuation marks, etc.
- .sents
In [8]:
# List of Sentences in File
list(doc_file.sents)
Out[8]:
In [9]:
# Sentence Tokens
for sentence in doc_file.sents:
    print(sentence)
Word Tokens
- Splitting or segmenting the text into words
- .text
In [10]:
docx = nlp("Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob")
In [11]:
# Word Tokens
for token in docx:
    print(token.text)
In [12]:
# List of Word Tokens
[token.text for token in docx]
Out[12]:
In [14]:
# Compare with naive splitting on spaces (punctuation stays attached to words)
simpletext = "Hello world, this is SpaCy one of the fastest Natural Language Processing tools just like NLTK and TextBlob"
simpletext.split(" ")
Out[14]:
More About Words
- .shape_ ==> the orthographic shape of the word, e.g. capitalized, lowercase, digits
- .is_alpha ==> returns a boolean (True or False): does the token consist of alphabetic characters?
- .is_stop ==> returns a boolean (True or False): is the token a stop word?
In [15]:
docx
Out[15]:
In [17]:
# Word Shape
for word in docx:
    print("Token =>",word.text)
    print("Shape of Token =>",word.shape_)
    print("Is it alphabetic =>",word.is_alpha)
    print("Is it a Stopword =>",word.is_stop)
Part of Speech Tagging
Part-of-speech (POS) tagging: assigning word types to tokens, like verb or noun.
- .pos ==> the integer ID of the coarse-grained part-of-speech tag
- .pos_ ==> returns the readable string representation of the attribute
- .tag ==> the integer ID of the fine-grained tag
- .tag_ ==> returns the readable string representation of the attribute
In [18]:
docx
Out[18]:
In [19]:
# Parts of Speech (coarse-grained), as string and integer ID
for word in docx:
    print((word,word.pos_,word.pos))
In [20]:
# Parts of Speech (fine-grained tag abbreviation)
for word in docx:
    print((word,word.tag_))
In [21]:
# Fine-grained tag abbreviation and its integer ID
for word in ex1:
    print((word,word.tag_,word.tag))
If you want to know the meaning of a POS abbreviation
- spacy.explain('DT')
In [22]:
spacy.explain('NN')
Out[22]:
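The same helper can describe every tag in a doc in one pass; a minimal sketch reusing the docx doc from above:
# Expand each fine-grained tag into its plain-English description
for word in docx:
    print((word.text, word.tag_, spacy.explain(word.tag_)))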
Syntactic Dependency
- It tells us the grammatical relation between tokens
In [28]:
docx2 = nlp(u"This tool was written by Matt Honnibal, a computational linguist, in Cython.")
In [29]:
for word in docx2:
    print((word,word.tag_,word.dep_))
Visualizing Dependencies Using displaCy
- from spacy import displacy
- displacy.serve(doc, style='dep') # for standalone scripts, see the sketch after this list
- displacy.render(doc, style='dep', jupyter=True) # for Jupyter notebooks
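Outside a notebook, displacy.serve starts a small local web server (on port 5000 by default) and renders the visualization in the browser; a minimal sketch using the docx2 doc defined above:
# Serve the dependency visualization in the browser (blocks until interrupted)
from spacy import displacy
displacy.serve(docx2, style='dep')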
In [30]:
docx2
Out[30]:
In [31]:
# To display dependencies and other visualizations
from spacy import displacy
In [32]:
# For Jupyter notebooks you can set jupyter=True to render it properly
displacy.render(ex1,style='dep',jupyter=True)
In [ ]:
# Visualizing Named Entity Recognition
#displacy.render(ex1,style='ent',jupyter=True,options={'distance':140})
displacy.render(ex1,style='ent',jupyter=True)
Named Entity Recognition or Detection
- Locating and classifying the named entities in a text into predefined categories of real-world objects, such as people and organizations.
- .ents
- .label_
In [34]:
wikitext = nlp(u"Bill Gates is an American business magnate, investor, author, humanitarian, and principal founder of Microsoft Corporation")
In [35]:
wikitext2 = nlp(u"Linus Benedict Torvalds is a Finnish-American software engineer who is the creator, and for a long time, principal developer of the Linux kernel, which became the kernel for operating systems such as the Linux operating systems, Android, and Chrome OS.")
In [36]:
for entity in wikitext.ents:
    print(entity.text,entity.label_)
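The entity label abbreviations can be expanded with spacy.explain, the same helper used for the POS tags above:
# Expand each entity label (e.g. PERSON, ORG, GPE) into a description
for entity in wikitext.ents:
    print((entity.text, entity.label_, spacy.explain(entity.label_)))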
In [37]:
# Visualize with displaCy
displacy.render(wikitext,style='ent',jupyter=True)
In [38]:
# Visualize with displaCy
displacy.render(wikitext2,style='ent',jupyter=True)
In [39]:
excercise1 = nlp(u"All the faith he had had had had no effect on the outcome of his life")
# In "had had had had", the first pair is part of a modifier while the second pair contains the main verb of the sentence
excercise2 = nlp("The man the professor the student has studies Rome.")
# Intended reading: the student has the professor, who knows the man, who studies Rome
In [41]:
# Parts of speech for the confusing sentence
for word in excercise1:
    print((word.text,word.pos_,word.tag_,word.dep_))
In [42]:
displacy.render(excercise1,style='dep',jupyter=True)
In [43]:
# Parts of speech for the confusing sentence
for word in excercise2:
    print((word.text,word.pos_,word.tag_,word.dep_))
In [45]:
displacy.render(excercise2,style='dep',jupyter=True)
In [46]:
displacy.render(excercise2,style='ent',jupyter=True)
Text Normalization and Word Inflection
- Word inflection == the syntactic differences between word forms
- Normalization == reducing a word to its base/root form
- Lemmatization
  - reduces a word to its base form according to its intended meaning, using vocabulary and morphology
- Stemming
  - cuts off prefixes/suffixes to reduce a word to a base form (see the sketch after the lemmatization examples below)
- Word Shape Analysis
In [47]:
## Lemmatization
docx_lemma = nlp("studying student study studies studio studious")
In [53]:
for token in docx_lemma:
    print(token.text,"=>",token.lemma_,token.pos_)
In [56]:
docx_lemma1 = nlp("better goods run running die dies dye dying dice")
In [57]:
for word in docx_lemma1:
    print(word.text,"=>",word.lemma_,word.pos_)
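spaCy itself does not ship a stemmer, since lemmatization generally gives better results. For comparison with the stemming approach mentioned above, here is a minimal sketch using NLTK's PorterStemmer (this assumes the nltk package is installed):
# Stemming the same words with NLTK's PorterStemmer for contrast
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in "studying student study studies studio studious".split():
    print(word, "=>", stemmer.stem(word))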
Word Vectors and Similarity
- object1.similarity(object2)
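Note that the small 'en' model ships without real word vectors, so similarity scores are approximated from context-sensitive tensors; for more meaningful scores, load a model with vectors such as en_core_web_md or en_core_web_lg. A quick way to check vector availability:
# Check whether the loaded model provides a true vector for a token
token = nlp(u"wolf")[0]
print(token.has_vector, token.vector_norm, token.is_oov)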
In [58]:
# Species
doc1 = nlp(u"wolf")
doc2 = nlp(u"dog")
In [59]:
# Similarity of object
doc1.similarity(doc2)
Out[59]:
In [60]:
# Synonyms
syn1 = nlp("smart")
syn2 = nlp("clever")
In [61]:
# Similarity of words
syn1.similarity(syn2)
Out[61]:
In [62]:
similarword = nlp("wolf dog fish birds")
In [63]:
for token in similarword:
    print(token.text)
In [64]:
# Similarity Between Tokens
for token1 in similarword:
    for token2 in similarword:
        print((token1.text,token2.text),"similarity=>",token1.similarity(token2))
In [65]:
# Nested list comprehension pattern: [x for b in a for x in b]
[token1.similarity(token2) for token2 in similarword for token1 in similarword]
Out[65]:
Using DataFrames
In [67]:
docx_similar = [(token1.text,token2.text,token1.similarity(token2)) for token2 in similarword for token1 in similarword]
In [68]:
import pandas as pd
In [69]:
df = pd.DataFrame(docx_similar)
In [70]:
df.head()
Out[70]:
In [71]:
df.columns = ["Token1","Token2","Similarity"]
In [72]:
df.head()
Out[72]:
In [73]:
# Types
df.dtypes
Out[73]:
In [74]:
# Visualization Package with Seaborn
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [77]:
# Encoding the tokens as integers for plotting
df_viz = df.replace({'wolf':1,'dog':2,'fish':3,'birds':4})
In [78]:
df_viz.head()
Out[78]:
In [79]:
# Plotting with Correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz.corr(),annot=True)
plt.show()
In [80]:
# Plotting without correlation
plt.figure(figsize=(20,10))
sns.heatmap(df_viz,annot=True)
plt.show()
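Integer-encoding the tokens and correlating them is a roundabout way to visualize similarity; pivoting the DataFrame into a square token-by-token matrix and heatmapping that directly is usually clearer. A sketch using the df built above:
# Pivot into a square similarity matrix and plot it directly
sim_matrix = df.pivot(index="Token1", columns="Token2", values="Similarity")
plt.figure(figsize=(20,10))
sns.heatmap(sim_matrix, annot=True)
plt.show()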
Word Analysis
- shape of word
- is_alpha
- is_stop
In [140]:
doc_word = nlp("Hello SpaCy this is an A.I company product created in 2014")
In [144]:
for token in doc_word:
    print(token.text,"=>",token.shape_,"=>",token.is_stop,"=>",token.pos_)
Noun Chunks
- a noun + the words describing the noun
In [131]:
excercise2
Out[131]:
In [137]:
# Noun Phrases
for noun in excercise2.noun_chunks:
    print(noun)
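Each chunk also exposes its root token and that token's dependency relation to its head; a minimal sketch:
# Inspect the root of each noun chunk and its relation to its head
for chunk in excercise2.noun_chunks:
    print((chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text))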