Natural Language Processing with Polyglot
In this tutorial we will be exploring another Python NLP package called Polyglot.
Polyglot is a natural language pipeline that supports massive multilingual applications. Its learning curve is similar to TextBlob's, so it is easy to pick up if you already know TextBlob.
Installation on Unix
- sudo apt-get install python-numpy libicu-dev
- pip install polyglot
Installation on Windows
To install on Windows, you can either use the normal pip method or try the alternative method that follows it.
- pip install polyglot
Or try this method:
Download the PyCLD2 and PyICU wheels from
- https://www.lfd.uci.edu/~gohlke/pythonlibs/
- pip install pycld2-0.31-cp36-cp36m-win_amd64.whl
- pip install PyICU-1.9.8-cp36-cp36m-win_amd64.whl
- pip install Morfessor-2.0.4-py2.py3-none-any.whl
- git clone https://github.com/aboSamoor/polyglot.git
- cd polyglot
- python setup.py install
Some of the tasks need pretrained models, which you download separately:
- polyglot download embeddings2.en
- polyglot download ner2.en
- polyglot download sentiment2.en
- polyglot download pos2.en
- polyglot download morph2.en
- polyglot download transliteration2.ar
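The model names above appear to follow a `<task>2.<language code>` pattern. As a minimal sketch, assuming that pattern holds for the language you need, the commands can be generated for any language code. The helper name `download_commands` is hypothetical and not part of Polyglot, and it only formats strings: it does not check that a model exists for the requested language.

```python
# Hypothetical helper (not part of Polyglot): builds "polyglot download"
# commands from the naming pattern <task>2.<language code> seen above.
# It only formats strings and does not verify that a model exists.
TASKS = ["embeddings", "ner", "sentiment", "pos", "morph"]


def download_commands(lang_code, tasks=TASKS):
    """Return one download command per task for the given language code."""
    return ["polyglot download {}2.{}".format(task, lang_code) for task in tasks]


for command in download_commands("en"):
    print(command)
```

Run each printed command in a terminal, and skip any task you do not plan to use.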
Uses and Applications
- Fundamentals or Basics of NLP
- Transliteration
- Named Entity Recognition
- Sentiment Analysis
Let us begin with Polyglot.
Tokenization
- Splitting text into words
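Before the Polyglot calls that follow, the idea of word tokenization can be sketched with the standard library alone. This is a rough regex stand-in, not how Polyglot tokenizes, and the function name `simple_word_tokens` is made up for illustration.

```python
import re


def simple_word_tokens(text):
    """Split text into word tokens and single punctuation marks.

    A word is a run of word characters, optionally joined by apostrophes
    (so "what're" stays together). Any other non-space character becomes
    its own token.
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


print(simple_word_tokens("He likes reading and painting"))
```

A rule like this is tuned to English-style spacing and punctuation, which is why a multilingual library is the better choice for real text.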
In [47]:
# Load packages
import polyglot
from polyglot.text import Text,Word
In [48]:
# Word Tokens
docx = Text(u"He likes reading and painting")
In [49]:
docx.words
Out[49]:
In [50]:
docx2 = Text(u"He exclaimed, 'what're you doing? Reading?'.")
In [51]:
docx2.words
Out[51]:
In [52]:
# Sentence tokens
docx3 = Text(u"He likes reading and painting. He exclaimed, 'what're you doing? Reading?'.")
In [53]:
docx3.sentences
Out[53]:
Parts of Speech Tagging
- polyglot download embeddings2.la
- pos_tags
In [54]:
docx
Out[54]:
In [55]:
docx.pos_tags
Out[55]:
Language Detection
- polyglot.detect
- language.name
- language.code
In [56]:
docx
Out[56]:
In [57]:
docx.language.name
Out[57]:
In [58]:
docx.language.code
Out[58]:
In [59]:
from polyglot.detect import Detector
In [60]:
en_text = "He is a student "
fr_text = "Il est un étudiant"
ru_text = "Он студент"
In [67]:
detect_en = Detector(en_text)
detect_fr = Detector(fr_text)
detect_ru = Detector(ru_text)
In [63]:
print(detect_en.language)
In [66]:
print(detect_fr.language)
In [68]:
print(detect_ru.language)
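To get a feel for what a detector has to work with, here is a toy sketch that looks only at the Unicode script of each letter. It is not Polyglot's detection method, and `dominant_script` is a hypothetical name used only for this illustration.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters of text, or None.

    The script is approximated by the first word of each letter's Unicode
    name, for example "LATIN" or "CYRILLIC".
    """
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else None


for sample in ["He is a student ", "Il est un étudiant", "Он студент"]:
    print(sample.strip(), "->", dominant_script(sample))
```

The script separates the Russian sample from the other two, but English and French both come out as LATIN, so a real detector needs more evidence than the alphabet alone.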
Sentiment Analysis
- polarity
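A polarity score can be pictured as word-level scores combined into one number for the text. The sketch below assumes a tiny hand-made lexicon and averages the scores of the words that carry sentiment. The lexicon entries and the name `toy_polarity` are illustrative assumptions, not Polyglot's data or API.

```python
# Toy lexicon: illustrative entries only, not Polyglot's sentiment data.
TOY_LEXICON = {"likes": 1, "loves": 1, "hates": -1, "dislikes": -1}


def toy_polarity(text):
    """Average the non-zero word scores; return 0.0 if no word is scored."""
    scores = [TOY_LEXICON.get(word.lower(), 0) for word in text.split()]
    scored = [s for s in scores if s != 0]
    return sum(scored) / len(scored) if scored else 0.0


print(toy_polarity("He likes reading and painting"))
print(toy_polarity("He hates reading and playing"))
```

With this lexicon the first sentence scores 1.0 and the second -1.0. The values Polyglot returns depend on its own downloaded sentiment model.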
In [71]:
docx4 = Text(u"He hates reading and playing")
In [69]:
docx
Out[69]:
In [70]:
docx.polarity
Out[70]:
In [72]:
docx4.polarity
Out[72]:
Named Entities
- entities
In [73]:
docx5 = Text(u"John Jones was an FBI detective")
In [74]:
docx5.entities
Out[74]:
Morphology
- A morpheme is the smallest grammatical unit in a language.
- A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.
- morphemes
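The definition above can be illustrated with a toy splitter that strips at most one known prefix and one known suffix. This is only a sketch with made-up affix lists. Polyglot's results come from the downloaded morphology model, not from rules like these, and `toy_morphemes` is a hypothetical name.

```python
# Illustrative affix lists; a real analyser learns these from data.
PREFIXES = ("pre", "un", "re")
SUFFIXES = ("ing", "ed", "s")


def toy_morphemes(word):
    """Split word into [prefix?, stem, suffix?] using the lists above."""
    before, after = [], []
    # Strip at most one prefix, keeping at least three characters of stem.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            before.append(prefix)
            word = word[len(prefix):]
            break
    # Strip at most one suffix under the same length constraint.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            after.append(suffix)
            word = word[:-len(suffix)]
            break
    return before + [word] + after


print(toy_morphemes("preprocessing"))
```

The sketch splits "preprocessing" into pre, process and ing, but it wrongly splits "reading" into re and ading, which shows why a trained model is used in practice.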
In [75]:
docx6 = Text(u"preprocessing")
In [76]:
docx6.morphemes
Out[76]:
Transliteration
In [77]:
# Load the transliterator
from polyglot.transliteration import Transliterator
translit = Transliterator(source_lang='en',target_lang='fr')
In [78]:
translit.transliterate(u"working")
Out[78]:
Thanks, and happy coding!
By Jesse E. Agbe (JCharis)