Natural Language Processing with TextaCy & SpaCy
SpaCy is a high-performance NLP library that handles many NLP tasks with ease and speed. Let us explore TextaCy, another library built on top of SpaCy.
Textacy
- Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance SpaCy library.
- Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.
- Uses
- Text preprocessing
- Keyword in Context
- Topic modeling
- Information Extraction
- Keyterm extraction
- Text and readability statistics
- Emotional valence analysis
- Quotation attribution
Installation
- pip install textacy
- conda install -c conda-forge textacy
NB: If you have issues installing with pip on Windows, use conda instead.
Downloading Dataset
- python -m textacy download capitol_words
For Language Detection
- pip install textacy[lang]
- pip install cld2-cffi
Let us get started…
In [1]:
# Loading Packages
import textacy
In [2]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."
Text Preprocessing With Textacy
- textacy.preprocess_text()
- textacy.preprocess.*
- Punctuation
- Lowercase
- URLs
- Phone numbers
- Currency symbols
- Emails
In [3]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.
Don’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. """
In [4]:
# Removing punctuation
textacy.preprocess.remove_punct(raw_text)
Out[4]:
In [5]:
# Removing urls
textacy.preprocess.replace_urls(raw_text,replace_with='TWITTER')
Out[5]:
In [6]:
# Replacing Currency Symbols
textacy.preprocess.replace_currency_symbols(raw_text,replace_with='USD')
Out[6]:
In [7]:
# Replacing Emails
textacy.preprocess.replace_emails(raw_text,replace_with='EMAIL')
In [8]:
# Preprocess All
textacy.preprocess_text(raw_text,lowercase=True,no_punct=True,no_urls=True)
Out[8]:
In [9]:
# Processing a Text on a File
textacy.preprocess_text(open("sample.txt").read(),lowercase=True)
Out[9]:
Reading a Text or a Document
- textacy.Doc(your_text)
- textacy.io.read_text(filepath)
In [10]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."
In [11]:
# With Doc
# Requires Language Pkg Model
docx_textacy = textacy.Doc(example)
In [12]:
docx_textacy
Out[12]:
In [13]:
type(docx_textacy)
Out[13]:
In [14]:
# Using spacy
import spacy
nlp = spacy.load('en')
In [15]:
docx_spacy = nlp(example)
In [16]:
docx_spacy
Out[16]:
In [17]:
type(docx_spacy)
Out[17]:
In [18]:
# Both are of the type Doc
Reading A File
In [19]:
# Method 1
file_textacy = textacy.Doc(open("example.txt").read())
In [20]:
file_textacy
Out[20]:
In [21]:
# Method 2
# Creates a generator
# file_textacy2 = textacy.io.read_text('example.txt')
file_textacy2 = textacy.io.read_text('example.txt',lines=True)
In [22]:
type(file_textacy2)
Out[22]:
In [23]:
for text in file_textacy2:
    docx_file = textacy.Doc(text)
    print(docx_file)
Working With Multiple Text Documents
- textacy.io.read_text(text,lines=True)
- textacy.io.read_json(text,lines=True)
- textacy.io.csv.read_csv(text)
Analysis of Text
- Tokenization
- Ngrams
- Named Entities
- Key Terms & Text Rank
- Basic Counts/Frequency & Stats
- Bag of Terms
In [24]:
docx_spacy
Out[24]:
In [25]:
# Using SpaCy Named Entities Recognition
[ (entity.text,entity.label_) for entity in docx_spacy.ents ]
Out[25]:
In [26]:
# Using Textacy Named Entity Extraction
list(textacy.extract.named_entities(docx_textacy))
Out[26]:
In [27]:
# NGrams with Textacy
# NB SpaCy method would be to use noun Phrases
# Tri Grams
list(textacy.extract.ngrams(docx_textacy,3))
Out[27]:
Info Extraction/Summary
- semistructured_statements
In [39]:
docx = textacy.Doc(open("example1.txt").read())
In [41]:
# Extract Points
statements = textacy.extract.semistructured_statements(docx,"Jerusalem")
In [42]:
statements
Out[42]:
In [43]:
# Prints Results
print("This text is about: ")
for statement in statements:
    subject,verb,point = statement
    print(f':{point}')
Key Terms and Text Rank
- Textacy
- PyTextRank
In [44]:
# Load Keyterms for TextRank & Srank
import textacy.keyterms
# You can lemmatize it or normalize it for better result
In [45]:
mylemma = [token.lemma_ for token in docx_textacy]
In [46]:
mylemma
You can also get the video tutorial here
By Jesse JCharis
Jesus Saves