Natural Language Processing with TextaCy & SpaCy
SpaCy is a high-performance NLP library that handles many NLP tasks with ease and speed. Let us explore TextaCy, another library built on top of SpaCy.
Textacy
- Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance SpaCy library.
- Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.
- Uses
- Text preprocessing
- Keyword in Context
- Topic modeling
- Information Extraction
- Keyterm extraction
- Text and readability statistics
- Emotional valence analysis
- Quotation attribution
Installation
- pip install textacy
- conda install -c conda-forge textacy
NB: If you have issues installing with pip on Windows, use conda instead.
Downloading Dataset
- python -m textacy download capitol_words
For Language Detection
- pip install textacy[lang]
- pip install cld2-cffi
Let us get started…
In [1]:
# Loading Packages
import textacy
In [2]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."
Text Preprocessing With Textacy
- textacy.preprocess_text()
- textacy.preprocess.*
- Punctuation
- Lowercase
- URLs
- Phone numbers
- Currency symbols
- Emails
In [3]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.
Don’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. """
In [4]:
# Removing punctuation
textacy.preprocess.remove_punct(raw_text)
Out[4]:
In [5]:
# Removing urls
textacy.preprocess.replace_urls(raw_text,replace_with='TWITTER')
Out[5]:
In [6]:
# Replacing Currency Symbols
textacy.preprocess.replace_currency_symbols(raw_text,replace_with='USD')
Out[6]:
In [7]:
# Replacing Emails
textacy.preprocess.replace_emails(raw_text,replace_with='EMAIL')
In [8]:
# Preprocess All
textacy.preprocess_text(raw_text,lowercase=True,no_punct=True,no_urls=True)
Out[8]:
In [9]:
# Processing a Text on a File
textacy.preprocess_text(open("sample.txt").read(),lowercase=True)
Out[9]:
Reading a Text or a Document
- textacy.Doc(your_text)
- textacy.io.read_text(filepath)
In [10]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."
In [11]:
# With Doc
# Requires Language Pkg Model
docx_textacy = textacy.Doc(example)
In [12]:
docx_textacy
Out[12]:
In [13]:
type(docx_textacy)
Out[13]:
In [14]:
# Using spacy
import spacy
nlp = spacy.load('en')
In [15]:
docx_spacy = nlp(example)
In [16]:
docx_spacy
Out[16]:
In [17]:
type(docx_spacy)
Out[17]:
In [18]:
# Both are of the type Doc
Reading A File
In [19]:
# Method 1
file_textacy = textacy.Doc(open("example.txt").read())
In [20]:
file_textacy
Out[20]:
In [21]:
# Method 2
# Creates a generator
# file_textacy2 = textacy.io.read_text('example.txt')
file_textacy2 = textacy.io.read_text('example.txt',lines=True)
In [22]:
type(file_textacy2)
Out[22]:
In [23]:
for text in file_textacy2:
    docx_file = textacy.Doc(text)
    print(docx_file)
Working With Multiple Text Documents
- textacy.io.read_text(text,lines=True)
- textacy.io.read_json(text,lines=True)
- textacy.io.csv.read_csv(text)
Analysis of Text
- Tokenization
- Ngrams
- Named Entities
- Key Terms & Text Rank
- Basic Counts/Frequency & Stats
- Bag of Terms
In [24]:
docx_spacy
Out[24]:
In [25]:
# Using SpaCy Named Entities Recognition
[ (entity.text,entity.label_) for entity in docx_spacy.ents ]
Out[25]:
In [26]:
# Using Textacy Named Entity Extraction
list(textacy.extract.named_entities(docx_textacy))
Out[26]:
In [27]:
# NGrams with Textacy
# NB SpaCy method would be to use noun Phrases
# Tri Grams
list(textacy.extract.ngrams(docx_textacy,3))
Out[27]:
Info Extraction/Summary
- semistructured_statements
In [39]:
docx = textacy.Doc(open("example1.txt").read())
In [41]:
# Extract Points
statements = textacy.extract.semistructured_statements(docx,"Jerusalem")
In [42]:
statements
Out[42]:
In [43]:
# Prints Results
print("This text is about: ")
for statement in statements:
    subject,verb,point = statement
    print(f':{point}')
Key Terms and Text Rank
- Textacy
- PyTextRank
In [44]:
# Load Keyterms for TextRank & Srank
import textacy.keyterms
# You can lemmatize it or normalize it for better result
In [45]:
mylemma = [token.lemma_ for token in docx_textacy]
In [46]:
mylemma
You can also get the video tutorial here
By Jesse JCharis
Jesus Saves