Natural Language Processing with TextaCy & SpaCy

SpaCy is a high-performance NLP library that makes many common NLP tasks fast and easy. Let us explore TextaCy, a library built on top of SpaCy.

Textacy

  • Textacy is a Python library for performing higher-level natural language processing (NLP) tasks.
  • It is built on the high-performance SpaCy library.
  • Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.
  • Uses
    • Text preprocessing
    • Keyword in context
    • Topic modeling
    • Information extraction
    • Keyterm extraction
    • Text and readability statistics
    • Emotional valence analysis
    • Quotation attribution

Installation

  • pip install textacy
  • conda install -c conda-forge textacy

NB: If you run into issues installing on Windows, use conda instead of pip.

Downloading Dataset

  • python -m textacy download capitol_words

For Language Detection

  • pip install textacy[lang]
  • pip install cld2-cffi

Let us get started…

In [1]:
# Loading Packages
import textacy
In [2]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."

Text Preprocessing With Textacy

  • textacy.preprocess_text()
  • textacy.preprocess.
    • Punctuation
    • Lowercase
    • URLs
    • Phone numbers
    • Currency
    • Emails
In [3]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.
Don’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. """
In [4]:
# Removing Punctuation and Uppercase
textacy.preprocess.remove_punct(raw_text)
Out[4]:
' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for $50 although in Paris it will cost $30 dollars \nDon t document the problem  fix it This is from https   twitter com codewisdom lang=en  '
In [5]:
# Removing urls
textacy.preprocess.replace_urls(raw_text,replace_with='TWITTER')
Out[5]:
' The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.\nDon’t document the problem, fix it.This is from TWITTER '
In [6]:
# Replacing Currency Symbols
textacy.preprocess.replace_currency_symbols(raw_text,replace_with='USD')
Out[6]:
' The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for USD50 although in Paris it will cost USD30 dollars.\nDon’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. '
In [7]:
# Replacing Emails (note: raw_text contains no emails, so it is returned unchanged)
textacy.preprocess.replace_emails(raw_text,replace_with='EMAIL')
In [8]:
# Preprocess All
textacy.preprocess_text(raw_text,lowercase=True,no_punct=True,no_urls=True)
Out[8]:
'the best programs are the ones written when the programmer is supposed to be working on something else mike bought the book for $50 although in paris it will cost $30 dollars don t document the problem fix it this is from url'
In [9]:
# Processing a Text on a File
textacy.preprocess_text(open("sample.txt").read(),lowercase=True)
Out[9]:
'the best programs, are the ones written when the programmer is supposed to be working on something else.mike bought the book for $50 although in paris it will cost $30 dollars.\ndon’t document the problem, fix it.this is from https://twitter.com/codewisdom?lang=en.\ndebuggers don\'t remove bugs. they only show them in slow motion.\n"if at first you don’t succeed, call it version 1.0."\nin theory, there is no difference between theory and practice. but, in practice, there is.\n"commenting your code is like cleaning your bathroom - you never want to do it, but it really does create a more pleasant experience for you and your guests." - ryan campbell\nyour problem is another\'s solution; your solution will be their problem.'

Reading a Text or A Document

  • textacy.Doc(your_text)
  • textacy.io.read_text(your_text)
In [10]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."
In [11]:
# With Doc
# Requires Language Pkg Model
docx_textacy = textacy.Doc(example)
In [12]:
docx_textacy
Out[12]:
Doc(82 tokens; "Textacy is a Python library for performing high...")
In [13]:
type(docx_textacy)
Out[13]:
textacy.doc.Doc
In [14]:
# Using spacy
import spacy 
nlp = spacy.load('en')
In [15]:
docx_spacy = nlp(example)
In [16]:
docx_spacy
Out[16]:
Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.
In [17]:
type(docx_spacy)
Out[17]:
spacy.tokens.doc.Doc
Both are of the type Doc.

Reading A File

In [19]:
# Method 1
file_textacy = textacy.Doc(open("example.txt").read())
In [20]:
file_textacy
Out[20]:
Doc(471 tokens; "The nativity of Jesus or birth of Jesus is desc...")
In [21]:
# Method 2
# Creates a generator
# file_textacy2 = textacy.io.read_text('example.txt')
file_textacy2 = textacy.io.read_text('example.txt',lines=True)
In [22]:
type(file_textacy2)
Out[22]:
generator
In [23]:
for text in file_textacy2:
    docx_file = textacy.Doc(text)
    print(docx_file)
Doc(148 tokens; "The nativity of Jesus or birth of Jesus is desc...")

Working With Multiple Text Documents

  • textacy.io.read_text(text,lines=True)
  • textacy.io.read_json(text,lines=True)
  • textacy.io.csv.read_csv(text)
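The lines=True readers stream one document per line instead of loading the whole file at once. As a rough illustration of the idea (a stdlib sketch, not textacy's actual implementation), a line-by-line JSON reader looks like this:

```python
import io
import json

def read_json_lines(fileobj):
    """Yield one parsed record per line (JSON Lines format),
    mirroring the idea behind textacy.io.read_json(..., lines=True)."""
    for line in fileobj:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

# Usage with an in-memory file; a real call would use open("docs.jsonl")
sample = io.StringIO('{"text": "First doc."}\n{"text": "Second doc."}\n')
texts = [record["text"] for record in read_json_lines(sample)]
print(texts)  # ['First doc.', 'Second doc.']
```

Because the reader is a generator, each document can be handed to textacy.Doc one at a time, just as in the loop over read_text above.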

Analysis of Text

  • Tokenization
  • Ngrams
  • Named Entities
  • Key Terms & Text Rank
  • Basic Counts/Frequency & Stats
  • Bag of Terms
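Before diving into the textacy calls, the "Basic Counts/Frequency" idea in the list above is easy to see with plain Python: lowercase the tokens, strip punctuation, and count. This is only a sketch of the concept; textacy exposes richer statistics on its Doc objects, built on SpaCy's proper tokenization.

```python
from collections import Counter

text = ("Textacy is a Python library for performing higher-level "
        "natural language processing (NLP) tasks, built on the "
        "high-performance Spacy library.")

# Naive whitespace tokenization plus punctuation stripping --
# a crude stand-in for the SpaCy tokenization textacy relies on
tokens = [tok.strip("(),.").lower() for tok in text.split()]
freqs = Counter(tokens)
print(freqs.most_common(3))
```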
In [24]:
docx_spacy
Out[24]:
Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.
In [25]:
# Using SpaCy Named Entities Recognition
[ (entity.text,entity.label_) for entity in docx_spacy.ents ]
Out[25]:
[('NLP', 'ORG'), ('Spacy', 'GPE')]
In [26]:
# Using Textacy Named Entity Extraction
list(textacy.extract.named_entities(docx_textacy))
Out[26]:
[NLP, Spacy]
In [27]:
# NGrams with Textacy
# NB: the SpaCy equivalent would be to use noun chunks (noun phrases)
# Tri Grams

list(textacy.extract.ngrams(docx_textacy,3))
Out[27]:
[library for performing,
 level natural language,
 natural language processing,
 performance Spacy library,
 With the basics,
 focuses on tasks,
 availability of tokenized,
 emotional valence analysis]
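
Under the hood, n-grams are just sliding windows of n consecutive tokens (textacy additionally applies filters, e.g. dropping windows made of punctuation or stop words). The windowing itself can be sketched in a few lines:

```python
def ngrams(tokens, n):
    """Return all windows of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "with", "textacy"]
for gram in ngrams(tokens, 3):
    print(" ".join(gram))
# natural language processing
# language processing with
# processing with textacy
```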

Info Extraction/Summary

  • textacy.extract.semistructured_statements()
In [39]:
docx = textacy.Doc(open("example1.txt").read())
In [41]:
# Extract Points
statements = textacy.extract.semistructured_statements(docx,"Jerusalem")
In [42]:
statements
Out[42]:
<generator object semistructured_statements at 0x7f2403592db0>
In [43]:
# Prints Results
print("This text is about: ")
for statement in statements:
    subject,verb,point = statement
    print(f':{point}')
This text is about: 
:the third-holiest city, after Mecca and Medina.[26][27
Key Terms and Text Rank

  • Textacy
  • PyTextRank
In [44]:
# Load Keyterms for TextRank & SGRank
import textacy.keyterms
# You can lemmatize it or normalize it for better result
In [45]:
mylemma = [(token.lemma_) for token in docx_textacy]
In [46]:
mylemma
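
TextRank ranks words by their centrality in a co-occurrence graph built over the (lemmatized) tokens, which is why lemmatizing first improves the result. A minimal, dependency-free sketch of that core idea, using distinct-neighbour counts in place of the full PageRank iteration that TextRank actually runs, looks like this:

```python
from collections import defaultdict

def cooccurrence_keyterms(tokens, window=2, top_n=3):
    """Score each token by how many distinct neighbours it co-occurs with
    inside a sliding window -- a crude stand-in for TextRank centrality."""
    neighbours = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                neighbours[tok].add(tokens[j])
    scores = {tok: len(nbrs) for tok, nbrs in neighbours.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

tokens = ["natural", "language", "processing", "language", "model", "processing"]
print(cooccurrence_keyterms(tokens))
```

In textacy itself (the 0.6.x API used in this post), the real thing is available as textacy.keyterms.textrank(docx_textacy), which works over the lemmas much like mylemma above.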

You can also get the video tutorial here

Thanks a lot, stay blessed.

By Jesse JCharis

Jesus Saves
