NeatText

NeatText (neattext) is a simple Python NLP package for cleaning textual data and preprocessing text in NLP and ML projects. It was designed to solve the following problems.

Problems NeatText is intended to solve

  • Cleaning unstructured text data
  • Reducing noise (special characters, stopwords)
  • Reducing repetition of the same text-preprocessing code

The NeatText project is maintained by @jcharis, and contributors are welcome.

Features

  • Removing noise in text
    • special characters
    • emails
    • numbers/phone numbers
    • emojis
  • Dealing with stopwords
  • Extracting emails, numbers, emojis, etc.
  • TextMetrics: word statistics
  • Normalizing text

Getting Started

Installation

  • using pip
pip install neattext

Usage

NeatText can be used through either an object-oriented approach or a functional/method-oriented approach.

Usage via the OOP Way (Object-Oriented Way)

NeatText comes with three main classes for cleaning and preprocessing text:

TextCleaner: For cleaning text by either removing or replacing specific noise, e.g. emails, special characters, numbers, URLs, emojis

TextExtractor: For extracting certain terms from a text or document

TextMetrics: For checking basic word statistics or metrics such as the counts of vowels, consonants, stopwords, etc.

>>> from neattext import TextCleaner,TextExtractor,TextMetrics
>>> docx = TextCleaner()
>>> docx.text = "your text goes here"
>>> docx.clean_text()

Usage via the MOP (Method/Function-Oriented Way)

If you are a fan of functions, you can also use NeatText that way. In that case, import the functions like this:

>>> from neattext.neattext import remove_emails,remove_emojis,clean_text

General Usage (OOP way)

Clean Text

  • Clean text by removing emails, numbers, stopwords, emojis, etc.
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.clean_text()

Remove Emails, Stopwords, Numbers, Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emails()
'This is the mail  ,our WEBSITE is https://example.com 😊.'
>>> docx.remove_stopwords()
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()
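
To make the idea of stopword removal concrete, here is a minimal standard-library sketch. It is not NeatText's implementation: the `STOPWORDS` set and the `drop_stopwords` helper are hypothetical names invented for illustration, and a real stopword list would be much longer.

```python
# Hypothetical, deliberately tiny stopword list (an assumption, not NeatText's list).
STOPWORDS = {"a", "an", "is", "of", "our", "the"}


def drop_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    kept = [token for token in text.split() if token.lower() not in STOPWORDS]
    return " ".join(kept)


print(drop_stopwords("This is the mail of the day"))  # This mail day
```

Because tokens are split on whitespace only, a token such as ",our" is kept: the attached comma prevents it from matching the stopword "our".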

Remove Special Characters

>>> docx.remove_special_characters()

Remove Emojis

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emojis()
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
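
For readers curious how emoji stripping can work in principle, the sketch below uses a regular expression over a few Unicode ranges. It is an assumption-laden illustration, not NeatText's code: `EMOJI_RE` and `strip_emojis` are hypothetical names, and the chosen ranges cover only part of what Unicode classifies as emoji.

```python
import re

# Partial coverage only (an assumption): pictographs and emoticons,
# miscellaneous symbols and dingbats, and the emoji variation selector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def strip_emojis(text: str) -> str:
    """Delete every character that falls inside the ranges above."""
    return EMOJI_RE.sub("", text)


print(strip_emojis("Our website is https://example.com \U0001F60A."))
```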

Replace Emails, Numbers, Phone Numbers

>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
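
Replacing noise, as opposed to removing it, usually means substituting a placeholder token for each match. The following is a rough standard-library sketch of that idea, not NeatText's implementation: the patterns are deliberately simple, and the `replace_noise` helper and the `<EMAIL>`, `<URL>`, `<NUMBER>` tokens are assumptions made for this example.

```python
import re

# Simplified illustrative patterns (assumptions, not NeatText's own).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def replace_noise(text: str) -> str:
    """Swap emails, URLs and numbers for placeholder tokens, in that order."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    text = NUMBER_RE.sub("<NUMBER>", text)
    return text


sample = "Mail example@gmail.com or visit https://example.com before 2024."
print(replace_noise(sample))
# Mail <EMAIL> or visit <URL> before <NUMBER>.
```

Numbers are handled last so that digits inside an email address or URL are already hidden behind a placeholder by the time the number pattern runs.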

Using TextExtractor

  • To extract emails, phone numbers, numbers, URLs, and emojis from text
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>> docx.extract_emojis()
['😊']
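
Extraction is the mirror image of removal: instead of deleting the matches, collect them. A minimal sketch with the standard `re` module is shown below. The `extract_terms` helper and its patterns are hypothetical and simplified; they are not taken from NeatText.

```python
import re

# Simplified illustrative patterns (assumptions, not NeatText's own).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s,]+")


def extract_terms(text: str) -> dict:
    """Return every email address and URL found in the text, grouped by kind."""
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
    }


text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
print(extract_terms(text))
# {'emails': ['example@gmail.com'], 'urls': ['https://example.com']}
```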

Using TextMetrics

  • To find word stats such as counts of vowels, consonants, and stopwords, plus overall word statistics
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
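
The return values of these methods are not shown here, so the sketch below only illustrates what a vowel/consonant count can look like. It is a standard-library approximation under stated assumptions: `letter_stats` is a hypothetical helper, only ASCII letters are counted, and the dictionary it returns is not necessarily the shape NeatText uses.

```python
from collections import Counter

VOWELS = set("aeiou")


def letter_stats(text: str) -> dict:
    """Count ASCII vowels and consonants, ignoring digits, symbols and spaces."""
    counts = Counter()
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            counts["vowels" if ch in VOWELS else "consonants"] += 1
    return dict(counts)


print(letter_stats("NeatText"))
```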

General Usage (Functional Way)

  • The MOP (method/function-oriented) way
>>> from neattext.neattext import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,True)
'this is the mail <email> ,our website is <url> .'
>>> extract_emails(t1)
['example@gmail.com']
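
One practical benefit of the functional style is that cleaning steps compose into a single reusable callable. The sketch below shows the pattern with plain standard-library stand-ins; `strip_emails`, `collapse_spaces`, and `pipeline` are hypothetical helpers written for this example, not NeatText functions, although any function that takes a string and returns a string could be slotted in the same way.

```python
import re
from functools import reduce


def strip_emails(text: str) -> str:
    """Crude stand-in: remove any whitespace-delimited token containing '@'."""
    return re.sub(r"\S+@\S+", "", text)


def collapse_spaces(text: str) -> str:
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def pipeline(*steps):
    """Build one function that applies each step in order, left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


clean = pipeline(strip_emails, str.lower, collapse_spaces)
print(clean("Mail  ME at a@b.com  now"))  # mail me at now
```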

API Reference

You can check out the API Reference here

Thanks For Using NeatText

Let us know about any bugs and ways we can improve it.

Jesus Saves