NeatText (neattext) is a simple Python NLP package for cleaning textual data and preprocessing text in NLP and ML projects. It was designed to solve the problems listed below.
Problem Neattext is intended to solve
- Cleaning unstructured text data
- Reducing noise (special characters, stopwords)
- Reducing the repetition of writing the same text preprocessing code
The NeatText project is maintained by @jcharis, and contributions are welcome.
Features
- Removing noise in text
  - special characters
  - emails
  - numbers/phone numbers
  - emojis
- Dealing with stopwords
- Extracting emails, numbers, emojis, etc.
- TextMetrics: word statistics
- Normalizing text
Installation
- using pip
pip install neattext
Usage
NeatText can be used either through an object-oriented approach or through a functional (method-oriented) approach.
Usage via the OOP Way (Object-Oriented Way)
NeatText comes with 3 main classes for cleaning and preprocessing text:
- TextCleaner: for cleaning text by either removing or replacing specific noise, e.g. emails, special characters, numbers, URLs, emojis
- TextExtractor: for extracting certain terms from a text or document
- TextMetrics: for checking basic word statistics or metrics such as the count of vowels, consonants, stopwords, etc.
>>> from neattext import TextCleaner,TextExtractor,TextMetrics
>>> docx = TextCleaner()
>>> docx.text = "your text goes here"
>>> docx.clean_text()
Usage via the MOP (Method/Function-Oriented Way)
If you are a fan of functions, you can also use NeatText in that style. In that case, import the functions you need like this:
>>> from neattext.neattext import remove_emails,remove_emojis,clean_text
General Usage (OOP way)
Clean Text
- Clean text by removing emails, numbers, stopwords, emojis, etc.
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.clean_text()
Remove Emails, Stopwords, Numbers, Phone Numbers
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emails()
'This is the mail ,our WEBSITE is https://example.com 😊.'
>>> docx.remove_stopwords()
'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()
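To make the idea behind these removal methods concrete, here is a minimal standard-library sketch. It is not NeatText's implementation: the email pattern, the tiny stopword set, and the names `strip_emails` and `strip_stopwords` are all assumptions made for illustration.

```python
import re

# Assumed pattern: a deliberately simple email regex, not NeatText's own.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed stopword list: a tiny illustrative subset.
STOPWORDS = {"is", "the", "a", "an", "of", "and"}


def strip_emails(text: str) -> str:
    """Return a copy of text with email-like substrings removed."""
    return EMAIL_RE.sub("", text)


def strip_stopwords(text: str) -> str:
    """Return a copy of text without whitespace-separated stopwords."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
no_emails = strip_emails(sample)        # email gone, URL and emoji kept
no_stopwords = strip_stopwords(sample)  # "is" and "the" dropped
print(no_emails)
print(no_stopwords)
```

Each helper returns a new string and leaves its input untouched, which is why the stopword result still contains the email address.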
Remove Special Characters
>>> docx.remove_special_characters()
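As a rough illustration of what removing special characters can mean, the sketch below keeps only ASCII letters, digits, and whitespace. The pattern and the name `strip_special` are assumptions for this example; NeatText's own rules may differ.

```python
import re

# Assumed definition of "special": anything that is not an ASCII letter,
# digit, or whitespace character.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")


def strip_special(text: str) -> str:
    """Return a copy of text with special characters removed."""
    return SPECIAL_RE.sub("", text)


cleaned = strip_special("Hello, World! #1")
print(cleaned)
```

Note that a pattern this strict also drops emojis and accented letters, so it should be loosened for multilingual text.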
Remove Emojis
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.remove_emojis()
'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'
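Emoji removal is usually done by matching Unicode ranges. The sketch below shows that approach with a hand-picked, incomplete set of ranges. Both the ranges and the name `strip_emojis` are assumptions for illustration, not NeatText's actual pattern.

```python
import re

# Assumed, incomplete set of emoji code point ranges.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def strip_emojis(text: str) -> str:
    """Return a copy of text with characters in the ranges above removed."""
    return EMOJI_RE.sub("", text)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
no_emojis = strip_emojis(sample)
print(no_emojis)
```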
Replace Emails, Numbers, Phone Numbers
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()
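Replacing differs from removing in that each match is swapped for a placeholder token, so downstream models still see that something was there. Here is a minimal sketch of that idea. The regexes, the default tokens `<EMAIL>` and `<NUMBER>`, and the names `mask_emails` and `mask_numbers` are hypothetical and are not taken from NeatText.

```python
import re

# Assumed patterns, kept deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NUMBER_RE = re.compile(r"\d+")


def mask_emails(text: str, token: str = "<EMAIL>") -> str:
    """Replace every email-like substring with a placeholder token."""
    return EMAIL_RE.sub(token, text)


def mask_numbers(text: str, token: str = "<NUMBER>") -> str:
    """Replace every run of digits with a placeholder token."""
    return NUMBER_RE.sub(token, text)


masked_mail = mask_emails("This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊.")
masked_nums = mask_numbers("Order 66 shipped in 3 days")
print(masked_mail)
print(masked_nums)
```

When combining both, masking emails first avoids rewriting digits that are part of an address.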
Using TextExtractor
- Extract emails, phone numbers, numbers, URLs, and emojis from text
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
['example@gmail.com']
>>> docx.extract_emojis()
['😊']
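Extraction is the mirror image of removal: instead of substituting matches away, the matches are collected into a list. A minimal sketch with `re.findall` follows. The patterns and the names `find_emails` and `find_urls` are assumptions for illustration, not NeatText's implementation.

```python
import re

# Assumed patterns, kept deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s,]+")


def find_emails(text: str) -> list[str]:
    """Return all email-like substrings in order of appearance."""
    return EMAIL_RE.findall(text)


def find_urls(text: str) -> list[str]:
    """Return all http(s) URLs in order of appearance."""
    return URL_RE.findall(text)


sample = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
emails = find_emails(sample)
urls = find_urls(sample)
print(emails, urls)
```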
Using TextMetrics
- Find word statistics such as counts of vowels, consonants, and stopwords
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()
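The calls above are shown without their results, so the sketch below illustrates the kind of counting involved using only the standard library. The return shapes, the stopword subset, and the names `vowel_counts`, `consonant_total`, and `stopword_counts` are assumptions and may not match what TextMetrics returns.

```python
from collections import Counter

VOWELS = set("aeiou")
# Assumed stopword list: a tiny illustrative subset.
STOPWORDS = {"is", "the", "a", "an", "of", "and"}


def vowel_counts(text: str) -> dict[str, int]:
    """Count each ASCII vowel, ignoring case."""
    return dict(Counter(ch for ch in text.lower() if ch in VOWELS))


def consonant_total(text: str) -> int:
    """Count ASCII letters that are not vowels, ignoring case."""
    return sum(1 for ch in text.lower() if "a" <= ch <= "z" and ch not in VOWELS)


def stopword_counts(text: str) -> dict[str, int]:
    """Count each whitespace-separated stopword, ignoring case."""
    return dict(Counter(w for w in text.lower().split() if w in STOPWORDS))


print(vowel_counts("NeatText"))
print(consonant_total("NeatText"))
print(stopword_counts("This is the mail and the rest"))
```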
General Usage (Functional Way)
- The MOP (method/function-oriented) way
>>> from neattext.neattext import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,True)
'this is the mail <email> ,our website is <url> .'
>>> extract_emails(t1)
['example@gmail.com']
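One advantage of the functional style is that small cleaning functions compose into a pipeline. The sketch below chains three hypothetical steps with `functools.reduce`. The helpers `mask_emails`, `mask_urls`, `lowercase`, and `run_pipeline` are assumptions written for this example and are not NeatText functions.

```python
import re
from functools import reduce
from typing import Callable, Iterable

# Assumed patterns, kept deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://\S+")


def mask_emails(text: str) -> str:
    return EMAIL_RE.sub("<EMAIL>", text)


def mask_urls(text: str) -> str:
    return URL_RE.sub("<URL>", text)


def lowercase(text: str) -> str:
    return text.lower()


def run_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each step to the output of the previous one, left to right."""
    return reduce(lambda acc, step: step(acc), steps, text)


t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
result = run_pipeline(t1, [mask_emails, mask_urls, lowercase])
print(result)
```

Because lowercasing runs last here, the placeholder tokens are lowercased too. Reordering the steps changes the output, which is the main thing to watch when composing cleaners.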
Thanks for using NeatText