Natural Language Processing, or NLP for short, is a branch of artificial intelligence focused on understanding everyday human language, hence the term "natural" language.
In this article we will learn how to do basic text analysis in Julia.
So what is Julia? Julia is a next-generation programming language that is easy to learn like Python, yet powerful and fast like C. It is a high-level, high-performance dynamic language for numerical computing that harnesses multiple dispatch, which allows built-in and user-defined functions to be overloaded for different combinations of argument types.
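To make the multiple-dispatch idea concrete, here is a minimal sketch (the `describe` function and its methods are illustrative names, not part of any library): the same function name gets a different method for each combination of argument types.

```julia
# Multiple dispatch: Julia picks the method based on the
# types of *all* arguments, not just the first one.
describe(x::Number) = "a number: $x"
describe(x::String) = "a string: \"$x\""
describe(x::Number, y::Number) = "two numbers summing to $(x + y)"

println(describe(42))        # dispatches to the Number method
println(describe("Julia"))   # dispatches to the String method
println(describe(1, 2.5))    # dispatches to the two-argument method
```

User-defined types participate in dispatch on exactly the same footing as built-in ones, which is why overloading feels so natural in Julia.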
Let's start.
Since Julia is quite a young programming language (5+ years old), there are not yet many fully developed, stand-alone native libraries or packages for NLP. But don't underestimate the power of Julia: it can tap into the features of other programming languages such as Python, R, Java, and C through its foreign-call packages, e.g. PyCall for Python, RCall for R, and JavaCall for Java.
Hence we can still import fully developed NLP libraries such as NLTK or word2vec into Julia to do our natural language processing.
First of all, you will need to install these packages:
- Pkg.add("TextAnalysis")
- Pkg.clone("WordTokenizers")
- Pkg.add("PyCall") # Helps us to use Python packages
- Pkg.add("Conda") # Helps us to use conda (Anaconda) to download Python packages easily
using TextAnalysis
mystr = """The best error message is the one that never shows up.
You Learn More From Failure Than From Success.
The purpose of software engineering is to control complexity, not to create it"""
# Basic Way
sd1 = Document(mystr)
# Best Way
sd2 = StringDocument(mystr)
# Reading from a file
filepath = "samplefile.txt"
# Basic Way
filedoc = Document("samplefile.txt")
# Best Way
fd = FileDocument("samplefile.txt")
There are also:
- TokenDocument()
- NGramDocument()
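As a quick illustration of those two, here is a minimal sketch (assuming the TextAnalysis package installed above): both take a plain string, but store it in a different internal form.

```julia
using TextAnalysis

s = "The best error message is the one that never shows up."

# TokenDocument stores the text already split into word tokens
td = TokenDocument(s)

# NGramDocument stores n-gram counts instead of the raw text
nd = NGramDocument(s)
```

Both behave like the StringDocument above, the difference being that `tokens(td)` and `ngrams(nd)` return the pre-computed form directly.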
# Working With Our Document
text(sd1)
What language is it?
# Getting the Base Info About it
language(sd1)
Tokenization With TextAnalysis
- Word Tokens
- Sentence Tokens
text(sd1)
# Word Tokens from a String Document
tokens(sd1)
text(fd)
# Word Tokens from a File Document
tokens(fd)
Tokenization With WordTokenizers
- Word Tokens
- Sentence Tokens
using WordTokenizers
sd1
# Must convert from TextAnalysis Type to String Type
tokenize(text(sd1))
tokenize("Hello world this is Julia")
Sentence Tokenization
First, solve the problem. Then, write the code. Fix the cause, not the symptom. Simplicity is the soul of efficiency. Good design adds value faster than it adds cost. In theory, theory and practice are the same. In practice, they’re not. There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
# Read a file with sentences
sent_files = FileDocument("quotesfiles.txt")
text(sent_files)
# Sentence Tokenization
split_sentences(text(sent_files))
for sentence in split_sentences(text(sent_files))
    println(sentence)
end
for sentence in split_sentences(text(sent_files))
    wordtokens = tokenize(sentence)
    println("Word token=> $wordtokens")
end
N-Grams
- Combinations of multiple words
- Useful for creating features during language modeling
mystr
sd3 = StringDocument(mystr)
# Unigram
ngrams(sd3)
# Bigrams
ngrams(sd3,2)
# Trigram
for trigram in ngrams(sd3,3)
    println(trigram)
end
# Creating an NGram
my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                              "or" => 1, "not" => 1,
                              "to" => 1, "be..." => 1)
ngd = NGramDocument(my_ngrams)
# Detecting the n-gram order (complexity) of the document
ngram_complexity(ngd)
my_ngrams2 = Dict{AbstractString,Int64}(
"that never" => 1,"is to" => 1,"create" => 1,"that" => 1,"best" => 1,"Than From" => 1,"shows up." => 1,
"purpose" => 1,"of" => 1,"purpose of" => 1,"More" => 1,"to" => 2,"the one" => 1,
"is" => 2,"never" => 1,"complexity,"=> 1,"software" => 1,"one that" => 1)
ngd2 = NGramDocument(my_ngrams2)
ngram_complexity(ngd2)
Using Other Libraries for Performing NLP in Julia
First, you will need to use pip to install NLTK on your system:
- pip install nltk
Open your Python REPL and type the following:
- import nltk
- nltk.download()
A dialog box will pop up where you can select which NLTK modules to download.
After that, run the following in your Julia environment:
- using Conda
- Conda.add("nltk")
Part-of-Speech Tagging in Julia
- We will be using nltk.tag via PyCall for this task.
using PyCall
# Importing Part of Speech Tag from NLTK
@pyimport nltk.tag as ptag
# Using TextAnalysis to tokenize or WordTokenizer to do the same
ex = StringDocument("Julia is very fast but it is still young")
# TextAnalysis.tokens()
mytokens = tokens(ex)
# Using NLTK tags for finding the part of speech of our tokens
ptag.pos_tag(mytokens)
Word Inflection == word formation by adding to a base/root word
- Stemming (basics): stem!()
- Lemmatizing
How do we do these? PyCall to the rescue!
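As a quick preview of the stemming side, here is a minimal sketch using TextAnalysis's built-in stem!(), which replaces each token with its root form in place (the sample sentence is illustrative); lemmatizing is where PyCall and NLTK can step in, as shown with the tagger above.

```julia
using TextAnalysis

doc = StringDocument("The runners were running quickly")
# stem!() mutates the document, reducing tokens to their stems
stem!(doc)
println(text(doc))
```

Note that stemming is a crude, rule-based truncation, so the resulting stems (e.g. from "running") are not always dictionary words; that is precisely the gap lemmatizing fills.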
# Exploring what TextAnalysis exports
whos(TextAnalysis)
Stay tuned for more! Thanks for reading.