Predicting the Location of Bible Verses Using ML with Python

In this tutorial we will learn how to predict whether a particular Bible verse or passage comes from the Old Testament or the New Testament using machine learning. This is a supervised machine learning problem: we have a set of features and a target label.

The features will be built from the text of the Bible verses, and the target label will be 0 for the Old Testament and 1 for the New Testament.

Since we are dealing with text documents, we want a machine learning algorithm that works well for both text classification and binary classification problems.

We will use a Naive Bayes classifier to build our model, since it performs well on text. First we need to convert the text into word vectors, using either scikit-learn's CountVectorizer or its TfidfVectorizer (term frequency-inverse document frequency).
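To make "word vectors" concrete, here is a minimal sketch of what CountVectorizer does to a couple of short strings. The two strings are made up for illustration and are not rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented toy strings (not verses from the KJV file)
docs = [
    "the light was good and the light was day",
    "and the darkness was night",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# The learned vocabulary; columns of the matrix follow this alphabetical order
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
# ['and', 'darkness', 'day', 'good', 'light', 'night', 'the', 'was']

print(counts.toarray())
# [[1 0 1 1 2 0 2 2]
#  [1 1 0 0 0 1 1 1]]
```

Each row counts how often every vocabulary word occurs in one document. These counts are the features the classifier will see.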

Requirements

Python 3.x
pandas
scikit-learn
Our dataset: the King James Version (KJV) of the Bible

Let us start

In [9]:

# Load EDA Packages
import pandas as pd

In [10]:

# Load ML Packages
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [11]:

# Load Dataset
df = pd.read_csv("kjv_cleandata1.csv")

In [12]:

df.head()

Out[12]:

	Unnamed: 0	id	book	chapter	verse	text
0	0	1001001	Genesis	1	1	In the beginning God created the heaven and th…
1	1	1001002	Genesis	1	2	And the earth was without form, and void; and …
2	2	1001003	Genesis	1	3	And God said, Let there be light: and there wa…
3	3	1001004	Genesis	1	4	And God saw the light, that it was good: and G…
4	4	1001005	Genesis	1	5	And God called the light Day, and the darkness…

In [13]:

# EDA
df.columns

Out[13]:

Index(['Unnamed: 0', 'id', 'book', 'chapter', 'verse', 'text'], dtype='object')

In [14]:

df.shape

Out[14]:

(31103, 6)

In [15]:

# Check for missing values
df.isnull().sum()

Out[15]:

Unnamed: 0    0
id            0
book          0
chapter       0
verse         0
text          0
dtype: int64

In [23]:

# Length (in characters) of the longest verse
df.text.str.len().max()

Out[23]:

In [24]:

# Row index of the longest verse
df.text.str.len().idxmax()

Out[24]:

12826

In [25]:

df.loc[12826]

Out[25]:

Unnamed: 0                                                12826
id                                                     17008009
book                                                     Esther
chapter                                                       8
verse                                                         9
text          Then were the king's scribes called at that ti...
Name: 12826, dtype: object

In [26]:

df.loc[12826].text

Out[26]:

"Then were the king's scribes called at that time in the third month, that is, the month Sivan, on the three and twentieth day thereof; and it was written according to all that Mordecai commanded unto the Jews, and to the lieutenants, and the deputies and rulers of the provinces which are from India unto Ethiopia, an hundred twenty and seven provinces, unto every province according to the writing thereof, and unto every people after their language, and to the Jews according to their writing, and according to their language."

Model Building

- Label all Old Testament verses as 0
- Label all New Testament verses as 1

In [27]:

# Work on a copy so the original DataFrame is left unchanged
df2 = df.copy()

In [28]:

df2.loc[0:23144,'label'] = 0

In [30]:

df2.loc[23145:,'label'] = 1
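Hard-coded row positions only work while the file keeps its current order. As an alternative sketch, the label could be derived from the id column. This assumes that id encodes book number, chapter and verse as book * 1,000,000 + chapter * 1,000 + verse, which is consistent with the rows shown above (1001001 for Genesis 1:1, 17008009 for Esther 8:9) but is not confirmed for the whole file. The toy frame below is hypothetical, and the Matthew and Revelation ids are constructed from that assumption.

```python
import pandas as pd

# Hypothetical toy frame mimicking the id/book columns of the dataset
toy = pd.DataFrame({
    "id": [1001001, 17008009, 40001001, 66022021],
    "book": ["Genesis", "Esther", "Matthew", "Revelation"],
})

# Assumption: id = book_number * 1_000_000 + chapter * 1_000 + verse
book_number = toy["id"] // 1_000_000

# Books 1-39 form the Old Testament (0), books 40-66 the New Testament (1)
toy["label"] = (book_number >= 40).astype(int)
print(toy)
```

If the assumption holds for the real file, the same two lines applied to df2 should give the same labels as the index-based assignment, without depending on row order.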

In [31]:

df2.head()

Out[31]:

	Unnamed: 0	id	book	chapter	verse	text	label
0	0	1001001	Genesis	1	1	In the beginning God created the heaven and th…	0.0
1	1	1001002	Genesis	1	2	And the earth was without form, and void; and …	0.0
2	2	1001003	Genesis	1	3	And God said, Let there be light: and there wa…	0.0
3	3	1001004	Genesis	1	4	And God saw the light, that it was good: and G…	0.0
4	4	1001005	Genesis	1	5	And God called the light Day, and the darkness…	0.0

In [32]:

# index=False avoids writing another unnamed index column to the file
df2.to_csv("kjv2mindata.csv", index=False)

In [33]:

Xfeatures = df2['text']
y = df2['label']

In [34]:

# Feature Extraction 
cv = CountVectorizer()
X = cv.fit_transform(Xfeatures)
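The introduction also mentions TF-IDF as an option. A minimal sketch of swapping in scikit-learn's TfidfVectorizer is shown below, using invented toy strings rather than the dataset. Whether it improves on raw counts for this task has not been measured here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy strings (not verses from the KJV file)
docs = [
    "the light was good and the light was day",
    "and the darkness was night",
    "the king called the scribes",
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)  # same interface as CountVectorizer

# By default every row is scaled to unit L2 length
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)).ravel())
print(X_tfidf.shape)
print(row_norms)
```

Because the interface is the same, replacing CountVectorizer() with TfidfVectorizer() in the cell above is the only change needed to try it.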

In [35]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
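In the cells above, the CountVectorizer is fitted on every verse before the split, so its vocabulary also includes words that occur only in test verses. One way to keep the test set fully unseen is to split the raw text first and let a pipeline fit the vectorizer on the training part only. The sketch below shows that alternative workflow. The toy strings and labels are made up for illustration and are not rows from the KJV dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-ins for df2['text'] and df2['label']
texts = [
    "the king of israel went up to the house of the lord",
    "and moses spake unto the children of israel",
    "the priests brought the ark into the temple",
    "and david reigned over all israel",
    "jesus said unto his disciples follow me",
    "paul an apostle of jesus christ to the church",
    "the gospel was preached unto the gentiles",
    "peter and john went up into the temple to pray",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, keeping the class ratio in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training text only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

print(model.score(text_test, y_test))
print(model.predict(["and the disciples followed jesus"]))
```

A pipeline also accepts raw strings at prediction time, so there is no separate transform step to remember.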

In [36]:

# Naive Bayes Classifier (MultinomialNB was imported with the ML packages above)
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

Out[36]:

0.9158222915042868

In [37]:

# Accuracy of our model on the test set
print("Accuracy of Model",clf.score(X_test,y_test)*100,"%")

Accuracy of Model 91.58222915042869 %

In [38]:

# Accuracy of our model on the training set
print("Accuracy of Model",clf.score(X_train,y_train)*100,"%")

Accuracy of Model 93.61773597581458 %
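Accuracy alone can flatter a model when the classes are unbalanced. With the labels assigned above, 23,145 of the 31,103 rows are Old Testament, so a model that always answers 0 would already score about 74%. The sketch below computes that baseline and shows how a confusion matrix breaks results down per class. The y_true and y_pred lists are hypothetical values chosen to demonstrate the call, not results from the model.

```python
from sklearn.metrics import confusion_matrix

# Counts taken from the tutorial: df.shape and the rows labelled 0
n_total = 31103
n_old = 23145

# Accuracy of always predicting "Old Testament"
baseline = n_old / n_total
print(round(baseline, 3))  # 0.744

# Hypothetical labels and predictions, only to illustrate the API
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 1]]
```

On the real data, confusion_matrix(y_test, clf.predict(X_test)) would show how many New Testament verses are mistaken for Old Testament ones and vice versa.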

Predicting A Text

Whether therefore ye eat, or drink, or whatsoever ye do, do all to the glory of God. (1 Corinthians 10:31)

In [39]:

# Sample 1 prediction (1 Corinthians 10:31)
sample_verse = ["Whether therefore ye eat, or drink, or whatsoever ye do, do all to the glory of God"]
vect = cv.transform(sample_verse).toarray()

In [40]:

# Old Testament is 0, New Testament is 1
clf.predict(vect)

Out[40]:

array([1.])

In [41]:

# Sample 2 prediction (Isaiah 41:10)
sample_verse2 = ["Fear thou not; for I am with thee: be not dismayed; for I am thy God: I will strengthen thee; yea, I will help thee; yea, I will uphold thee with the right hand of my righteousness."]

In [42]:

vect2 = cv.transform(sample_verse2).toarray()

In [43]:

clf.predict(vect2)

Out[43]:

array([0.])
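The two predictions above repeat the same transform-then-predict steps. A small helper can wrap them and also report how confident the classifier is, using predict_proba. This is a sketch under stated assumptions: predict_testament and NAMES are hypothetical names introduced here, and the tiny training set is invented so the block runs on its own. In the notebook, the fitted cv and clf would be used instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy training data so the example is self-contained
train_texts = [
    "the king of israel went up to the house of the lord",
    "and moses spake unto the children of israel",
    "jesus said unto his disciples follow me",
    "paul an apostle of jesus christ to the church",
]
train_labels = [0, 0, 1, 1]

cv = CountVectorizer()
clf = MultinomialNB().fit(cv.fit_transform(train_texts), train_labels)

# Hypothetical mapping from label to a readable name
NAMES = {0: "Old Testament", 1: "New Testament"}


def predict_testament(verse):
    """Return (testament name, probability) for a single verse string."""
    vect = cv.transform([verse])           # must use the fitted vectorizer
    proba = clf.predict_proba(vect)[0]     # one probability per class
    label = int(clf.classes_[proba.argmax()])
    return NAMES[label], float(proba.max())


name, p = predict_testament("and jesus said unto them")
print(name, round(p, 3))
```

Returning the probability makes borderline verses easy to spot, since a value close to 0.5 means the model is nearly undecided.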

Save Model

In [44]:

# sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
import joblib

In [45]:

biblepredictionNV_model = open("biblepredNV_model.pkl","wb")

joblib.dump(clf,biblepredictionNV_model)

In [46]:

biblepredictionNV_model.close()
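The saved file contains only the classifier. To predict on new text in another session, the fitted CountVectorizer is needed as well, because it holds the vocabulary the model was trained on. The sketch below saves and reloads both objects with joblib. The file names and the toy training data are assumptions for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data so the example runs on its own
texts = [
    "and moses spake unto the children of israel",
    "and david reigned over all israel",
    "jesus said unto his disciples follow me",
    "paul an apostle of jesus christ to the church",
]
labels = [0, 0, 1, 1]

cv = CountVectorizer()
clf = MultinomialNB().fit(cv.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names
    model_path = os.path.join(tmp, "model.pkl")
    vect_path = os.path.join(tmp, "vectorizer.pkl")

    # joblib.dump accepts a path directly, so no manual open/close is needed
    joblib.dump(clf, model_path)
    joblib.dump(cv, vect_path)

    loaded_clf = joblib.load(model_path)
    loaded_cv = joblib.load(vect_path)

# The reloaded pair should reproduce the original predictions
sample = ["and the disciples followed jesus"]
original = clf.predict(cv.transform(sample))
restored = loaded_clf.predict(loaded_cv.transform(sample))
same = bool((original == restored).all())
print(same)
```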

Download the Full Code here

You can also check the video tutorial here

Thanks For Reading

Jesus Saves
