Predicting Authors of Bible Passages with Machine Learning (Author Attribution)
Machine learning and AI have changed the way we do things in this era, and almost every industry can benefit from ML. In this article we will see how to use ML to predict the author of a Bible passage. It builds on our previous tutorial on predicting the location of Bible verses (whether they are in the Old Testament or the New Testament).
We will use machine learning to do author attribution of Bible passages. The same idea has several applications, such as:
- Identifying authors (author attribution)
- Detecting plagiarism
- etc
We will use scikit-learn for the ML text classification and pandas for the data pre-processing.
The main steps are:
- Prepare the data set using pandas.
- Map each book to its author using a dictionary and the pandas map method.
- Convert the text to vectors with CountVectorizer or TfidfVectorizer from scikit-learn.
- Train a Naive Bayes or Logistic Regression classifier to do the classification.
Let us see the code below.
# Load EDA Pkgs
import pandas as pd
import numpy as np
# Load ML Pkgs
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load Dataset
df = pd.read_csv("kjvdata.csv")
df.head()
# Authors List
author_list = {
    "Genesis": "Moses",
    "Exodus": "Moses",
    "Leviticus": "Moses",
    "Numbers": "Moses",
    "Deuteronomy": "Moses",
    "Joshua": "Joshua",
    "Judges": "Samuel, Nathan, Gad",
    "Ruth": "Samuel, Nathan, Gad",
    "1 Samuel (1 Kings)": "Samuel, Nathan, Gad",
    "2 Samuel (2 Kings)": "Samuel, Nathan, Gad",
    "1 Kings (3 Kings)": "Jeremiah",
    "2 Kings (4 Kings)": "Jeremiah",
    "1 Chronicles": "Ezra",
    "2 Chronicles": "Ezra",
    "Ezra": "Ezra",
    "Nehemiah": "Nehemiah, Ezra",
    "Esther": "Mordecai",
    "Job": "Job, Moses",
    "Psalms": "David, Asaph, Ezra, the sons of Korah, Heman, Ethan, Moses",
    "Proverbs": "Solomon, Agur (30) and Lemuel (31)",
    "Ecclesiastes": "Solomon",
    "Song of Solomon (Canticles)": "Solomon",
    "Isaiah": "Isaiah",
    "Jeremiah": "Jeremiah",
    "Lamentations": "Jeremiah",
    "Ezekiel": "Ezekiel",
    "Daniel": "Daniel",
    "Hosea": "Hosea",
    "Joel": "Joel",
    "Amos": "Amos",
    "Obadiah": "Obadiah",
    "Jonah": "Jonah",
    "Micah": "Micah",
    "Nahum": "Nahum",
    "Habakkuk": "Habakkuk",
    "Zephaniah": "Zephaniah",
    "Haggai": "Haggai",
    # Leading space removed from the label so it does not form a separate class spelling
    "Zechariah": "Zechariah",
    "Malachi": "Malachi",
    "Matthew": "Matthew",
    "Mark": "John Mark",
    "Luke": "Luke",
    "John": "John, the Apostle",
    "Acts": "Luke",
    "Romans": "Paul",
    "1 Corinthians": "Paul",
    "2 Corinthians": "Paul",
    "Galatians": "Paul",
    "Ephesians": "Paul",
    "Philippians": "Paul",
    "Colossians": "Paul",
    "1 Thessalonians": "Paul",
    "2 Thessalonians": "Paul",
    "1 Timothy": "Paul",
    "2 Timothy": "Paul",
    "Titus": "Paul",
    "Philemon": "Paul",
    "Hebrews": "Paul, Luke, Barnabas, Apollos",
    "James": "James the brother of Jesus and Jude (not the Apostle, brother of John).",
    "1 Peter": "Peter",
    "2 Peter": "Peter",
    "1 John": "John, the Apostle",
    "2 John": "John, the Apostle",
    "3 John": "John, the Apostle",
    "Jude": "Jude, the brother of Jesus",
    "Revelation": "John, the Apostle",
}
df['author'] = df['book'].map(author_list)
df['author'].head()
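The pandas map method returns NaN for any book name that is missing from the dictionary, and rows with missing labels are best handled before training. A quick check along the following lines can catch spelling mismatches between the CSV and the dictionary. This is only a minimal sketch on a made-up table: toy_df, toy_authors, the placeholder texts and the unmapped book "Tobit" are illustrative assumptions, not part of kjvdata.csv.

```python
import pandas as pd

# Made-up stand-in for the verse table (the real data comes from kjvdata.csv)
toy_df = pd.DataFrame({
    "book": ["Genesis", "Romans", "Tobit", "Genesis"],
    "text": ["placeholder one", "placeholder two", "placeholder three", "placeholder four"],
})
# Hypothetical, shortened version of the book-to-author dictionary
toy_authors = {"Genesis": "Moses", "Romans": "Paul"}

toy_df["author"] = toy_df["book"].map(toy_authors)

# Books with no dictionary entry end up with a NaN author
unmapped_books = sorted(toy_df.loc[toy_df["author"].isna(), "book"].unique())
print("Unmapped books:", unmapped_books)

# Drop unlabelled rows and look at how many verses each author has
labelled_df = toy_df.dropna(subset=["author"])
author_counts = labelled_df["author"].value_counts()
print(author_counts)
```

The per-author counts are also worth a look on the real data, since authors with only a handful of verses are likely to be hard to predict.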
# Features
Xfeatures = df['text']
ylabels = df['author']
# Vectorization
cv = CountVectorizer()
X = cv.fit_transform(Xfeatures)
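The steps listed earlier mention TfidfVectorizer as an alternative to CountVectorizer. The two share the same interface, so swapping one for the other is a one-line change. Here is a small sketch on three short example strings (toy_verses is an assumption for illustration, not the full data set) showing that both produce a matrix of the same shape, with TF-IDF re-weighting and normalising the counts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three short example strings standing in for df['text']
toy_verses = [
    "In the beginning God created the heaven and the earth",
    "Jesus wept",
    "The Lord is my shepherd; I shall not want",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

X_counts = count_vec.fit_transform(toy_verses)  # raw word counts
X_tfidf = tfidf_vec.fit_transform(toy_verses)   # counts re-weighted by TF-IDF

# Same vocabulary, so the two matrices have the same shape
print(X_counts.shape, X_tfidf.shape)

# With the default settings each TF-IDF row is scaled to unit length
first_row_norm = X_tfidf[0].multiply(X_tfidf[0]).sum() ** 0.5
print(first_row_norm)
```

Whether TF-IDF helps on this task is something to measure rather than assume; both vectorizers can be tried with the same classifiers.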
# Split dataset
x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.33,random_state=42)
# Shape of x_train
x_train.shape
# Building Our Model
clf = MultinomialNB()
clf.fit(x_train,y_train)
print("Accuracy of Model :",clf.score(x_test,y_test))
accuracy_score(y_test,clf.predict(x_test))
# Logistic Regression
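# A higher max_iter gives the solver more room in case the default of 100 iterations does not converge
logit = LogisticRegression(max_iter=1000)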
logit.fit(x_train,y_train)
print("Accuracy of Logit Model :",logit.score(x_test,y_test))
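The code above fits CountVectorizer on the full data set before splitting, so the vocabulary also reflects the test verses. One possible alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vocabulary is learned from the training part only. The sketch below uses made-up word lists rather than real verses; toy_texts, toy_labels and pipe are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for df['text'] and df['author']
toy_texts = [
    "altar offering tabernacle priest",
    "offering lamb altar wilderness",
    "tabernacle covenant commandment",
    "wilderness priest covenant lamb",
    "grace faith gospel brethren",
    "gospel church faith apostle",
    "brethren grace church epistle",
    "apostle epistle faith grace",
]
toy_labels = ["Moses"] * 4 + ["Paul"] * 4

# Split the raw text first, keeping the author proportions in both parts
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training texts only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_texts, train_labels)

preds = pipe.predict(test_texts)
toy_accuracy = pipe.score(test_texts, test_labels)
print(preds, toy_accuracy)
```

A pipeline also accepts raw strings at prediction time, so there is no separate transform step to remember.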
# Prediction
sample_verse1 = ["And after these things I heard a great voice of much people in heaven, saying, Alleluia; Salvation, and glory, and honour, and power, unto the Lord our God:For true and righteous are his judgments: for he hath judged the great whore, which did corrupt the earth with her fornication, and hath avenged the blood of his servants at her hand."]
# Book/author: Revelation / John, the Apostle
sample_verse2 = ["Now in the first year of Cyrus king of Persia, that the word of the Lord by the mouth of Jeremiah might be fulfilled, the Lord stirred up the spirit of Cyrus king of Persia, that he made a proclamation throughout all his kingdom, and put it also in writing, saying,Thus saith Cyrus king of Persia, The Lord God of heaven hath given me all the kingdoms of the earth; and he hath charged me to build him an house at Jerusalem, which is in Judah.Who is there among you of all his people? his God be with him, and let him go up to Jerusalem, which is in Judah, and build the house of the Lord God of Israel, (he is the God,) which is in Jerusalem"]
# Book/author: Ezra / Ezra
sample_verse3 = ["Jesus wept"]
# Book/author: John / John, the Apostle
# Vectorize them before predicting
vect = cv.transform(sample_verse1).toarray()
# Shows how the vectors are
vect
# Single Prediction
clf.predict(vect)
logit.predict(vect)
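Several passages can also be predicted in one call by passing them as a single list. MultinomialNB and LogisticRegression both accept the sparse matrix returned by transform, so the toarray step is optional. The following is a small sketch with a toy model trained on made-up word lists; toy_cv, toy_clf and the sample strings are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for the KJV verses
toy_texts = [
    "altar offering tabernacle",
    "priest offering altar",
    "grace faith gospel",
    "faith gospel brethren",
]
toy_labels = ["Moses", "Moses", "Paul", "Paul"]

toy_cv = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

# Vectorize all samples together and predict them in one call
samples = ["the priest built an altar", "faith and grace"]
batch_preds = toy_clf.predict(toy_cv.transform(samples))

for text, author in zip(samples, batch_preds):
    print(author, "<-", text)
```

With the real model, the same pattern would apply to a list built from the three sample verses above.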
Our predictions using both models proved accurate. Let us try the model on different data, i.e. a different Bible version such as the NIV.
# Using NIV
sample_verse4 = ["Therefore, I urge you, brothers and sisters, in view of God’s mercy, to offer your bodies as a living sacrifice, holy and pleasing to God—this is your true and proper worship. 2 Do not conform to the pattern of this world, but be transformed by the renewing of your mind. Then you will be able to test and approve what God’s will is—his good, pleasing and perfect will."]
vect4 = cv.transform(sample_verse4).toarray()
# Single Prediction
clf.predict(vect4)
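A modern translation uses many words that never occur in the KJV, and CountVectorizer silently ignores words outside its training vocabulary. Before trusting a prediction on such text, it can help to check how many tokens the model actually recognises and how the predicted probabilities are spread. The sketch below shows one way to do this on a toy model; the training lists and modern_sample are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data standing in for the KJV verses
toy_texts = [
    "altar offering tabernacle",
    "priest offering altar",
    "grace faith gospel",
    "faith gospel brethren",
]
toy_labels = ["Moses", "Moses", "Paul", "Paul"]

toy_cv = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_cv.fit_transform(toy_texts), toy_labels)

modern_sample = ["faith and worship in a modern translation"]

# Tokenize the same way the vectorizer does and keep only known words
tokens = toy_cv.build_analyzer()(modern_sample[0])
known = [t for t in tokens if t in toy_cv.vocabulary_]
print(f"{len(known)} of {len(tokens)} tokens are in the training vocabulary")

# Class probabilities, highest first
probs = toy_clf.predict_proba(toy_cv.transform(modern_sample))[0]
for author, p in sorted(zip(toy_clf.classes_, probs), key=lambda pair: -pair[1]):
    print(f"{author}: {p:.3f}")
```

A prediction that rests on only a few recognised words, or on probabilities that are close together, deserves less confidence than one made on text from the training version.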