Predicting Authors of Bible Passages with Machine Learning (Author Attribution)

Machine learning and AI have changed the way we do things in this era, and almost every industry can benefit from ML. In this article we will see how to use ML to predict the author of a Bible passage. It builds on a previous tutorial on predicting the location of Bible verses (whether they are in the Old Testament or the New Testament).

We will use machine learning to do author attribution of Bible passages. So what are the applications of this idea? We can use it for

  • Identifying authors (author attribution)
  • Finding plagiarism
  • etc.

We will use scikit-learn for the text classification and pandas for the data preprocessing.

The main idea involves these steps:

  • Prepare our dataset using pandas.
  • Map each book to its respective author using the pandas map method.
  • Convert our text to vectors with CountVectorizer or TfidfVectorizer from scikit-learn.
  • Classify the vectors with a Naive Bayes or Logistic Regression classifier.
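Before the full notebook, the steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the four toy sentences and their labels are invented, and the names `toy_texts`, `toy_authors` and `toy_model` are hypothetical. It uses TfidfVectorizer, the alternative vectorizer mentioned above, chained to Naive Bayes with scikit-learn's `make_pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the verse/author data (invented for illustration only)
toy_texts = [
    "grace and peace to the church",
    "faith hope and charity abide",
    "the law was given in the wilderness",
    "the tabernacle in the wilderness of Sinai",
]
toy_authors = ["Paul", "Paul", "Moses", "Moses"]

# Chain the vectorizer and the classifier into a single estimator,
# so raw strings can be passed to fit() and predict() directly
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_authors)

toy_prediction = toy_model.predict(["peace and charity in the church"])
print(toy_prediction)
```

A pipeline like this is a convenience, not a requirement; the notebook below does the same steps one at a time so that each stage can be inspected.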

Let us walk through the code.

In [1]:
# Load EDA Pkgs
import pandas as pd
import numpy as np
In [2]:
# Load ML Pkgs
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
In [3]:
# Load Dataset
df = pd.read_csv("kjvdata.csv")
In [4]:
df.head()
Out[4]:
   Unnamed: 0       id     book  chapter  verse  text
0           0  1001001  Genesis        1      1  In the beginning God created the heaven and th…
1           1  1001002  Genesis        1      2  And the earth was without form, and void; and …
2           2  1001003  Genesis        1      3  And God said, Let there be light: and there wa…
3           3  1001004  Genesis        1      4  And God saw the light, that it was good: and G…
4           4  1001005  Genesis        1      5  And God called the light Day, and the darkness…
In [5]:
# Authors List
author_list = {"Genesis": "Moses",
"Exodus": "Moses",
"Leviticus": "Moses",
"Numbers": "Moses",
"Deuteronomy": "Moses",
"Joshua": "Joshua",
"Judges": "Samuel, Nathan, Gad",
"Ruth": "Samuel, Nathan, Gad",
"1 Samuel (1 Kings)": "Samuel, Nathan, Gad",
"2 Samuel (2 Kings)": "Samuel, Nathan, Gad",
"1 Kings (3 Kings)": "Jeremiah",
"2 Kings (4 Kings)": "Jeremiah",
"1 Chronicles": "Ezra",
"2 Chronicles": "Ezra",
"Ezra": "Ezra",
"Nehemiah": "Nehemiah, Ezra",
"Esther": "Mordecai",
"Job": "Job,Moses",
"Psalms": "David,Asaph, Ezra, the sons of Korah, Heman, Ethan, Moses",
"Proverbs": "Solomon ,Agur(30) and Lemuel(31)",
"Ecclesiastes": "Solomon",
"Song of Solomon (Canticles)": "Solomon",
"Isaiah": "Isaiah",
"Jeremiah": "Jeremiah",
"Lamentations": "Jeremiah",
"Ezekiel": "Ezekiel",
"Daniel": "Daniel",
"Hosea": "Hosea",
"Joel": "Joel",
"Amos": "Amos",
"Obadiah": "Obadiah",
"Jonah": "Jonah",
"Micah": "Micah",
"Nahum": "Nahum",
"Habakkuk": "Habakkuk",
"Zephaniah": "Zephaniah",
"Haggai": "Haggai",
"Zechariah": "Zechariah",
"Malachi": "Malachi",
"Matthew": "Matthew",
"Mark": "John Mark",
"Luke": "Luke",
"John": "John, the Apostle",
"Acts": "Luke",
"Romans": "Paul",
"1 Corinthians": "Paul",
"2 Corinthians": "Paul",
"Galatians": "Paul",
"Ephesians": "Paul",
"Philippians": "Paul",
"Colossians": "Paul",
"1 Thessalonians": "Paul",
"2 Thessalonians": "Paul",
"1 Timothy": "Paul",
"2 Timothy": "Paul",
"Titus": "Paul",
"Philemon": "Paul",
"Hebrews": "Paul, Luke, Barnabas, Apollos",
"James": "James the brother of Jesus and Jude (not the Apostle, brother of John).",
"1 Peter": "Peter",
"2 Peter": "Peter",
"1 John": "John, the Apostle",
"2 John": "John, the Apostle",
"3 John": "John, the Apostle",
"Jude": "Jude, the brother of Jesus",
"Revelation": "John, the Apostle"}
In [7]:
df['author'] = df['book'].map(author_list)
In [8]:
df['author'].head()
Out[8]:
0    Moses
1    Moses
2    Moses
3    Moses
4    Moses
Name: author, dtype: object
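One thing worth checking after the mapping step is whether every book name in the data actually appears as a key in the dictionary, because `map` returns NaN for any value it cannot find. The sketch below shows the check on a toy frame; `toy_df`, `toy_author_map` and the book "Tobit" are made up for illustration and are not taken from kjvdata.csv.

```python
import pandas as pd

# Toy frame; "Tobit" is a made-up example of a book missing from the mapping
toy_df = pd.DataFrame({"book": ["Genesis", "Romans", "Tobit"]})
toy_author_map = {"Genesis": "Moses", "Romans": "Paul"}

toy_df["author"] = toy_df["book"].map(toy_author_map)

# Books that map() could not find come back as NaN
unmapped_books = toy_df.loc[toy_df["author"].isna(), "book"].unique().tolist()
print(unmapped_books)  # ['Tobit']

# Drop the unlabeled rows so the classifier never sees NaN labels
toy_labeled = toy_df.dropna(subset=["author"])
```

Running the same `isna()` check on the real `df['author']` column would show whether keys such as "1 Samuel (1 Kings)" match the spelling used in the dataset.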
In [9]:
# Features
Xfeatures = df['text']
ylabels = df['author']
In [10]:
# Vectorization
cv = CountVectorizer()
X = cv.fit_transform(Xfeatures)
In [13]:
# Split dataset
x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.33,random_state=42)
In [14]:
# Shape of x_train
x_train.shape
Out[14]:
(20839, 12590)
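The notebook fits the vectorizer on all verses and then splits. A possible refinement, sketched here on invented toy data, is to split the raw text first and fit the vectorizer on the training part only, so that the vocabulary is learned without looking at the test verses. The `stratify` argument additionally keeps the author proportions similar in both parts. All names prefixed with `toy_` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Invented toy data: four passages per author
toy_texts = [
    "grace and peace to the church",
    "faith hope and charity abide",
    "justified by faith not by works",
    "the body is one and hath many members",
    "the law was given in the wilderness",
    "the tabernacle in the wilderness of Sinai",
    "and the Lord spake unto Moses",
    "these are the statutes and judgments",
]
toy_authors = ["Paul"] * 4 + ["Moses"] * 4

# Split the raw text first, keeping the class balance in both parts
txt_train, txt_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_authors, test_size=0.25, random_state=42, stratify=toy_authors
)

# Learn the vocabulary from the training text only, then reuse it for the test text
toy_cv = CountVectorizer()
toy_x_train = toy_cv.fit_transform(txt_train)
toy_x_test = toy_cv.transform(txt_test)

print(toy_x_train.shape, toy_x_test.shape)
```

With a corpus as large as the KJV the effect on accuracy is probably small, but the habit avoids optimistic scores on smaller datasets.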
In [15]:
# Building Our Model
clf = MultinomialNB()
clf.fit(x_train,y_train)
Out[15]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [16]:
print("Accuracy of Model :",clf.score(x_test,y_test))
Accuracy of Model : 0.5213367108339828
In [17]:
# Alternative Method for Checking Accuracy
accuracy_score(y_test,clf.predict(x_test))
Out[17]:
0.5213367108339828
In [18]:
# Logistic Regression
logit = LogisticRegression()
logit.fit(x_train,y_train)

Out[18]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
In [19]:
print("Accuracy of Logit Model :",logit.score(x_test,y_test))
Accuracy of Logit Model : 0.5966484801247077
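A single accuracy number can hide a lot when some authors have far more verses than others. The sketch below uses made-up label lists (not output from the models above) to show how `classification_report` breaks the score down per author. The `zero_division` argument is assumed to be available, which requires scikit-learn 0.22 or newer.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels: a lazy model that always answers "Moses"
toy_true = ["Moses"] * 6 + ["Paul"] * 3 + ["Jude"] * 1
toy_pred = ["Moses"] * 10

toy_accuracy = accuracy_score(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)

# Per-author precision, recall and F1; zero_division=0 silences the
# warning for authors that were never predicted
toy_report = classification_report(
    toy_true, toy_pred, zero_division=0, output_dict=True
)
for author in ("Moses", "Paul", "Jude"):
    print(author, "recall:", toy_report[author]["recall"])
```

Here the overall accuracy looks moderate even though two of the three authors are never recognized. Calling `classification_report(y_test, clf.predict(x_test))` on the real split would show which authors the models struggle with.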
In [20]:
# Prediction
sample_verse1 = ["And after these things I heard a great voice of much people in heaven, saying, Alleluia; Salvation, and glory, and honour, and power, unto the Lord our God:For true and righteous are his judgments: for he hath judged the great whore, which did corrupt the earth with her fornication, and hath avenged the blood of his servants at her hand."]
# Rev/John
sample_verse2 = ["Now in the first year of Cyrus king of Persia, that the word of the Lord by the mouth of Jeremiah might be fulfilled, the Lord stirred up the spirit of Cyrus king of Persia, that he made a proclamation throughout all his kingdom, and put it also in writing, saying,Thus saith Cyrus king of Persia, The Lord God of heaven hath given me all the kingdoms of the earth; and he hath charged me to build him an house at Jerusalem, which is in Judah.Who is there among you of all his people? his God be with him, and let him go up to Jerusalem, which is in Judah, and build the house of the Lord God of Israel, (he is the God,) which is in Jerusalem"]
# Ezra/Ezra
sample_verse3 = ["Jesus wept"]
#John/John
In [21]:
# Vectorize them before predicting
vect = cv.transform(sample_verse1).toarray()
In [22]:
# Shows how the vectors are
vect
Out[22]:
array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
In [23]:
# Single Prediction
clf.predict(vect)
Out[23]:
array(['John, the Apostle'], dtype='<U71')
In [24]:
logit.predict(vect)
Out[24]:
array(['John, the Apostle'], dtype=object)
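`predict` only returns the top label. To see how confident a model is, `predict_proba` gives one probability per author. The following is a minimal sketch on invented toy data; `toy_cv`, `toy_logit` and the sample sentence are hypothetical and stand in for `cv`, `logit` and `vect` from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data, two passages per author
toy_texts = [
    "grace and peace to the church",
    "faith hope and charity abide",
    "the law was given in the wilderness",
    "the tabernacle in the wilderness of Sinai",
]
toy_authors = ["Paul", "Paul", "Moses", "Moses"]

toy_cv = CountVectorizer()
toy_logit = LogisticRegression(max_iter=1000)
toy_logit.fit(toy_cv.fit_transform(toy_texts), toy_authors)

# One probability per class, in the order given by classes_
toy_probs = toy_logit.predict_proba(toy_cv.transform(["grace and peace"]))[0]

# Pair each author with its probability and sort, most likely first
toy_ranked = sorted(zip(toy_logit.classes_, toy_probs), key=lambda pair: pair[1], reverse=True)
for author, prob in toy_ranked:
    print(f"{author}: {prob:.2f}")
```

For very short inputs such as "Jesus wept" the probabilities tend to be spread thinly across authors, which is a useful warning sign that the prediction rests on little evidence.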

Both models attribute this passage to John, the Apostle, which matches the expected author. Let us now try a passage from
a different Bible translation, such as the NIV.

In [31]:
# Using NIV 
sample_verse4 = ["Therefore, I urge you, brothers and sisters, in view of God’s mercy, to offer your bodies as a living sacrifice, holy and pleasing to God—this is your true and proper worship. 2 Do not conform to the pattern of this world, but be transformed by the renewing of your mind. Then you will be able to test and approve what God’s will is—his good, pleasing and perfect will."]
In [32]:
vect4 = cv.transform(sample_verse4).toarray()
In [33]:
# Single Prediction
clf.predict(vect4)
Out[33]:
array(['Paul'], dtype='<U71')

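This prediction is also correct (the passage is Romans 12:1-2), so the model carries over to another translation, at least for this passage.
Keep in mind that the test accuracy was only about 52% for Naive Bayes and 60% for Logistic Regression, so many individual predictions will be wrong.
From here we can use Lime or Eli5 to interpret our model. To see that part, check the repo on GitHub, which covers the remaining aspects.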

To conclude, we have seen how to do author attribution and Bible author prediction using Python and machine learning.
You can also check the video tutorials on the entire process below.

Thanks for your time
By Jesse E. Agbe (JCharis)
Jesus Saves
