How to summarize a text in spacy

Text Summarization Using SpaCy and Python

How to make a text summarizer in Spacy.

In this tutorial we will learn how to build a simple text summarizer with spaCy and Python, and then compare it with another summarization tool, gensim.summarization. So what is text (or document) summarization?

 

  • Text summarization is the process of extracting the most important information from a document to produce an abridged version that keeps all the important ideas.
  • The idea of summarization is to find a subset of the data that contains the “information” of the entire set.

Text summarization is one application of NLP, and we will learn how to build our own summarizer with spaCy.

 

The basic idea for creating a summary of any document includes the following:

  • Text preprocessing (remove stopwords and punctuation).
  • Build a frequency table of words (word frequency distribution): how many times each word appears in the document.
  • Score each sentence based on the words it contains and the frequency table.
  • Build the summary by joining every sentence above a certain score limit.
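
As a rough, library-free sketch of these four steps: the tiny stopword set, the regex sentence splitter, and the function name `summarize_text` are all assumptions made for illustration, and the sketch keeps the top `n` sentences rather than applying a score limit. The rest of this tutorial does the same work with spaCy.

```python
import re
from collections import Counter
from heapq import nlargest
from string import punctuation

# Hypothetical, deliberately tiny stopword list for illustration only
TOY_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}

def summarize_text(text, n=1):
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def words(sentence):
        # Step 1: lowercase, strip punctuation, drop stopwords
        tokens = (w.strip(punctuation).lower() for w in sentence.split())
        return [w for w in tokens if w and w not in TOY_STOPWORDS]

    # Step 2: word frequency table over the whole text
    freq = Counter(w for s in sentences for w in words(s))
    # Step 3: score each sentence by summing its word frequencies
    scores = {s: sum(freq[w] for w in words(s)) for s in sentences}
    # Step 4: keep the n best sentences, in their original order
    best = set(nlargest(n, scores, key=scores.get))
    return " ".join(s for s in sentences if s in best)

text = "Cats sleep. Cats and dogs play. Birds sing."
print(summarize_text(text, n=1))  # Cats and dogs play.
```

Here "cats" is the only word that appears twice, so the sentence with the most non-stopwords that also contains "cats" wins.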

Let us start

In [1]:
# Load Pkgs
import spacy
In [2]:
# Text Preprocessing Pkg
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
In [3]:
# Build a List of Stopwords
stopwords = list(STOP_WORDS)
In [4]:
document1 ="""Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics."""
In [5]:
document2 = """Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil
"""
In [7]:
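# The 'en' shortcut was removed in spaCy 3; load the small English model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')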
In [8]:
# Process the text into a spaCy Doc object
docx = nlp(document1)
In [9]:
# Tokenization of Text
mytokens = [token.text for token in docx]

Word Frequency Table

  • A dictionary of words and their counts
  • How many times each word appears in the document
  • Only non-stopwords are counted
In [10]:
# Build Word Frequency
# word.text is the token's text in spaCy
# Note: the stopword check is case-sensitive and punctuation tokens are counted too
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies:
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1
In [11]:
word_frequencies
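
The same kind of table can be built more compactly with `collections.Counter` from the standard library. The sketch below uses a hand-written token list and a three-word stopword set as stand-ins (both are assumptions) so that it runs without spaCy; with spaCy you would count `token.text for token in docx` instead.

```python
from collections import Counter

# Stand-in tokens and stopwords (hypothetical, for illustration only)
tokens = ["Machine", "learning", "is", "the", "study", "of", "learning"]
toy_stopwords = {"is", "the", "of"}

# Count every token that is not a stopword
word_counts = Counter(t for t in tokens if t not in toy_stopwords)
print(word_counts)
```

A `Counter` behaves like a dictionary, so the normalization and scoring steps that follow would work on it unchanged.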

Maximum Word Frequency

  • Find the weighted frequency of each word
  • Divide each word's count by the count of the most frequent word
  • Long sentence over short sentence
In [12]:
# Maximum Word Frequency
maximum_frequency = max(word_frequencies.values())
In [13]:
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
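
To make the normalization concrete, here is a small sketch with made-up counts (the words and numbers are assumptions, not taken from `document1`). Every count is divided by the largest count, so the most frequent word gets a weight of 1.0 and every other word falls between 0 and 1.

```python
# Hypothetical raw counts
raw_counts = {"learning": 8, "data": 4, "mining": 1}

max_count = max(raw_counts.values())
weights = {word: count / max_count for word, count in raw_counts.items()}
print(weights)  # {'learning': 1.0, 'data': 0.5, 'mining': 0.125}
```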

Word Frequency Distribution

In [14]:
# Frequency Table
word_frequencies
Out[14]:
{'Machine': 0.4444444444444444,
 'learning': 0.8888888888888888,
 '(': 0.1111111111111111,
 'ML': 0.1111111111111111,
 ')': 0.1111111111111111,
 'scientific': 0.1111111111111111,

Sentence Score and Ranking of Words in Each Sentence

  • Split the document into sentence tokens
  • Score every sentence by summing the weighted frequencies of the words it contains
  • Only non-stopwords found in our word frequency table contribute to the score
In [15]:
# Sentence Tokens
sentence_list = [ sentence for sentence in docx.sents ]
In [17]:
# Sentence Score via comparing each word with the frequency table
# Only sentences shorter than 30 words are scored
# Note: the lookup uses the lowercased token, while the table keys keep their original case
sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        if word.text.lower() in word_frequencies.keys():
            if len(sent.text.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]
Get Sentence Score
In [18]:
# Sentence Score Table
sentence_scores
Out[18]:
{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.555555555555556,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.: 7.333333333333331,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 4.111111111111112,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 4.555555555555556,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 5.777777777777778,
 In its application across business problems, machine learning is also referred to as predictive analytics.: 3.7777777777777777}
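
The scoring rule can also be tried on plain strings. In this sketch the weights and sentences are invented for illustration, and `score_sentence` is a hypothetical helper name; it simply adds up the weight of every word that appears in the frequency table and ignores the rest.

```python
# Hypothetical normalized weights (stopwords are simply absent from the table)
weights = {"learning": 1.0, "data": 0.5, "mining": 0.125}

def score_sentence(sentence, weights):
    # Sum the weights of known words; unknown words contribute 0
    return sum(weights.get(word, 0) for word in sentence.lower().split())

sentences = ["data mining uses learning", "learning is fun"]
scores = {s: score_sentence(s, weights) for s in sentences}
print(scores)  # {'data mining uses learning': 1.625, 'learning is fun': 1.0}
```

Because the score is a plain sum, sentences with more table words tend to score higher, which is one reason the tutorial skips very long sentences.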

Finding the Top N Sentences with the Largest Scores

  • using heapq
In [19]:
# Import Heapq 
from heapq import nlargest
In [20]:
summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
In [21]:
summarized_sentences
Out[21]:
[Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.,
 Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.,
 In its application across business problems, machine learning is also referred to as predictive analytics.]
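
Because only six sentences were scored, asking `nlargest` for seven returns all of them, ranked by score. A smaller `n` gives a shorter summary. The sketch below uses made-up sentences and scores to show that, and also re-sorts the chosen sentences into document order, which is an optional extra step not used in this tutorial.

```python
from heapq import nlargest

# Hypothetical sentences (in document order) and their scores
sentences = ["first", "second", "third", "fourth"]
scores = {"first": 7.3, "second": 4.5, "third": 4.1, "fourth": 9.0}

# Pick the 2 highest-scoring sentences (returned best-first)
top = nlargest(2, scores, key=scores.get)
print(top)  # ['fourth', 'first']

# Optional: restore the original document order before joining
ordered = sorted(top, key=sentences.index)
print(" ".join(ordered))  # first fourth
```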
In [ ]:
# Convert Sentences from Spacy Span to Strings for joining entire sentence
for w in summarized_sentences:
    print(w.text)
In [22]:
# List Comprehension of Sentences Converted From spaCy Span to Strings
final_sentences = [ w.text for w in summarized_sentences ]
Join sentences
In [23]:
summary = ' '.join(final_sentences)
In [24]:
summary
Out[24]:
'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. In its application across business problems, machine learning is also referred to as predictive analytics.'
In [25]:
# Length of Summary
len(summary)
Out[25]:
843
In [26]:
# Length of Original Text
len(document1)
Out[26]:
1069
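
One quick way to read these two numbers is as a compression ratio. Using the lengths reported above (843 and 1069 characters), the summary keeps roughly 79% of the original text, which reflects the fact that six of the seven sentences were kept.

```python
summary_length = 843    # len(summary) from the output above
original_length = 1069  # len(document1) from the output above

ratio = summary_length / original_length
print(f"{ratio:.0%}")  # 79%
```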

Comparing with Gensim
  • pip install gensim_sum_ext
In [37]:
# Note: the summarization module was removed in Gensim 4.0, so this import needs gensim 3.x
from gensim.summarization import summarize
In [38]:
summarize(document1)
Out[38]:
'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.'

The Gensim result is the same as the highest-scoring sentence in our spaCy summary.

You can get the full notebook and script here
Check out the video tutorial on YouTube

Thanks For Reading

By Jesse JCharis

Jesus Saves @JCharisTech
