How to make a text summarizer in spaCy

In this tutorial we will learn how to build a simple summarizer with spaCy and Python, and then compare it with another summarization tool, gensim.summarization. First, what is text (or document) summarization?

Text summarization is the process of extracting the most important information from a document to produce an abridged version that keeps its key ideas.
The idea of summarization is to find a subset of the data that carries the “information” of the entire set.

Text summarization is one of the applications of NLP, and we will learn how to create our own summarizer with spaCy.

Creating a summary of a document involves the following basic steps:

Text preprocessing (remove stopwords and punctuation).
Word frequency table (word frequency distribution): how many times each word appears in the document.
Score each sentence based on the words it contains and the frequency table.
Build the summary by joining every sentence above a certain score limit.
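Before moving to spaCy, the four steps can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the regex sentence splitter, the tiny `DEMO_STOPWORDS` set and the `summarize_text` helper are hypothetical stand-ins, not part of spaCy or of the notebook that follows.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stopword set (an assumption; spaCy's STOP_WORDS is far larger).
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "on", "it"}

def summarize_text(text, n_sentences=1, stopwords=DEMO_STOPWORDS):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: preprocessing - lowercase the words, drop punctuation and stopwords.
    def content_words(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in stopwords]

    # Step 2: frequency table, normalized by the most frequent word.
    counts = Counter(w for s in sentences for w in content_words(s))
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}

    # Step 3: score each sentence as the sum of its word weights.
    scores = {s: sum(weights[w] for w in content_words(s)) for s in sentences}

    # Step 4: keep the n best sentences and join them.
    return " ".join(nlargest(n_sentences, scores, key=scores.get))

demo = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(summarize_text(demo, n_sentences=1))
```

In the demo, the second sentence wins because it contains the most high-frequency words, which also shows how this kind of scoring tends to reward longer sentences.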

Let us start

In [1]:

# Load Pkgs
import spacy

In [2]:

# Text Preprocessing Pkg
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [3]:

# Build a List of Stopwords
stopwords = list(STOP_WORDS)

In [4]:

document1 ="""Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.In its application across business problems, machine learning is also referred to as predictive analytics."""

In [5]:

document2 = """Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil
"""

In [7]:

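# The 'en' shortcut was removed in spaCy 3; load the model by its full package name
# (older spaCy versions also accept spacy.load('en'))
nlp = spacy.load('en_core_web_sm')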

In [8]:

# Build an NLP Object
docx = nlp(document1)

In [9]:

# Tokenization of Text
mytokens = [token.text for token in docx]

Word Frequency Table

A dictionary of words and their counts
How many times each word appears in the document
Only non-stopwords are counted

In [10]:

# Build Word Frequency
# word.text is the raw text of each spaCy token
# Note: the check is case-sensitive, and punctuation tokens are still counted
word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies:
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1
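The loop above counts tokens exactly as written, so 'Machine' and 'machine' are separate entries and punctuation such as '(' gets counted too. A possible variant, shown here only as a sketch, lowercases each token and skips punctuation before counting. The `count_words` helper and the demo data are hypothetical names introduced for illustration.

```python
from collections import Counter
from string import punctuation

def count_words(tokens, stopwords):
    """Count lowercased tokens, skipping stopwords, punctuation and whitespace."""
    counts = Counter()
    for tok in tokens:
        word = tok.lower().strip()
        # Skip empty strings, stopwords, and tokens made only of punctuation.
        if not word or word in stopwords or all(ch in punctuation for ch in word):
            continue
        counts[word] += 1
    return counts

# Hypothetical token list, similar to what [token.text for token in docx] yields.
demo_tokens = ["Machine", "learning", "(", "ML", ")", "is", "the",
               "study", "of", "machine", "learning", "."]
demo_stopwords = {"is", "the", "of"}
print(count_words(demo_tokens, demo_stopwords))
```

With the notebook's objects this would be called as `count_words([token.text for token in docx], STOP_WORDS)`. The numbers it produces would differ from the outputs shown in this tutorial, which come from the case-sensitive loop.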

In [11]:

word_frequencies

Maximum Word Frequency

Find the weighted frequency
Divide each word's count by the count of the most frequent word
Summing these weights per sentence favors long sentences over short ones

In [12]:

# Maximum Word Frequency
maximum_frequency = max(word_frequencies.values())

In [13]:

for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
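As a small worked example of this normalization: a word that appears once ends up with a weight of 1/max, so the value 0.111... shown for '(' in the frequency table implies that the most frequent token occurred 9 times, and 'learning' at 0.888... occurred 8 times. The sketch below reproduces that arithmetic with made-up counts. `normalize` and `most_common_token` are hypothetical names, and the function builds a new dictionary instead of updating in place.

```python
def normalize(counts):
    """Divide every count by the largest count so all values fall in (0, 1]."""
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Hypothetical raw counts, chosen so that the maximum is 9.
raw_counts = {"learning": 8, "Machine": 4, "ML": 1, "most_common_token": 9}
weights = normalize(raw_counts)
print(weights)
```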

Word Frequency Distribution

In [14]:

# Frequency Table
word_frequencies

Out[14]:

{'Machine': 0.4444444444444444,
 'learning': 0.8888888888888888,
 '(': 0.1111111111111111,
 'ML': 0.1111111111111111,
 ')': 0.1111111111111111,
 'scientific': 0.1111111111111111,

Sentence Score and Ranking of Words in Each Sentence

Split the document into sentence tokens
Score every sentence based on the words it contains
Only non-stopwords in our word frequency table contribute to the score

In [15]:

# Sentence Tokens
sentence_list = [ sentence for sentence in docx.sents ]

In [17]:

# Sentence score: sum the weighted frequency of every word in the sentence
# Note: lookups are lowercased, so capitalized keys such as 'Machine' are never matched
sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        if word.text.lower() in word_frequencies:
            # Only score sentences with fewer than 30 space-separated words
            if len(sent.text.split(' ')) < 30:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]
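The same scoring rule can be written as a small function over plain strings, which may make the logic easier to follow. This is a sketch, not the notebook's code: `score_sentences` is a hypothetical helper, it assumes the weight table uses lowercase keys, and it strips only trailing periods and commas instead of relying on spaCy's tokenizer.

```python
def score_sentences(sentences, weights, max_words=30):
    """Map each sentence string to the sum of the weights of its words.

    Sentences with max_words or more space-separated words are skipped,
    mirroring the length check used in the cell above.
    """
    scores = {}
    for sentence in sentences:
        words = sentence.split(" ")
        if len(words) >= max_words:
            continue
        # Unknown words (stopwords, for instance) contribute nothing.
        total = sum(weights.get(w.lower().strip(".,"), 0.0) for w in words)
        if total > 0:
            scores[sentence] = total
    return scores

demo_weights = {"cats": 1.0, "chase": 1.0, "mice": 1.0, "dogs": 0.5}
demo_sentences = ["Cats chase mice.", "Dogs chase cats and mice.", "The end."]
demo_scores = score_sentences(demo_sentences, demo_weights)
print(demo_scores)
```

A sentence with no known words, such as "The end." here, receives no entry at all, which matches the behavior of the nested loop.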

Get Sentence Score

In [18]:

# Sentence Score Table
sentence_scores

Out[18]:

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.555555555555556,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.: 7.333333333333331,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 4.111111111111112,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 4.555555555555556,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 5.777777777777778,
 In its application across business problems, machine learning is also referred to as predictive analytics.: 3.7777777777777777}

Finding Top N Sentences with Largest Score

Using heapq

In [19]:

# Import Heapq 
from heapq import nlargest

In [20]:

summarized_sentences = nlargest(7, sentence_scores, key=sentence_scores.get)
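Only six sentences received a score above, so asking `nlargest` for 7 returns all of them, ordered from highest to lowest score. A smaller `n` is what actually shortens the summary. The toy dictionary below (hypothetical labels and scores) illustrates both cases.

```python
from heapq import nlargest

# Hypothetical scores standing in for sentence_scores.
demo_scores = {"sentence A": 4.5, "sentence B": 7.3, "sentence C": 4.1}

# Asking for more items than exist simply returns everything, best first.
top_all = nlargest(7, demo_scores, key=demo_scores.get)
print(top_all)

# Asking for fewer keeps only the top scorers.
top_two = nlargest(2, demo_scores, key=demo_scores.get)
print(top_two)
```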

In [21]:

summarized_sentences

Out[21]:

[Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.,
 Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.,
 In its application across business problems, machine learning is also referred to as predictive analytics.]

In [ ]:

# Convert Sentences from Spacy Span to Strings for joining entire sentence
for w in summarized_sentences:
    print(w.text)

In [22]:

# List Comprehension of Sentences Converted From Spacy.span to strings
final_sentences = [ w.text for w in summarized_sentences ]

Join sentences

In [23]:

summary = ' '.join(final_sentences)
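Joining the sentences in score order means the summary no longer follows the order of the original text. If that matters, one option is to select the top sentences first and then restore document order before joining. The helper below is a hypothetical sketch using plain strings; the same idea should carry over to spaCy spans, since the notebook already uses them as dictionary keys.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore document order."""
    chosen = set(nlargest(n, scores, key=scores.get))
    return [s for s in sentences if s in chosen]

demo_sentences = ["First point.", "Minor aside.", "Key result."]
demo_scores = {"First point.": 3.0, "Minor aside.": 1.0, "Key result.": 5.0}
ordered = top_sentences_in_order(demo_sentences, demo_scores, n=2)
print(" ".join(ordered))
```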

In [24]:

summary

Out[24]:

'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. In its application across business problems, machine learning is also referred to as predictive analytics.'

In [25]:

# Length of Summary
len(summary)

Out[25]:

In [26]:

# Length of Original Text
len(document1)

Out[26]:

1069

Comparing with Gensim

In [ ]:

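# Comparing with Gensim
# pip install gensim_sum_ext
# Note: gensim.summarization was removed in Gensim 4.0, so the import below needs an older Gensim release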

In [37]:

from gensim.summarization import summarize

In [38]:

summarize(document1)

Out[38]:

'Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.'

In [ ]:

## Almost the same as our spaCy summarizer: Gensim returns the sentence that received our highest score

You can get the full notebook and script here
Check out the video tutorial on YouTube

Thanks For Reading

By Jesse JCharis

Jesus Saves @JCharisTech
