Automatic Redaction of Documents Using spaCy’s Named Entity Recognition

In this tutorial we will see how to use spaCy for document redaction and sanitization. So what is document sanitization, or redaction?

Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it), so that the document may be distributed to a broader audience.

This process serves three purposes:

Keeping the source of a document anonymous
Ensuring there is no sensitive or personally identifiable information in the document
Censorship

Let us see how to achieve this using Named Entity Recognition.

In [1]:

# Load NLP Pkg
import spacy

In [2]:

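# Create NLP object (install the model first: python -m spacy download en_core_web_sm)
# The 'en' shortcut only works in spaCy 2; spaCy 3 requires the full model name
nlp = spacy.load('en_core_web_sm')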

Procedure

Run NER on the text
Locate the entities, e.g. Person or Place
Replace them with our own word

In [3]:

ex1 = "The reporter said that it was John Mark that gave him the news in London last year"

In [4]:

docx1 = nlp(ex1)

In [5]:

# Find Entities
for ent in docx1.ents:
    print(ent.text,ent.label_)

John Mark PERSON
London GPE
last year DATE

In [6]:

# Function to Sanitize/Redact Names
def sanitize_names(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into one token
    # (ent.merge() and token.string were removed in spaCy 3)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'PERSON':
            # Keep the trailing whitespace so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)

In [7]:

ex1

Out[7]:

'The reporter said that it was John Mark that gave him the news in London last year'

In [8]:

# Redact the Names
sanitize_names(ex1)

Out[8]:

'The reporter said that it was [REDACTED] that gave him the news in London last year'

In [9]:

# Visualization of Entities
from spacy import displacy

In [10]:

displacy.render(nlp(ex1),style='ent',jupyter=True)

[displaCy entity visualization: the sentence with John Mark (PERSON), London (GPE) and last year (DATE) highlighted]

In [11]:

# Apply the function and visualize it
docx2 = sanitize_names(ex1)

In [12]:

displacy.render(nlp(docx2),style='ent',jupyter=True)

[displaCy entity visualization: the redacted sentence with London (GPE) and last year (DATE) highlighted]

Redaction/Sanitization of Location/GPE

In [13]:

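# Redaction of Location/GPE
def sanitize_locations(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into one token
    # (ent.merge() and token.string were removed in spaCy 3)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'GPE':
            # Keep the trailing whitespace so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)

In [14]:

sanitize_locations(ex1)

Out[14]:

'The reporter said that it was John Mark that gave him the news in [REDACTED] last year'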

This is a very basic approach which can be customized to do some interesting document redaction.
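
One possible customization is to give each entity type its own placeholder and handle all of them in a single pass. The sketch below is not the method used above: instead of merging tokens, it works on character offsets. The helper name redact_spans, the placeholder strings, and the hand-written spans list are all assumptions made for illustration; with spaCy, the spans could be built as [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(ex1).ents]. It also assumes the spans do not overlap, which holds for the entities in a spaCy doc.

```python
def redact_spans(text, spans, placeholders):
    """Replace each labelled span with the placeholder for its label.

    spans: iterable of (start_char, end_char, label) tuples, non-overlapping.
    placeholders: dict mapping a label to its replacement string; labels
    that are missing from the dict are left untouched.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in placeholders:
            continue
        pieces.append(text[cursor:start])   # text before the entity, unchanged
        pieces.append(placeholders[label])  # the replacement itself
        cursor = end
    pieces.append(text[cursor:])            # whatever follows the last entity
    return "".join(pieces)


ex1 = "The reporter said that it was John Mark that gave him the news in London last year"

# Hand-written stand-in for the entities reported earlier for ex1
# (hypothetical input; a real pipeline would take these from the NER step)
found = [("John Mark", "PERSON"), ("London", "GPE"), ("last year", "DATE")]
spans = [(ex1.index(s), ex1.index(s) + len(s), label) for s, label in found]

redacted = redact_spans(ex1, spans, {"PERSON": "[NAME]", "GPE": "[PLACE]"})
print(redacted)
# The reporter said that it was [NAME] that gave him the news in [PLACE] last year
```

Because only the characters inside each span are replaced, the spacing around every placeholder stays exactly as it was in the original sentence, and adding a new entity type is a matter of adding one entry to the placeholders dict.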


Thanks For Reading

Jesus Saves

By Jesse E.Agbe (JCharis)
