Document Redaction & Sanitization Using spaCy’s Named Entity Recognition

Automatic Redaction of Documents Using spaCy’s Named Entity Recognition

In this tutorial we will see how to use spaCy for document redaction and sanitization. So what is document sanitization, or redaction?

Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it), so that the document may be distributed to a broader audience.

The purposes of this process are:

  • To keep the source of a document anonymous
  • To ensure there is no sensitive or personally identifiable information in the document
  • Censorship

Let us see how to achieve this using Named Entity Recognition.

In [1]:
# Load NLP Pkg
import spacy
In [2]:
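# Create NLP object (the 'en' shortcut was removed in spaCy 3; install the
# model first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')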
Procedure
  • Run NER on the text
  • Locate entities, e.g. PERSON or GPE (a place)
  • Replace them with our own word
In [3]:
ex1 = "The reporter said that it was John Mark that gave him the news in London last year"
In [4]:
docx1 = nlp(ex1)
In [5]:
# Find Entities
for ent in docx1.ents:
    print(ent.text,ent.label_)
John Mark PERSON
London GPE
last year DATE
In [6]:
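# Function to Sanitize/Redact Names
def sanitize_names(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into a single token
    # (Span.merge() was removed in spaCy 3; the retokenizer replaces it)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent, attrs={"ent_type": ent.label})
    for token in docx:
        if token.ent_type_ == 'PERSON':
            # Keep the trailing whitespace so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            # token.string was removed in spaCy 3; text_with_ws is equivalent
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)
In [7]:
ex1
Out[7]:
'The reporter said that it was John Mark that gave him the news in London last year'
In [8]:
# Redact the Names
sanitize_names(ex1)
Out[8]:
'The reporter said that it was [REDACTED] that gave him the news in London last year'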
In [9]:
# Visualization of Entities
from spacy import displacy
In [10]:
displacy.render(nlp(ex1),style='ent',jupyter=True)
The reporter said that it was John Mark [PERSON] that gave him the news in London [GPE] last year [DATE]
In [11]:
# Apply the function and visualize it
docx2 = sanitize_names(ex1)
In [12]:
displacy.render(nlp(docx2),style='ent',jupyter=True)
[displaCy output: the redacted sentence, with London and last year highlighted as entities]

Redaction/Sanitization of Location/GPE

In [13]:
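# Redaction of Location/GPE
def sanitize_locations(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into a single token
    # (Span.merge() was removed in spaCy 3; the retokenizer replaces it)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent, attrs={"ent_type": ent.label})
    for token in docx:
        if token.ent_type_ == 'GPE':
            # Keep the trailing whitespace so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            # token.string was removed in spaCy 3; text_with_ws is equivalent
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)
In [14]:
sanitize_locations(ex1)
Out[14]:
'The reporter said that it was John Mark that gave him the news in [REDACTED] last year'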
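
The name and location functions differ only in the label they test, so they could be folded into one helper that takes the labels to hide. The sketch below is an illustration and not part of the original notebook: `redact_entities` and `span` are hypothetical names, and the helper works on plain `(start_char, end_char, label)` tuples so that it runs without a model. With spaCy, those tuples could be built from `ent.start_char`, `ent.end_char` and `ent.label_` for each `ent` in `doc.ents`. Here the spans are written by hand to mirror the entities printed earlier.

```python
def redact_entities(text, entities, labels, placeholder="[REDACTED]"):
    """Replace every entity whose label is in `labels` with `placeholder`.

    `entities` is a list of (start_char, end_char, label) tuples.
    """
    pieces = []
    cursor = 0
    # Walk the entities left to right, copying the text between them
    for start, end, label in sorted(entities):
        if label not in labels:
            continue
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span(text, phrase, label):
    """Hypothetical helper: build an entity tuple for the first match of `phrase`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


ex1 = "The reporter said that it was John Mark that gave him the news in London last year"

# Hand-written stand-in for the NER output (an assumption, no model is run here)
entities = [
    span(ex1, "John Mark", "PERSON"),
    span(ex1, "London", "GPE"),
    span(ex1, "last year", "DATE"),
]

print(redact_entities(ex1, entities, {"PERSON", "GPE"}))
# The reporter said that it was [REDACTED] that gave him the news in [REDACTED] last year
```

Working on character offsets rather than merged tokens leaves the surrounding whitespace untouched, which is one way to avoid gluing the placeholder to the next word.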

 

This is a very basic approach, but it can be customized for more interesting document redaction tasks.
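
One possible customization is to give each distinct entity a numbered placeholder, so a reader can still tell when two mentions refer to the same person. The following is only a sketch under stated assumptions: `pseudonymize` is a hypothetical helper, the sentence is invented for illustration, and the entity spans are written by hand as `(start_char, end_char, label)` tuples. With spaCy they could come from `ent.start_char`, `ent.end_char` and `ent.label_` on `doc.ents`.

```python
def pseudonymize(text, entities, labels):
    """Replace entities with numbered placeholders such as [PERSON_1].

    The same surface string always maps to the same placeholder.
    `entities` is a list of (start_char, end_char, label) tuples.
    """
    aliases = {}   # (label, entity text) -> placeholder
    counters = {}  # label -> number of distinct entities seen so far
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if label not in labels:
            continue
        key = (label, text[start:end])
        if key not in aliases:
            counters[label] = counters.get(label, 0) + 1
            aliases[key] = "[{}_{}]".format(label, counters[label])
        pieces.append(text[cursor:start])
        pieces.append(aliases[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Illustrative sentence and hand-written entity spans (assumptions, no model is run)
text = "John Mark met Mary Ann in London. Later John Mark left."
entities = []
for phrase, label in [("John Mark", "PERSON"), ("Mary Ann", "PERSON"), ("London", "GPE")]:
    start = text.find(phrase)
    while start != -1:
        entities.append((start, start + len(phrase), label))
        start = text.find(phrase, start + 1)

print(pseudonymize(text, entities, {"PERSON"}))
# [PERSON_1] met [PERSON_2] in London. Later [PERSON_1] left.
```

Note that this matches on the exact entity text, so "John Mark" and "Mr. Mark" would receive different placeholders.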


Thanks for reading

Jesus Saves

By Jesse E. Agbe (JCharis)
