Automatic Redaction of Documents Using spaCy's Named Entity Recognition
In this tutorial we will see how to use spaCy for document redaction and sanitization. So what is document sanitization, or redaction?
Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it), so that the document may be distributed to a broader audience.
This process serves several purposes:
- To keep the source of a document anonymous
- To ensure the document contains no sensitive or personally identifiable information
- Censorship
Let us see how to achieve this using Named Entity Recognition.
In [1]:
# Load NLP Pkg
import spacy
In [2]:
# Create NLP object
# (the 'en' shortcut was removed in spaCy 3, so load the model by its full name)
nlp = spacy.load('en_core_web_sm')
Procedure
- Run NER over the text
- Locate the entities, e.g. PERSON or GPE (place)
- Replace them with our own placeholder word
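Before running a model, the three steps above can be sketched in plain Python. This is only an illustration, not the approach used in the rest of this tutorial: the function names `redact_spans` and `find_span` are hypothetical, and the spans are written by hand to stand in for NER output. With spaCy, the same tuples could be built from `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def redact_spans(text, spans, labels=("PERSON",), placeholder="[REDACTED]"):
    """Replace every (start, end, label) span whose label is in `labels`."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels:
            continue
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(placeholder)         # the entity itself is dropped
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last entity
    return "".join(pieces)


def find_span(text, phrase, label):
    """Hypothetical helper: build a span by searching for a known phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


sample = "The reporter said that it was John Mark that gave him the news in London last year"
# Hand-written stand-ins for what an NER model would return
sample_spans = [
    find_span(sample, "John Mark", "PERSON"),
    find_span(sample, "London", "GPE"),
]

names_hidden = redact_spans(sample, sample_spans)
places_hidden = redact_spans(sample, sample_spans, labels=("GPE",))
print(names_hidden)
print(places_hidden)
```

Working on character offsets leaves the text outside the entities untouched, so spacing and punctuation survive exactly as written.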
In [3]:
ex1 = "The reporter said that it was John Mark that gave him the news in London last year"
In [4]:
docx1 = nlp(ex1)
In [5]:
# Find Entities
for ent in docx1.ents:
    print(ent.text, ent.label_)
In [6]:
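# Function to Sanitize/Redact Names
def sanitize_names(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-word entity into a single token
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'PERSON':
            # Keep the trailing space so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)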
In [7]:
ex1
Out[7]:
'The reporter said that it was John Mark that gave him the news in London last year'
In [8]:
# Redact the Names
sanitize_names(ex1)
Out[8]:
In [9]:
# Visualization of Entities
from spacy import displacy
In [10]:
displacy.render(nlp(ex1),style='ent',jupyter=True)
In [11]:
# Apply the function and visualize it
docx2 = sanitize_names(ex1)
In [12]:
displacy.render(nlp(docx2),style='ent',jupyter=True)
Redaction/Sanitization of Location/GPE
In [13]:
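# Redaction of Location/GPE
def sanitize_locations(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-word entity into a single token
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'GPE':
            # Keep the trailing space so the next word is not glued on
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)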
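The two functions above differ only in the entity label they look for, so one possible refactor is a single function that takes the labels as a parameter. The sketch below is an assumption on my part rather than part of the original notebook: `redact_tokens` and `fake_token` are hypothetical names, and the stand-in tokens only mimic the spaCy `Token` attributes `text`, `whitespace_`, `ent_type_` and `ent_iob_` so the example runs without a model. Passing `nlp(text)` in place of the fake tokens should work the same way.

```python
from types import SimpleNamespace


def redact_tokens(tokens, labels=("PERSON", "GPE"), placeholder="[REDACTED]"):
    """Rebuild the text, replacing entities whose label is in `labels`.

    `tokens` may be a spaCy Doc; only `text`, `whitespace_`, `ent_type_`
    and `ent_iob_` are read from each token, and no merging is needed.
    """
    pieces = []
    for token in tokens:
        if token.ent_type_ in labels:
            if token.ent_iob_ == "B":
                # First token of an entity: emit one placeholder
                pieces.append(placeholder + token.whitespace_)
            else:
                # Later token of the same entity: keep the single placeholder
                # but take the whitespace that follows this token
                pieces[-1] = placeholder + token.whitespace_
        else:
            pieces.append(token.text + token.whitespace_)
    return "".join(pieces)


def fake_token(text, ent_type="", ent_iob="O", whitespace=" "):
    """Hypothetical stand-in for a spaCy Token, for illustration only."""
    return SimpleNamespace(text=text, whitespace_=whitespace,
                           ent_type_=ent_type, ent_iob_=ent_iob)


tokens = [
    fake_token("It"),
    fake_token("was"),
    fake_token("John", "PERSON", "B"),
    fake_token("Mark", "PERSON", "I"),
    fake_token("in"),
    fake_token("London", "GPE", "B", whitespace=""),
]

print(redact_tokens(tokens, labels=("PERSON",)))
print(redact_tokens(tokens))
```

Using the `B`/`I` tags means a multi-word name still collapses to one placeholder, without modifying the document's tokenization.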
In [14]:
sanitize_locations(ex1)
Out[14]:
This is a very basic approach, but it can be customized for more interesting document redaction, for example by targeting other entity labels or by changing the replacement text.
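As one example of such customization, each entity label could get its own replacement tag, so the reader still knows what kind of information was removed. This is a minimal sketch under stated assumptions: `redact_with_tags` is a hypothetical name, the tag strings are arbitrary, and the spans are hand-written stand-ins for the `(ent.start_char, ent.end_char, ent.label_)` values a spaCy document would provide.

```python
def redact_with_tags(text, spans, tags):
    """Replace each (start, end, label) span with the tag for its label.

    Labels that are missing from `tags` are left untouched.
    """
    result = text
    # Go right to left so earlier offsets stay valid after each replacement
    for start, end, label in sorted(spans, reverse=True):
        replacement = tags.get(label)
        if replacement is not None:
            result = result[:start] + replacement + result[end:]
    return result


note = "John Mark met the reporter in London"
name_start = note.index("John Mark")
place_start = note.index("London")
# Hand-written stand-ins for NER output
note_spans = [
    (name_start, name_start + len("John Mark"), "PERSON"),
    (place_start, place_start + len("London"), "GPE"),
]

tagged = redact_with_tags(note, note_spans, {"PERSON": "[NAME]", "GPE": "[PLACE]"})
print(tagged)
```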
Thanks For Reading
Jesus Saves
By Jesse E.Agbe (JCharis)