Overview¶
This notebook provides an overview of the package and its functionality.
# used to hide warnings
import warnings
warnings.filterwarnings("ignore")
Let us first define the text that we will use to showcase the package's functionality.
original_text = """\
Medical Record
Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
Extract personal information from text¶
The anonipy package provides entity extraction components that can be used to extract personal information from text.
More details can be found in the chapter Extractors.
Language detector¶
from anonipy.utils.language_detector import LanguageDetector
lang_detector = LanguageDetector()
# identify the language of the original text
language = lang_detector(original_text)
language
('en', 'English')
Extract personal information¶
from anonipy.anonymize.extractors import EntityExtractor
# define the labels to be extracted and anonymized
labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
# language taken from the language detector
entity_extractor = EntityExtractor(labels, lang=language, score_th=0.5)
# extract the entities from the original text
doc, entities = entity_extractor(original_text)
# display the entities in the original text
entity_extractor.display(doc)
Patient Name: John Doe name
Date of Birth: 15-01-1985 date of birth
Date of Examination: 20-05-2024 date
Social Security Number: 123-45-6789 social security number
Examination Procedure:
John Doe name underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 date
The metadata of the extracted entities is available in the entities variable:
entities
[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'),
 Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'),
 Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})')]
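Each entity carries the regex associated with its type. As a quick sanity check, these patterns can be tried directly with Python's re module. The sketch below is independent of anonipy: the patterns are copied from the entity metadata, while the helper name matches is made up for illustration.

```python
import re

# patterns copied from the entity metadata
SSN_REGEX = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
DATE_REGEX = (
    r"(\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})"
    r"|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})"
)


def matches(pattern, value):
    """Return True if the whole value conforms to the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, value) is not None


ssn_ok = matches(SSN_REGEX, "123-45-6789")    # entity text vs. its own pattern
date_ok = matches(DATE_REGEX, "15-01-1985")   # entity text vs. its own pattern
ssn_as_date = matches(DATE_REGEX, "123-45-6789")  # a full SSN is not a valid date

print(ssn_ok, date_ok, ssn_as_date)
```

Using a full match matters here: a plain search with the date pattern would also find a date-like fragment inside the social security number.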
Anonymize the original text¶
The anonipy package provides generators for different types of information that can be used to generate replacements for the entities found in the original text.
More on generators can be found in the chapter Generators, while the chapter Strategies describes strategies for anonymizing the original text.
Prepare generators for generating replacements¶
from anonipy.anonymize.generators import (
LLMLabelGenerator,
DateGenerator,
NumberGenerator,
)
# initialize the generators
llm_generator = LLMLabelGenerator()
date_generator = DateGenerator()
number_generator = NumberGenerator()
Loading checkpoint shards: 100%|██████████| 4/4 [00:21<00:00, 5.44s/it]
# prepare the anonymization mapping
def anonymization_mapping(text, entity):
    if entity.type == "string":
        return llm_generator.generate(entity, temperature=0.7)
    if entity.label == "date":
        return date_generator.generate(entity, output_gen="middle_of_the_month")
    if entity.label == "date of birth":
        return date_generator.generate(entity, output_gen="middle_of_the_year")
    if entity.label == "social security number":
        return number_generator.generate(entity)
    return "[REDACTED]"
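The two output_gen options are easiest to understand from their effect. Judging only by the results shown later in this notebook (20-05-2024 becomes 15-05-2024, and 15-01-1985 becomes 01-07-1985), "middle_of_the_month" appears to keep the month and year and move the day to the 15th, while "middle_of_the_year" appears to keep only the year and move the date to 1 July. The following standard-library sketch mimics that behaviour. It is an illustration, not the DateGenerator implementation; the function names and the fixed DD-MM-YYYY format are assumptions.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # assumed format, matching the dates in the example text


def middle_of_the_month(date_str):
    """Keep month and year, move the day to the 15th (assumed behaviour)."""
    date = datetime.strptime(date_str, DATE_FORMAT)
    return date.replace(day=15).strftime(DATE_FORMAT)


def middle_of_the_year(date_str):
    """Keep the year, move the date to 1 July (assumed behaviour)."""
    date = datetime.strptime(date_str, DATE_FORMAT)
    return date.replace(month=7, day=1).strftime(DATE_FORMAT)


print(middle_of_the_month("20-05-2024"))  # 15-05-2024
print(middle_of_the_year("15-01-1985"))   # 01-07-1985
```

Under this reading, a date that already falls on the 15th maps to itself, so generalization alone can leave a value unchanged.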
Anonymize the original text¶
from anonipy.anonymize.strategies import PseudonymizationStrategy
# initialize the pseudonymization strategy
pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)
# anonymize the original text
anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)
The anonymized text is:
print(anonymized_text)
Medical Record
Patient Name: Ethan Lane
Date of Birth: 01-07-1985
Date of Examination: 15-05-2024
Social Security Number: 588-85-9388
Examination Procedure:
Ethan Lane underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
And the associated replacements are:
replacements
[{'original_text': '15-11-2024', 'label': 'date', 'start_index': 717, 'end_index': 727, 'anonymized_text': '15-11-2024'},
 {'original_text': 'John Doe', 'label': 'name', 'start_index': 157, 'end_index': 165, 'anonymized_text': 'Ethan Lane'},
 {'original_text': '123-45-6789', 'label': 'social security number', 'start_index': 121, 'end_index': 132, 'anonymized_text': '588-85-9388'},
 {'original_text': '20-05-2024', 'label': 'date', 'start_index': 86, 'end_index': 96, 'anonymized_text': '15-05-2024'},
 {'original_text': '15-01-1985', 'label': 'date of birth', 'start_index': 54, 'end_index': 64, 'anonymized_text': '01-07-1985'},
 {'original_text': 'John Doe', 'label': 'name', 'start_index': 30, 'end_index': 38, 'anonymized_text': 'Ethan Lane'}]
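Note that the replacements are listed from the last entity to the first. One plausible reason is that applying index-based replacements from right to left keeps the earlier start_index and end_index values valid, even when a replacement is longer or shorter than the text it replaces. The sketch below shows the idea in plain Python; apply_replacements is a hypothetical helper and the sample text is made up, so it should not be read as the package's implementation.

```python
def apply_replacements(text, replacements):
    """Apply index-based replacements from right to left (hypothetical helper)."""
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    for r in ordered:
        # text before the span + replacement + text after the span
        text = text[: r["start_index"]] + r["anonymized_text"] + text[r["end_index"] :]
    return text


sample = "Name: John Doe, SSN: 123-45-6789"
sample_replacements = [
    {"start_index": 6, "end_index": 14, "anonymized_text": "Ethan Lane"},
    {"start_index": 21, "end_index": 32, "anonymized_text": "588-85-9388"},
]

result = apply_replacements(sample, sample_replacements)
print(result)  # Name: Ethan Lane, SSN: 588-85-9388
```

Here "Ethan Lane" is two characters longer than "John Doe", yet the indices of the social security number stay correct because that span is replaced first.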
Fixing the anonymized text¶
If the anonymized text is not suitable, we can fix it using the anonymize function from the anonipy.anonymize module.
To do this, let us define a new set of replacements.
We can edit existing replacements by changing their anonymized_text value, remove the ones that are not suitable, and add missing ones.
Note that entries in the new set do not require the original_text and label values.
new_replacements = [
    {
        "start_index": 30,
        "end_index": 38,
        "anonymized_text": "Mark Strong",
    },
    {
        "original_text": "20-05-2024",
        "label": "date",
        "start_index": 86,
        "end_index": 96,
        "anonymized_text": "18-05-2024",
    },
    {
        "original_text": "123-45-6789",
        "label": "social security number",
        "start_index": 121,
        "end_index": 132,
        "anonymized_text": "119-88-7014",
    },
    {
        "original_text": "John Doe",
        "label": "name",
        "start_index": 157,
        "end_index": 165,
        "anonymized_text": "Mark Strong",
    },
]
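Instead of typing such a list by hand, an edited set can also be derived from the replacements returned earlier. The sketch below uses a shortened stand-in list; only the dictionary keys come from the output shown in this notebook, while the names previous, overrides, dropped, and edited are hypothetical.

```python
# shortened stand-in for the replacements returned by the strategy
previous = [
    {"original_text": "15-11-2024", "label": "date",
     "start_index": 717, "end_index": 727, "anonymized_text": "15-11-2024"},
    {"original_text": "John Doe", "label": "name",
     "start_index": 157, "end_index": 165, "anonymized_text": "Ethan Lane"},
    {"original_text": "John Doe", "label": "name",
     "start_index": 30, "end_index": 38, "anonymized_text": "Ethan Lane"},
]

# new values keyed by the original text (illustrative choice)
overrides = {"John Doe": "Mark Strong"}
# start indices of the replacements to remove entirely
dropped = {717}

edited = sorted(
    (
        # copy each entry, swapping in the override when one exists
        {**r, "anonymized_text": overrides.get(r["original_text"], r["anonymized_text"])}
        for r in previous
        if r["start_index"] not in dropped
    ),
    key=lambda r: r["start_index"],
)
print(edited)
```

The hand-written new_replacements list is the one used in the rest of this notebook.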
Now, let us anonymize the original text using the new replacements.
from anonipy.anonymize import anonymize
# anonymize the original text using the new replacements
anonymized_text, replacements = anonymize(original_text, new_replacements)
print(anonymized_text)
Medical Record
Patient Name: Mark Strong
Date of Birth: 15-01-1985
Date of Examination: 18-05-2024
Social Security Number: 119-88-7014
Examination Procedure:
Mark Strong underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
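After a manual fix it is worth checking that none of the original values survive in the output. In the text above, for example, the date of birth is back to its original value because it was left out of new_replacements. The following plain-Python sketch performs such a check; remaining_originals is a hypothetical helper and the sample strings are shortened versions of the text in this notebook.

```python
def remaining_originals(anonymized_text, original_values):
    """Return the original values that still appear in the anonymized text."""
    return [value for value in original_values if value in anonymized_text]


anonymized_sample = (
    "Patient Name: Mark Strong\n"
    "Date of Birth: 15-01-1985\n"
    "Social Security Number: 119-88-7014\n"
)
original_values = ["John Doe", "15-01-1985", "123-45-6789"]

leaks = remaining_originals(anonymized_sample, original_values)
print(leaks)  # ['15-01-1985']
```

This is only an exact substring check, so it will also flag values that a generator legitimately maps to themselves.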