Strategies
This chapter showcases how to use the anonymization strategies in the package.
The main motivation behind the anonymization strategies is to streamline the process of data anonymization. The anonipy package implements three strategies (RedactionStrategy, MaskingStrategy, and PseudonymizationStrategy), which can be found in the anonipy.anonymize.strategies module.
Each strategy has an anonymize method, which returns the anonymized text together with a list of replacements showing which parts of the text were anonymized and what they were replaced with.
# used to hide warnings
import warnings
warnings.filterwarnings("ignore")
Let us first define the text and the associated entities, as seen in the previous chapter (see Extractors).
original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.

Next Examination Date:
15-11-2024
"""
Normally, the entities are extracted using the EntityExtractor. For this section, we define the entities manually:
from anonipy.definitions import Entity
entities = [
    Entity(
        text="John Doe",
        label="name",
        start_index=30,
        end_index=38,
        score=1.0,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-01-1985",
        label="date of birth",
        start_index=54,
        end_index=64,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="20-05-2024",
        label="date",
        start_index=86,
        end_index=96,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="123-45-6789",
        label="social security number",
        start_index=121,
        end_index=132,
        score=1.0,
        type="custom",
        regex="[0-9]{3}-[0-9]{2}-[0-9]{4}",
    ),
    Entity(
        text="John Doe",
        label="name",
        start_index=157,
        end_index=165,
        score=1.0,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-11-2024",
        label="date",
        start_index=717,
        end_index=727,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
]
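Because the entities are defined by hand, it is worth checking that each pair of indices actually points at the stated text and that the text matches its regex. The following is a minimal sketch of such a check. To keep it runnable without anonipy installed, it uses a hypothetical `Span` dataclass as a stand-in for `Entity` and a shortened sample text; the helper name `check_spans` is likewise an assumption and not part of the package.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for Entity, holding only the fields checked here."""

    text: str
    start_index: int
    end_index: int
    regex: str


def check_spans(text, spans):
    """Return the spans whose indices or regex do not agree with the text."""
    mismatched = []
    for span in spans:
        # the slice given by the indices must reproduce the entity text
        slice_ok = text[span.start_index:span.end_index] == span.text
        # the entity text must match its regex in full
        regex_ok = re.fullmatch(span.regex, span.text) is not None
        if not (slice_ok and regex_ok):
            mismatched.append(span)
    return mismatched


sample_text = "Patient Name: John Doe\nDate of Birth: 15-01-1985\n"
sample_spans = [
    Span("John Doe", 14, 22, ".*"),
    Span("15-01-1985", 38, 48, r"\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4}"),
]

# an empty list means every span lines up with the text
print(check_spans(sample_text, sample_spans))
```

The same slice comparison, `original_text[entity.start_index:entity.end_index] == entity.text`, can be applied to the entities defined above.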
RedactionStrategy
Data redaction is the process of obscuring information that’s personally identifiable, confidential, classified or sensitive.
The RedactionStrategy anonymizes the original text by replacing the entities in the text with a predefined substitute label, which defaults to [REDACTED].
Info
The redaction strategy hides sensitive information by replacing the original entities with a string that does not reveal any information about the original. While this is useful for obscuring information, it does change the text's distribution, which can affect the training of machine learning models.
from anonipy.anonymize.strategies import RedactionStrategy
redaction_strategy = RedactionStrategy(substitute_label="[REDACTED]")
Using the strategy, we can anonymize the text:
anonymized_text, replacements = redaction_strategy.anonymize(original_text, entities)
The anonymized text is:
print(anonymized_text)
Medical Record

Patient Name: [REDACTED]
Date of Birth: [REDACTED]
Date of Examination: [REDACTED]
Social Security Number: [REDACTED]

Examination Procedure:
[REDACTED] underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.

Next Examination Date:
[REDACTED]
And the associated replacements are:
replacements
[{'original_text': 'John Doe', 'label': 'name', 'start_index': 30, 'end_index': 38, 'anonymized_text': '[REDACTED]'},
 {'original_text': '15-01-1985', 'label': 'date of birth', 'start_index': 54, 'end_index': 64, 'anonymized_text': '[REDACTED]'},
 {'original_text': '20-05-2024', 'label': 'date', 'start_index': 86, 'end_index': 96, 'anonymized_text': '[REDACTED]'},
 {'original_text': '123-45-6789', 'label': 'social security number', 'start_index': 121, 'end_index': 132, 'anonymized_text': '[REDACTED]'},
 {'original_text': 'John Doe', 'label': 'name', 'start_index': 157, 'end_index': 165, 'anonymized_text': '[REDACTED]'},
 {'original_text': '15-11-2024', 'label': 'date', 'start_index': 717, 'end_index': 727, 'anonymized_text': '[REDACTED]'}]
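To make the mechanics of redaction concrete, the following is a minimal sketch of index-based span replacement. It is not the anonipy implementation: the function name `redact` and the use of plain `(start, end)` tuples are assumptions made for illustration.

```python
def redact(text, spans, substitute_label="[REDACTED]"):
    """Replace each (start, end) span in text with substitute_label.

    Illustrative only; not the anonipy implementation.
    """
    # work from the last span to the first so earlier indices stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + substitute_label + text[end:]
    return text


sample_text = "Patient Name: John Doe\nDate of Birth: 15-01-1985\n"
print(redact(sample_text, [(14, 22), (38, 48)]))
```

Replacing from the end of the text backwards matters because the substitute label is usually not the same length as the entity, so replacing an early span first would shift the indices of every later one.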
MaskingStrategy
Data masking refers to the disclosure of data with modified values. The data is anonymized by creating a mirror image of a database and applying alteration strategies such as character shuffling, encryption, and term or character substitution. For example, a character in a value may be replaced by a symbol such as “*” or “x”. This makes identification or reverse engineering difficult.
The MaskingStrategy anonymizes the original text by replacing the entities with masks, which are created using the substitute label, which defaults to *.
Info
The masking strategy is useful as it hides the original sensitive values and retains the original text's length. However, it also changes the original text's meaning and distribution, as the replacement values are not the same as the original values.
from anonipy.anonymize.strategies import MaskingStrategy
masking_strategy = MaskingStrategy(substitute_label="*")
Using the strategy, we can anonymize the text:
anonymized_text, replacements = masking_strategy.anonymize(original_text, entities)
The anonymized text is:
print(anonymized_text)
Medical Record

Patient Name: **** ***
Date of Birth: **********
Date of Examination: **********
Social Security Number: ***********

Examination Procedure:
**** *** underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.

Next Examination Date:
**********
And the associated replacements are:
replacements
[{'original_text': 'John Doe', 'label': 'name', 'start_index': 30, 'end_index': 38, 'anonymized_text': '**** ***'},
 {'original_text': '15-01-1985', 'label': 'date of birth', 'start_index': 54, 'end_index': 64, 'anonymized_text': '**********'},
 {'original_text': '20-05-2024', 'label': 'date', 'start_index': 86, 'end_index': 96, 'anonymized_text': '**********'},
 {'original_text': '123-45-6789', 'label': 'social security number', 'start_index': 121, 'end_index': 132, 'anonymized_text': '***********'},
 {'original_text': 'John Doe', 'label': 'name', 'start_index': 157, 'end_index': 165, 'anonymized_text': '**** ***'},
 {'original_text': '15-11-2024', 'label': 'date', 'start_index': 717, 'end_index': 727, 'anonymized_text': '**********'}]
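In the output above, whitespace inside an entity is kept ("John Doe" becomes "**** ***") while every other character, including hyphens, is masked. The following minimal sketch reproduces that behavior under the assumption that masking replaces each non-whitespace character. The function name `mask` and the `(start, end)` tuples are hypothetical, and this is not the anonipy implementation.

```python
import re


def mask(text, spans, substitute_label="*"):
    """Mask every non-whitespace character inside each (start, end) span.

    Illustrative only; not the anonipy implementation.
    """
    for start, end in sorted(spans, reverse=True):
        # a callable replacement avoids any special handling of backslashes
        masked = re.sub(r"\S", lambda _: substitute_label, text[start:end])
        text = text[:start] + masked + text[end:]
    return text


sample_text = "Patient Name: John Doe\nSocial Security Number: 123-45-6789\n"
masked_text = mask(sample_text, [(14, 22), (47, 58)])
print(masked_text)

# with a single-character label the text length is unchanged
print(len(masked_text) == len(sample_text))
```

With a single-character substitute label the masked text has the same length as the original, which is the property the Info note in this section refers to.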
PseudonymizationStrategy
Pseudonymization is a data de-identification tool that substitutes private identifiers with false identifiers or pseudonyms, such as swapping the “John Smith” identifier with the “Mark Spencer” identifier. It maintains statistical precision and data confidentiality, allowing changed data to be used for creation, training, testing, and analysis, while at the same time maintaining data privacy.
The PseudonymizationStrategy anonymizes the original text by replacing the entities with fake ones, which are created using the generators (see Generators).
Info
The pseudonymization strategy is the most useful in terms of retaining the statistical distributions of the text. However, it is also the most technical, as the user must define a function for mapping true entities to fake ones. Furthermore, if an entity appears multiple times, the pseudonymization strategy will retain the same mapping between the true and fake entities.
The PseudonymizationStrategy requires a function for mapping entities. In our example, we will define a function using the generators.
To make the example as accessible as possible, we will use the MaskLabelGenerator instead of the LLMLabelGenerator for generating string entities.
from anonipy.anonymize.generators import (
    MaskLabelGenerator,
    DateGenerator,
    NumberGenerator,
)
mask_generator = MaskLabelGenerator()
date_generator = DateGenerator()
number_generator = NumberGenerator()
Some weights of the model checkpoint at FacebookAI/xlm-roberta-large were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
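def anonymization_mapping(text, entity):
    # string entities (e.g. names) are replaced using the mask generator
    if entity.type == "string":
        return mask_generator.generate(entity, text)
    # examination dates use the "middle_of_the_month" option
    if entity.label == "date":
        return date_generator.generate(entity, output_gen="middle_of_the_month")
    # dates of birth use the "middle_of_the_year" option
    if entity.label == "date of birth":
        return date_generator.generate(entity, output_gen="middle_of_the_year")
    # social security numbers are replaced using the number generator
    if entity.label == "social security number":
        return number_generator.generate(entity)
    # fallback for any entity not handled above
    return "[REDACTED]"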
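Judging by the example output in this section (20-05-2024 becomes 15-05-2024, and 15-01-1985 becomes 01-07-1985), the two date options appear to snap a date to the 15th of its month and to the 1st of July of its year, respectively. The following sketch illustrates that reading with the standard library only. It is an assumption inferred from the output, not the DateGenerator implementation, and the function names and the fixed `%d-%m-%Y` format are hypothetical.

```python
from datetime import datetime


def middle_of_the_month(date_text, fmt="%d-%m-%Y"):
    """Move the date to the 15th of the same month (assumed behavior)."""
    parsed = datetime.strptime(date_text, fmt)
    return parsed.replace(day=15).strftime(fmt)


def middle_of_the_year(date_text, fmt="%d-%m-%Y"):
    """Move the date to the 1st of July of the same year (assumed behavior)."""
    parsed = datetime.strptime(date_text, fmt)
    return parsed.replace(month=7, day=1).strftime(fmt)


print(middle_of_the_month("20-05-2024"))
print(middle_of_the_year("15-01-1985"))
```

Coarsening dates in this way keeps the month or year, which is often enough for analysis, while hiding the exact day. A date that already falls on the 15th, such as 15-11-2024, is left unchanged by the first function.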
Let us initialize the strategy:
from anonipy.anonymize.strategies import PseudonymizationStrategy
pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)
Using the strategy, we can anonymize the text:
anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)
The anonymized text is:
print(anonymized_text)
Medical Record

Patient Name: first Professor
Date of Birth: 01-07-1985
Date of Examination: 15-05-2024
Social Security Number: 724-78-8182

Examination Procedure:
first Professor underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.

Next Examination Date:
15-11-2024
And the associated replacements are:
replacements
[{'original_text': 'John Doe', 'label': 'name', 'start_index': 30, 'end_index': 38, 'anonymized_text': 'first Professor'},
 {'original_text': '15-01-1985', 'label': 'date of birth', 'start_index': 54, 'end_index': 64, 'anonymized_text': '01-07-1985'},
 {'original_text': '20-05-2024', 'label': 'date', 'start_index': 86, 'end_index': 96, 'anonymized_text': '15-05-2024'},
 {'original_text': '123-45-6789', 'label': 'social security number', 'start_index': 121, 'end_index': 132, 'anonymized_text': '724-78-8182'},
 {'original_text': 'John Doe', 'label': 'name', 'start_index': 157, 'end_index': 165, 'anonymized_text': 'first Professor'},
 {'original_text': '15-11-2024', 'label': 'date', 'start_index': 717, 'end_index': 727, 'anonymized_text': '15-11-2024'}]
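Both occurrences of "John Doe" were replaced with the same value, which is the consistency the Info note in this section describes. One way such consistency could be achieved is by caching the first replacement generated for each entity. The sketch below illustrates the idea; the names `make_consistent_mapping` and `fake_generate` are hypothetical, and this is not necessarily how the PseudonymizationStrategy works internally.

```python
from itertools import count


def make_consistent_mapping(generate):
    """Wrap generate so repeated (text, label) pairs reuse the first replacement."""
    cache = {}

    def mapping(text, label):
        key = (text, label)
        if key not in cache:
            # only call the generator the first time an entity is seen
            cache[key] = generate(text, label)
        return cache[key]

    return mapping


counter = count(1)


def fake_generate(text, label):
    """Hypothetical generator that returns a new value on every call."""
    return f"{label}_{next(counter)}"


mapping = make_consistent_mapping(fake_generate)
first = mapping("John Doe", "name")
second = mapping("John Doe", "name")
other = mapping("Jane Roe", "name")
print(first, second, other)
```

Without the cache, a generator that returns a different value on each call would give the same person two different pseudonyms within one document, making the anonymized text harder to read and analyze.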