Overview

This notebook provides an overview of the package and its functionality.

In [1]:

# used to hide warnings
import warnings

warnings.filterwarnings("ignore")

Let us first define the text that we will use to showcase the package's functionality.

In [2]:

original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

Extract personal information from text

The anonipy package implements entity extraction components that can be used to extract personal information from text.

More details can be found in the Extractors chapter.

Language detector

In [3]:

from anonipy.utils.language_detector import LanguageDetector

lang_detector = LanguageDetector()

In [4]:

# identify the language of the original text
language = lang_detector(original_text)
language

Out[4]:

('en', 'English')

Extract personal information

In [5]:

from anonipy.anonymize.extractors import EntityExtractor

In [6]:

# define the labels to be extracted and anonymized
labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
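
The custom label above pairs a label name with a regular expression. As a quick sanity check that does not depend on anonipy, we can try the same pattern with Python's standard re module. This is only a sketch: the sample lines are copied from the medical record, and the variable names are illustrative.

```python
import re

# the pattern used by the "social security number" label above
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# two lines copied from the example medical record
ssn_line = "Social Security Number: 123-45-6789"
date_line = "Date of Birth: 15-01-1985"

# the pattern should find the number and ignore the date
ssn_match = ssn_pattern.search(ssn_line)
date_match = ssn_pattern.search(date_line)

print(ssn_match.group(0))
print(date_match)
```

The date is not matched because the pattern requires three digits before the first hyphen.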

In [7]:

# language taken from the language detector
entity_extractor = EntityExtractor(labels, lang=language, score_th=0.5)

In [8]:

# extract the entities from the original text
doc, entities = entity_extractor(original_text)

In [9]:

# display the entities in the original text
entity_extractor.display(doc)

Medical Record

Patient Name: John Doe [name]
Date of Birth: 15-01-1985 [date of birth]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789 [social security number]

Examination Procedure:
John Doe [name] underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]

The metadata of the extracted entities is available in the entities variable:

In [10]:

entities

Out[10]:

[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'),
 Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'),
 Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})')]
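
The start_index and end_index values listed above appear to be character offsets into the original text, with end_index exclusive. The sketch below checks this with plain string slicing; only the first lines of the record are reproduced so that the snippet runs on its own, and the variable names are illustrative.

```python
# the first lines of the example record, reproduced so the snippet is self-contained
header = (
    "Medical Record\n"
    "\n"
    "Patient Name: John Doe\n"
    "Date of Birth: 15-01-1985\n"
    "Date of Examination: 20-05-2024\n"
    "Social Security Number: 123-45-6789\n"
)

# (start_index, end_index) pairs copied from the first four entities above
spans = [(30, 38), (54, 64), (86, 96), (121, 132)]

# slicing with the two offsets recovers the text of each entity
recovered = [header[start:end] for start, end in spans]
for (start, end), text in zip(spans, recovered):
    print(start, end, repr(text))
```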

Anonymize the original text

The anonipy package implements generators for different types of information, which can be used to generate replacements for the entities found in the original text.

More on generators can be found in the Generators chapter, while the Strategies chapter describes the strategies for anonymizing the original text.

Prepare generators for generating replacements

In [11]:

from anonipy.anonymize.generators import (
    LLMLabelGenerator,
    DateGenerator,
    NumberGenerator,
)

In [12]:

# initialize the generators
llm_generator = LLMLabelGenerator()
date_generator = DateGenerator()
number_generator = NumberGenerator()

Loading checkpoint shards: 100%|██████████| 4/4 [00:21<00:00,  5.44s/it]

In [13]:

# prepare the anonymization mapping
def anonymization_mapping(text, entity):
    if entity.type == "string":
        return llm_generator.generate(entity, temperature=0.7)
    if entity.label == "date":
        return date_generator.generate(entity, output_gen="middle_of_the_month")
    if entity.label == "date of birth":
        return date_generator.generate(entity, output_gen="middle_of_the_year")
    if entity.label == "social security number":
        return number_generator.generate(entity)
    return "[REDACTED]"
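
To give an intuition for the two output_gen options used in the mapping, the sketch below mimics them with Python's standard datetime module. It is not anonipy's implementation: the two helper functions are hypothetical, and their behavior (moving the day to the 15th, or moving the date to 1 July) is only inferred from the anonymized output shown later in this notebook.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # day-month-year, as used in the example record


def middle_of_the_month(date_text):
    """Keep the month and year, move the day to the 15th (hypothetical helper)."""
    date = datetime.strptime(date_text, DATE_FORMAT)
    return date.replace(day=15).strftime(DATE_FORMAT)


def middle_of_the_year(date_text):
    """Keep the year, move the date to 1 July (hypothetical helper)."""
    date = datetime.strptime(date_text, DATE_FORMAT)
    return date.replace(day=1, month=7).strftime(DATE_FORMAT)


print(middle_of_the_month("20-05-2024"))
print(middle_of_the_year("15-01-1985"))
```

Under these assumptions, a date that already falls on the 15th is left unchanged by the first helper.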

Anonymize the original text

In [14]:

from anonipy.anonymize.strategies import PseudonymizationStrategy

In [15]:

# initialize the pseudonymization strategy
pseudo_strategy = PseudonymizationStrategy(mapping=anonymization_mapping)

In [16]:

# anonymize the original text
anonymized_text, replacements = pseudo_strategy.anonymize(original_text, entities)

The anonymized text is:

In [17]:

print(anonymized_text)

Medical Record

Patient Name: Ethan Lane
Date of Birth: 01-07-1985
Date of Examination: 15-05-2024
Social Security Number: 588-85-9388

Examination Procedure:
Ethan Lane underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024

And the associated replacements are:

In [18]:

replacements

Out[18]:

[{'original_text': '15-11-2024',
  'label': 'date',
  'start_index': 717,
  'end_index': 727,
  'anonymized_text': '15-11-2024'},
 {'original_text': 'John Doe',
  'label': 'name',
  'start_index': 157,
  'end_index': 165,
  'anonymized_text': 'Ethan Lane'},
 {'original_text': '123-45-6789',
  'label': 'social security number',
  'start_index': 121,
  'end_index': 132,
  'anonymized_text': '588-85-9388'},
 {'original_text': '20-05-2024',
  'label': 'date',
  'start_index': 86,
  'end_index': 96,
  'anonymized_text': '15-05-2024'},
 {'original_text': '15-01-1985',
  'label': 'date of birth',
  'start_index': 54,
  'end_index': 64,
  'anonymized_text': '01-07-1985'},
 {'original_text': 'John Doe',
  'label': 'name',
  'start_index': 30,
  'end_index': 38,
  'anonymized_text': 'Ethan Lane'}]
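
Note that the replacements above are listed from the last span to the first. One plausible reason is that a replacement may differ in length from the text it replaces ("Ethan Lane" is longer than "John Doe"), so working backwards keeps the remaining offsets valid. The sketch below illustrates the idea on the first lines of the record; apply_replacements is a hypothetical helper, and anonipy's own implementation may differ.

```python
def apply_replacements(text, replacements):
    """Replace each [start_index, end_index) span, working from the end of the text."""
    ordered = sorted(replacements, key=lambda r: r["start_index"], reverse=True)
    for item in ordered:
        before = text[: item["start_index"]]
        after = text[item["end_index"] :]
        text = before + item["anonymized_text"] + after
    return text


# the first lines of the example record, reproduced so the snippet is self-contained
header = "Medical Record\n\nPatient Name: John Doe\nDate of Birth: 15-01-1985\n"

sample_replacements = [
    {"start_index": 30, "end_index": 38, "anonymized_text": "Ethan Lane"},
    {"start_index": 54, "end_index": 64, "anonymized_text": "01-07-1985"},
]

result = apply_replacements(header, sample_replacements)
print(result)
```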

Fixing the anonymized text

If the anonymized text is not suitable, we can fix it with the anonymize function from the anonipy.anonymize module. To do this, let us define a new set of replacements.

We can edit existing replacements by changing the anonymized_text value, remove the ones that are not suitable, and add missing ones.

Note that the new set does not require the original_text and label values.

In [19]:

new_replacements = [
    {
        "start_index": 30,
        "end_index": 38,
        "anonymized_text": "Mark Strong",
    },
    {
        "original_text": "20-05-2024",
        "label": "date",
        "start_index": 86,
        "end_index": 96,
        "anonymized_text": "18-05-2024",
    },
    {
        "original_text": "123-45-6789",
        "label": "social security number",
        "start_index": 121,
        "end_index": 132,
        "anonymized_text": "119-88-7014",
    },
    {
        "original_text": "John Doe",
        "label": "name",
        "start_index": 157,
        "end_index": 165,
        "anonymized_text": "Mark Strong",
    },
]
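
Since hand-edited replacements rely on exact character offsets, it can help to check them before use. The sketch below is one possible check, not part of anonipy: check_replacements is a hypothetical helper that reports spans that fall outside the text, overlap each other, or do not match the optional original_text value.

```python
def check_replacements(text, replacements):
    """Return a list of problems found in a replacement list (hypothetical helper)."""
    problems = []
    previous_end = 0
    for item in sorted(replacements, key=lambda r: r["start_index"]):
        start, end = item["start_index"], item["end_index"]
        if not 0 <= start < end <= len(text):
            problems.append(f"span {start}-{end} is outside the text")
            continue
        if start < previous_end:
            problems.append(f"span {start}-{end} overlaps the previous span")
        expected = item.get("original_text")
        if expected is not None and text[start:end] != expected:
            problems.append(f"span {start}-{end} holds {text[start:end]!r}, not {expected!r}")
        previous_end = max(previous_end, end)
    return problems


# the first lines of the example record, reproduced so the snippet is self-contained
header = (
    "Medical Record\n\nPatient Name: John Doe\n"
    "Date of Birth: 15-01-1985\nDate of Examination: 20-05-2024\n"
)

good = [
    {"start_index": 30, "end_index": 38, "anonymized_text": "Mark Strong"},
    {"original_text": "20-05-2024", "start_index": 86, "end_index": 96,
     "anonymized_text": "18-05-2024"},
]
# same date entry, but with both offsets shifted by one character
shifted = [
    {"original_text": "20-05-2024", "start_index": 85, "end_index": 95,
     "anonymized_text": "18-05-2024"},
]

print(check_replacements(header, good))
print(check_replacements(header, shifted))
```

Keeping the original_text value in an entry is optional, but it gives such a check something to compare the offsets against.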

Now, let us anonymize the original text using the new replacements.

In [20]:

from anonipy.anonymize import anonymize

In [21]:

# anonymize the original text using the new replacements
anonymized_text, replacements = anonymize(original_text, new_replacements)

In [22]:

print(anonymized_text)

Medical Record

Patient Name: Mark Strong
Date of Birth: 15-01-1985
Date of Examination: 18-05-2024
Social Security Number: 119-88-7014

Examination Procedure:
Mark Strong underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
