Extractors

This chapter showcases how to use the label extractors in the package.

The label extractors are used to extract relevant named entities from text. These entities can be names of people, organizations, addresses, social security numbers, and so on. The entities are then used to anonymize the text.

In [1]:

# used to hide warnings
import warnings

warnings.filterwarnings("ignore")

Let us first define the text from which we want to extract the entities.

In [2]:

original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

Language configuration

First, we must specify the language that the text is written in. We can do this manually or by using a language detector.

Manual selection

One option, when all of the texts are in the same language, is to specify the language manually. The anonipy package provides a constant called LANGUAGES in the constants submodule, which contains all the supported languages. The format of the language codes can be found in the constants module.

Since the original_text is in English, we will use the LANGUAGES.ENGLISH predefined constant.

In [3]:

from anonipy.constants import LANGUAGES

In [4]:

LANGUAGES.ENGLISH

Out[4]:

('en', 'English')

Using language detector

An alternative is to use the language detector available in the anonipy package. The language detector is built on the lingua Python package and automatically detects the language of the text.

In [5]:

from anonipy.utils.language_detector import LanguageDetector

Initialize the language detector and use it to automatically detect the language of the text.

In [6]:

lang_detector = LanguageDetector()
lang_detector(original_text)

Out[6]:

('en', 'English')

Using extractors

Initialization

We can now initialize the label extractors. This is done using the EntityExtractor class found in the anonipy.anonymize.extractors submodule.

Info

The EntityExtractor class is built on the GLiNER models, specifically the one fine-tuned for recognizing Personally Identifiable Information (PII) within text. The model has been fine-tuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but can also be applied to other languages.

In [7]:

from anonipy.anonymize.extractors import EntityExtractor

The EntityExtractor class takes the following arguments:

labels: A list of dictionaries containing the labels to be extracted.
lang: The language of the text to be anonymized. Defaults to LANGUAGES.ENGLISH.
score_th: The score threshold used to filter the labels, i.e. the entity has to have a score greater than score_th to be considered. Defaults to 0.5.
use_gpu: Whether to use the GPU. Defaults to False.
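
To make the score_th argument concrete, here is a minimal sketch of threshold filtering in plain Python. It does not call anonipy: the candidate tuples and the helper name filter_by_score are made up for illustration, and the strict "greater than" comparison simply follows the description above.

```python
# Hypothetical candidates as (text, label, score); not produced by anonipy.
candidates = [
    ("John Doe", "name", 0.99),
    ("routine", "name", 0.12),
    ("15-11-2024", "date", 0.83),
]


def filter_by_score(items, score_th=0.5):
    """Keep only candidates whose score is strictly greater than score_th."""
    return [item for item in items if item[2] > score_th]


kept = filter_by_score(candidates, score_th=0.5)
print(kept)
```

Raising the threshold keeps fewer, more confident candidates; lowering it keeps more candidates at the cost of more false positives.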

We must now define the labels to be extracted. In this example, we will extract people's names, dates, and the social security number from the text.

In [8]:

labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

Let us now initialize the entity extractor.

Info

The initialization of EntityExtractor will emit some warnings. They can be ignored; they are expected and originate from the package's dependencies.

In [9]:

entity_extractor = EntityExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)

Entity extraction

The EntityExtractor receives the text to be anonymized and returns the enriched text document and the extracted entities.

In [10]:

doc, entities = entity_extractor(original_text)

The entities extracted within the input text are:

In [11]:

entity_extractor.display(doc)

Medical Record

Patient Name: John Doe name
Date of Birth: 15-01-1985 date of birth
Date of Examination: 20-05-2024 date
Social Security Number: 123-45-6789 social security number

Examination Procedure:
John Doe name underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 date

The extracted entities are stored in the entities variable. Each entity contains the following information:

text: The text of the entity.
label: The label of the entity.
start_index: The start index of the entity in the text.
end_index: The end index of the entity in the text.
score: The score of the entity. It shows how certain the model is that the entity is relevant.
type: The type of the entity (taken from the defined labels variable list).
regex: The regular expression the entity must match.

In [12]:

entities

Out[12]:

[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'),
 Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'),
 Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'),
 Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})')]
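
The index fields can be used to slice the original text and to replace the entities with placeholders. The sketch below uses a small stand-in dataclass (Span, a hypothetical name that mirrors only a few of the fields listed above) so that it runs without anonipy. It assumes end_index is exclusive, which is consistent with the output above (38 minus 30 is the length of 'John Doe'). The mask helper is also hypothetical and is not the anonymization strategy provided by the package.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in mirroring a few Entity fields."""

    text: str
    label: str
    start_index: int
    end_index: int


# The first lines of the medical record used in this chapter.
sample = "Medical Record\n\nPatient Name: John Doe\nDate of Birth: 15-01-1985\n"
spans = [
    Span("John Doe", "name", 30, 38),
    Span("15-01-1985", "date of birth", 54, 64),
]

# Each span's text should equal the corresponding slice of the source text.
for span in spans:
    assert sample[span.start_index:span.end_index] == span.text


def mask(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left
    so that the indices of earlier spans stay valid after each replacement."""
    for span in sorted(spans, key=lambda s: s.start_index, reverse=True):
        placeholder = f"[{span.label.upper()}]"
        text = text[:span.start_index] + placeholder + text[span.end_index:]
    return text


masked = mask(sample, spans)
print(masked)
```

Processing the spans from the end of the text backwards avoids having to recompute offsets when a placeholder is shorter or longer than the text it replaces.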

Advice and suggestions

Use specific label names. In the above example, we used specific label names to extract the entities. If we use a less specific name, the entity extractor might not find any relevant entity.

For instance, when using social security number as the label name, the entity extractor is able to extract the social security number from the text. However, if we use ssn or just number as the label name, the entity extractor might not find any relevant entity.

Tip

Using more specific label names is better.

Use custom regex patterns. In the anonipy package, we provide some predefined entity types, which are:

string. Extracts a string from the text.
integer. Extracts an integer from the text.
float. Extracts a float from the text.
date. Extracts a date from the text.
email. Extracts an email address from the text.
phone_number. Extracts a phone number from the text.
website_url. Extracts a URL from the text.

These entity types also have a corresponding regex pattern, as defined in the anonipy.anonymize.regex submodule.

In [13]:

from anonipy.anonymize.regex import regex_map

# entity_type avoids shadowing the built-in `type`
for entity_type in [
    "string",
    "integer",
    "float",
    "date",
    "email",
    "phone_number",
    "website_url",
]:
    print(f"{entity_type:<13}: {regex_map(entity_type)}")

string       : .*
integer      : \d+
float        : [\d\.,]+
date         : (\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})
email        : [a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
phone_number : [(]?[\+]?[(]?[0-9]{1,3}[)]?[-\s\.]?([0-9]{2,}[-\s\.]?){2,}([0-9]{3,})
website_url  : ((https?|ftp|smtp):\/\/)?(www.)?([a-zA-Z0-9]+\.)+[a-z]{2,}(\/[a-zA-Z0-9#\?\_\.\=\-\&]+|\/?)*

If the user wants to use a custom regex pattern, they can define it in the labels variable list. Using a custom regex pattern allows the user to specify a stricter pattern that the entity must match.
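
As a rough illustration of why a stricter pattern helps, the sketch below checks two strings against the string and date patterns printed above and against the social security number pattern used in this chapter. It relies only on Python's re module. How anonipy applies these patterns internally is not shown here; the use of re.fullmatch and the helper name matches are assumptions made for the illustration.

```python
import re

# Patterns copied from this chapter.
STRING_PATTERN = r".*"
DATE_PATTERN = (
    r"(\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})"
    r"|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})"
)
SSN_PATTERN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"


def matches(pattern, value):
    """Return True when the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


for value in ["123-45-6789", "15-01-1985"]:
    print(
        f"{value:<12} string={matches(STRING_PATTERN, value)} "
        f"date={matches(DATE_PATTERN, value)} "
        f"ssn={matches(SSN_PATTERN, value)}"
    )
```

The string pattern accepts both values, whereas the social security number pattern accepts only the value with the 3-2-4 digit layout.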

The custom regex can be specified in the following way:

In [14]:

labels = [
    {"label": "name", "type": "string"},
    # using the custom regex pattern: type must be 'custom' and specify the regex pattern in the 'regex' key
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

Let's rerun the above example:

In [15]:

# ignore the warnings: these are expected due to the use of package dependencies
entity_extractor = EntityExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)

In [16]:

doc, entities = entity_extractor(original_text)

The extracted entities are the same as before; to be extracted, the social security number also had to match the custom regex pattern.

In [17]:

entity_extractor.display(doc)

Medical Record

Patient Name: John Doe name
Date of Birth: 15-01-1985 date of birth
Date of Examination: 20-05-2024 date
Social Security Number: 123-45-6789 social security number

Examination Procedure:
John Doe name underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 date

Creating custom extractors

The user can develop their own custom extractor. To do this, the custom extractor must inherit from the ExtractorInterface class.

The extractor must have two methods defined: __init__ and __call__.

An example of a custom extractor that extracts only a specific regex pattern from the text is shown below:

In [20]:

import re
from anonipy.anonymize.extractors import ExtractorInterface
from anonipy.definitions import Entity


class CustomExtractor(ExtractorInterface):

    def __init__(self):
        # the custom extractor will retrieve entities that follow the regex pattern
        self.regex_pattern = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")

    def __call__(self, text: str) -> tuple[str, list[Entity]]:
        entities = []
        for match in re.finditer(self.regex_pattern, text):
            entities.append(
                Entity(
                    text=match.group(),
                    label="date",
                    start_index=match.start(),
                    end_index=match.end(),
                    score=1.0,
                    type="date",
                    regex=self.regex_pattern,
                )
            )
        return text, entities

In [21]:

custom_extractor = CustomExtractor()
_, entities = custom_extractor(original_text)

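Let us output the extracted entities. Note that the third entity, 23-45-6789, is actually a part of the social security number 123-45-6789: nothing in the regex pattern prevents a match from starting in the middle of a longer number.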

In [22]:

entities

Out[22]:

[Entity(text='15-01-1985', label='date', start_index=54, end_index=64, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='23-45-6789', label='date', start_index=122, end_index=132, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')),
 Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}'))]
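
One way to avoid the partial match on the social security number is to forbid digits immediately before and after the match. The following is a minimal sketch using only Python's re module. The stricter pattern is a suggestion and not part of anonipy, and the sample text is a shortened copy of the record used in this chapter.

```python
import re

sample = (
    "Date of Birth: 15-01-1985\n"
    "Social Security Number: 123-45-6789\n"
    "Next Examination Date:\n"
    "15-11-2024\n"
)

# The pattern used by CustomExtractor above.
loose = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")
# Lookarounds reject matches that are glued to another digit.
strict = re.compile(r"(?<!\d)\d{1,2}-\d{1,2}-\d{2,4}(?!\d)")

loose_hits = [m.group() for m in loose.finditer(sample)]
strict_hits = [m.group() for m in strict.finditer(sample)]
print(loose_hits)
print(strict_hits)
```

The stricter pattern could be passed to re.compile in the __init__ method of the custom extractor in place of the original one.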
