Extractors overview

This post gives an overview of the implemented extractors. Extractors identify relevant named entities in text, such as people's names, organizations, addresses, and social security numbers. The extracted entities are then used to anonymize the text.

All extractors and their API references are available in the extractors module. The sections below present the different extractors anonipy provides.

Pre-requisites

Let us first define the text, from which we want to extract the entities.

original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""

Language configuration

Each extractor must be configured with a language, which determines how the text is processed. If no language is specified, the extractor uses the default language, ENGLISH.

To make it easier to switch languages, we can use the LANGUAGES constant.

from anonipy.constants import LANGUAGES

LANGUAGES.ENGLISH  # returns the ("en", "English") tuple, the format the extractors require

Using the language detector

An alternative is to use the language detector available in the language_detector module. The detector uses the lingua Python package to detect the language of the text automatically.

from anonipy.utils.language_detector import LanguageDetector

# initialize the language detector and detect the language
language_detector = LanguageDetector()
language_detector(original_text)  # returns ("en", "English"), the same format as LANGUAGES.ENGLISH

Named Entity

Each extractor extracts named entities from the text. Every entity is represented by the Entity dataclass, which has the following attributes:

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | str | The text of the entity. |
| label | str | The label of the entity. |
| start_index | int | The start index of the entity in the text. |
| end_index | int | The end index of the entity in the text. |
| score | float | The prediction score of the entity. The score is returned by the extractor models. |
| type | ENTITY_TYPES | The type of the entity. |
| regex | Union[str, Pattern] | The regular expression the entity must match. |

get_regex_group()

Returns:

| Type | Description |
| --- | --- |
| Union[str, None] | The regex group. |
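
To make the index fields concrete, here is a minimal sketch built on a hypothetical stand-in dataclass (EntitySketch, which is not anonipy's Entity) that carries a subset of the attributes listed above. It assumes start_index and end_index are character offsets with an exclusive end, so that slicing the source text returns the entity text. That convention is an assumption made for illustration and is not stated in the attribute list.

```python
from dataclasses import dataclass


@dataclass
class EntitySketch:
    """Hypothetical stand-in mirroring some of the documented Entity attributes."""

    text: str
    label: str
    start_index: int
    end_index: int
    score: float = 1.0


sample = "Patient Name: John Doe"
start = sample.index("John Doe")
entity = EntitySketch(
    text="John Doe",
    label="name",
    start_index=start,
    end_index=start + len("John Doe"),  # assumed exclusive end offset
)

# Under the assumed convention, slicing the source recovers the entity text.
recovered = sample[entity.start_index : entity.end_index]
print(recovered)
```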

Extractors

All of the following extractors are available in the extractors module.

Named entity recognition (NER) extractor

The NERExtractor uses GLiNER, a span-based NER model, to identify the relevant entities in the text. By default, it uses the GLiNER model fine-tuned for recognizing Personally Identifiable Information (PII). The model was fine-tuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but it can also be applied to other languages.

from anonipy.anonymize.extractors import NERExtractor

The NERExtractor takes the following input parameters:

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| labels | List[dict] | The list of labels to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| score_th | float | The score threshold. Entities with a score below this threshold will be ignored. | 0.5 |
| use_gpu | bool | Whether to use GPU. | False |
| gliner_model | str | The gliner model to use to identify the entities. | 'urchade/gliner_multi_pii-v1' |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: ent or span. | 'ent' |

We must define the labels to be extracted and their types. In this example, we will extract the following entities:

labels = [
    {"label": "name", "type": "string"},
    {"label": "social security number", "type": "custom", "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}"},
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]

Let us now initialize the entity extractor.

ner_extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)

Initialization warnings

Initializing the NERExtractor emits some warnings. They are expected, since they originate from package dependencies, and can be safely ignored.

The NERExtractor receives the text to be anonymized and returns the enriched text document and the extracted entities.

doc, entities = ner_extractor(original_text)

The entities extracted within the input text are:

ner_extractor.display(doc)
Medical Record

Patient Name: John Doe [name]
Date of Birth: 15-01-1985 [date of birth]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789 [social security number]

Examination Procedure:
John Doe [name] underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]

Advice and suggestions

Use specific label names. In the above example, we used specific label names to extract the entities. If we use a less specific name, the entity extractor might not find any relevant entity.

For instance, when using social security number as the label name, the entity extractor is able to extract the social security number from the text. However, if we use ssn or just number as the label name, the entity extractor might not find any relevant entity.

Tip

Using more specific label names is better.

Use custom regex patterns. In the anonipy package, we provide some predefined ENTITY_TYPES, which are:

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| CUSTOM | Literal[custom] | The custom entity type. |
| STRING | Literal[string] | The string entity type. |
| INTEGER | Literal[integer] | The integer entity type. |
| FLOAT | Literal[float] | The float entity type. |
| DATE | Literal[date] | The date entity type. |
| EMAIL | Literal[email] | The email entity type. |
| WEBSITE_URL | Literal[website_url] | The website url entity type. |
| PHONE_NUMBER | Literal[phone_number] | The phone number entity type. |

These entity types also have a corresponding regex pattern, as defined in the regex submodule.

To use a custom regex pattern, define it in the labels list, as done above for the social security number label. A custom regex pattern lets the user specify a stricter pattern that the entity must match.
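
Before passing a custom pattern to an extractor, it can help to try it out with Python's standard re module. The sketch below only exercises the social security number pattern from the labels list above against a few made-up candidate strings. It does not show how anonipy applies the pattern internally.

```python
import re

# The custom pattern used for the "social security number" label above.
ssn_pattern = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Made-up candidates: one well-formed value and two malformed ones.
candidates = ["123-45-6789", "123456789", "12-345-6789"]

# fullmatch requires the whole candidate to match the pattern.
valid = [c for c in candidates if ssn_pattern.fullmatch(c)]
print(valid)
```

Only the first candidate passes, which is the kind of strictness a custom regex adds on top of the label name.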

Pattern extractor

The PatternExtractor is an extractor that uses custom spacy and regex patterns to extract entities. It is useful when documents have a consistent format and structure, because the same patterns then extract the entities reliably from every document.

from anonipy.anonymize.extractors import PatternExtractor

The PatternExtractor takes the following parameters:

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| labels | List[dict] | The list of labels and patterns to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: ent or span. | 'ent' |

We must define the labels and their patterns used to extract the relevant entities. The patterns are defined using spacy patterns or regex patterns.

In this example, we will use the following labels and patterns:

labels = [
    # the pattern is defined using regex patterns, where the parentheses are used to indicate core entity values
    {"label": "symptoms", "regex": r"\((.*)\)"},
    # the pattern is defined using spacy patterns
    {
        "label": "medicine",
        "pattern": [[{"IS_ALPHA": True}, {"LIKE_NUM": True}, {"LOWER": "mg"}]],
    },
    # the pattern is defined using spacy patterns
    {
        "label": "date",
        "pattern": [
            [
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dddd"},
            ]
        ],
    },
]
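
The role of the capture group in the symptoms regex can be seen with the standard re module alone. In this sketch, which is independent of anonipy, group 0 is the whole match including the literal parentheses, while group 1 is the core value captured inside them. The sentence is shortened from the example text.

```python
import re

sentence = (
    "The procedure included measuring vital signs "
    "(blood pressure, heart rate, temperature), a comprehensive blood panel."
)

# \( and \) match literal parentheses; (.*) captures what lies between them.
match = re.search(r"\((.*)\)", sentence)

full_match = match.group(0)  # includes the surrounding parentheses
core_value = match.group(1)  # only the text inside the parentheses
print(core_value)
```

Note that `.*` is greedy, so on a line with several parenthesized groups it would capture everything from the first opening to the last closing parenthesis.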

Let us now initialize the pattern extractor.

pattern_extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)

The PatternExtractor receives the original text and returns the enriched text document and the extracted entities.

doc, entities = pattern_extractor(original_text)

The entities extracted within the input text are:

pattern_extractor.display(doc)
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985 [date]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789

Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature [symptoms]), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.

Medication Prescribed:

Ibuprofen 200 mg [medicine]: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg [medicine]: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]
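
For comparison, the three date matches above can be reproduced with a plain regular expression. This is only a rough regex analogue of the spacy token pattern for the date label, not what the PatternExtractor does internally. The excerpt is copied from the example text.

```python
import re

excerpt = """\
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Next Examination Date:
15-11-2024
"""

# Two digits, a dash, two digits, a dash, four digits, on word boundaries.
date_regex = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")
dates = date_regex.findall(excerpt)
print(dates)
```

The social security number is not matched, because its digit groups have lengths 3-2-4 rather than 2-2-4.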

Multi extractor

The MultiExtractor is an extractor that combines multiple extractors to extract entities.

The motivation behind the multi extractor is the following: depending on the document format, some personal information always appears at the same place and in the same format, while other information can appear at different places and in different formats. We therefore need the NERExtractor to automatically identify the entities that appear at varying locations, and the PatternExtractor to extract the entities that always appear at the same location.

MultiExtractor enables the use of both extractors at the same time. Furthermore, if both extractors identify entities at similar locations, then the MultiExtractor will also provide a list of joint entities.

from anonipy.anonymize.extractors import MultiExtractor

The MultiExtractor takes the following parameters:

Parameters:

Name Type Description Default
extractors List[ExtractorInterface]

The list of extractors to use.

required

In this example, we will use the previously initialized NER and pattern extractors.

multi_extractor = MultiExtractor(
  extractors=[ner_extractor, pattern_extractor],
)

As before, the MultiExtractor receives the original text, but it returns the outputs of all the extractors, as well as the joint entities from all the extractors.

extractor_outputs, joint_entities = multi_extractor(original_text)

In this case, extractor_outputs[0] will contain the (doc, entities) from the NER extractor, and extractor_outputs[1] will contain the (doc, entities) from the pattern extractor. The joint_entities will contain the joint entities from all the extractors.
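
One way to picture joint entities is as the result of merging spans that overlap across extractors. The sketch below illustrates that idea with a hypothetical helper (merge_overlapping) and made-up character offsets. It is an assumption for illustration only and does not reproduce anonipy's actual merging logic.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), with an exclusive end


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans, keeping the label of the span that sorts first."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged


# Made-up offsets standing in for the outputs of two extractors.
ner_spans = [(14, 22, "name"), (38, 48, "date of birth")]
pattern_spans = [(38, 48, "date"), (60, 70, "date")]

joint = merge_overlapping(ner_spans + pattern_spans)
print(joint)
```

Here the two spans covering offsets 38 to 48 collapse into a single joint span, while the non-overlapping spans are kept unchanged.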

Conclusion

The extractors are used to extract entities from the text. The anonipy package supports both machine learning-based and pattern-based entity extraction, enabling information identification and extraction from different textual formats.