Extractors overview
In this post, we give an overview of the implemented extractors. The extractors are used to extract relevant named entities from text. These entities can be people's names, organizations, addresses, social security numbers, etc. The extracted entities are then used to anonymize the text.
All extractors and their API references are available in the extractors module. What follows is a presentation of the different extractors anonipy provides.
Pre-requisites
Let us first define the text from which we want to extract the entities.
```python
original_text = """\
Medical Record
Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
```
Language configuration
Each extractor requires a language to be configured. The language is used to determine how to process the text. If the language is not specified, the extractor will use the default language, which is ENGLISH.
To make it easier to switch languages, we can use the LANGUAGES constant. For example, LANGUAGES.ENGLISH returns the ("en", "English") literal tuple, which is the format required by the extractors.
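As an illustration, a minimal sketch of selecting a language (assuming the LANGUAGES constant is exposed in anonipy.constants):

```python
from anonipy.constants import LANGUAGES

# the literal tuple ("en", "English"), the format the extractors expect
lang = LANGUAGES.ENGLISH
```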
Using the language detector
An alternative is to use the language detector available in the language_detector module. The detector utilizes the lingua python package and allows automatic detection of the language of the text.

```python
from anonipy.utils.language_detector import LanguageDetector

# initialize the language detector and detect the language
language_detector = LanguageDetector()
language_detector(original_text)
```

The language_detector returns the literal tuple ("en", "English"), similar to LANGUAGES.ENGLISH, making it compatible with the extractors.
Named Entity
Each extractor extracts named entities from the text. The entities can be people's names, organizations, addresses, social security numbers, etc. The entities are represented using the Entity dataclass, which consists of the following attributes:
Attributes:

| Name | Type | Description |
|---|---|---|
| text | str | The text of the entity. |
| label | str | The label of the entity. |
| start_index | int | The start index of the entity in the text. |
| end_index | int | The end index of the entity in the text. |
| score | float | The prediction score of the entity. The score is returned by the extractor models. |
| type | ENTITY_TYPES | The type of the entity. |
| regex | Union[str, Pattern] | The regular expression the entity must match. |
The Entity dataclass also provides the get_regex_group() method.

Returns:

| Type | Description |
|---|---|
| Union[str, None] | The regex group. |
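For illustration only, an extracted entity might look like the following sketch; the import path and the exact field values are assumptions based on the attribute table above:

```python
from anonipy.definitions import Entity  # assumed import path

# a hypothetical entity for the patient name in the example text
# (character offsets and score are illustrative values)
entity = Entity(
    text="John Doe",
    label="name",
    start_index=29,
    end_index=37,
    score=1.0,
    type="string",
    regex=".*",
)
```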
Extractors
All following extractors are available in the extractors module.
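For reference, a hedged import sketch; the module path anonipy.anonymize.extractors is an assumption based on the package layout:

```python
# import the extractors described below
from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
```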
Named entity recognition (NER) extractor
The NERExtractor uses a span-based NER model to identify the relevant entities in the text. Specifically, it uses the GLiNER span-based NER model finetuned for recognizing Personal Identifiable Information (PII) within text. The model has been finetuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but can also be applied to other languages.
The NERExtractor takes the following input parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| labels | List[dict] | The list of labels to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| score_th | float | The score threshold. Entities with a score below this threshold will be ignored. | 0.5 |
| use_gpu | bool | Whether to use GPU. | False |
| gliner_model | str | The gliner model to use to identify the entities. | 'urchade/gliner_multi_pii-v1' |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: | 'ent' |
We must define the labels to be extracted and their types. In this example, we will extract the following entities:
```python
labels = [
    {"label": "name", "type": "string"},
    {"label": "social security number", "type": "custom", "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}"},
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
```
Let us now initialize the entity extractor.
Initialization warnings
The initialization of NERExtractor will throw some warnings. Ignore them. These are expected due to the use of package dependencies.
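A minimal initialization sketch, assuming the import paths from the previous sections and the labels list defined above:

```python
from anonipy.constants import LANGUAGES
from anonipy.anonymize.extractors import NERExtractor

# initialize the NER extractor with the labels defined above
ner_extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)
```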
The NERExtractor receives the text to be anonymized and returns the enriched text document and the extracted entities.
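A usage sketch, following the description above (the exact call signature is assumed):

```python
# process the text and get the enriched document and the extracted entities
doc, entities = ner_extractor(original_text)
```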
The entities extracted within the input text are (entity labels shown in brackets):
Patient Name: John Doe [name]
Date of Birth: 15-01-1985 [date of birth]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789 [social security number]
Examination Procedure:
John Doe [name] underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]
Advice and suggestions
Use specific label names. In the above example, we used specific label names to extract the entities. If we use a less specific name, the entity extractor might not find any relevant entities.
For instance, when using social security number as the label name, the entity extractor is able to extract the social security number from the text. However, if we use ssn or just number as the label name, the entity extractor might not find any relevant entities.
Tip
Using more specific label names is better.
Use custom regex patterns.
In the anonipy package, we provide some predefined ENTITY_TYPES, which are:

Attributes:

| Name | Type | Description |
|---|---|---|
| CUSTOM | Literal[custom] | The custom entity type. |
| STRING | Literal[string] | The string entity type. |
| INTEGER | Literal[integer] | The integer entity type. |
| FLOAT | Literal[float] | The float entity type. |
| DATE | Literal[date] | The date entity type. |
| EMAIL | Literal[email] | The email entity type. |
| WEBSITE_URL | Literal[website_url] | The website url entity type. |
| PHONE_NUMBER | Literal[phone_number] | The phone number entity type. |
These entity types also have a corresponding regex pattern, as defined in the regex submodule.
If the user wants to use a custom regex pattern, they can define it in the labels variable list. Using a custom regex pattern allows the user to specify a stricter pattern that the entity must match.
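For illustration, a label using a custom regex pattern might be defined as follows; the assumption here is that ENTITY_TYPES is exposed in anonipy.constants:

```python
from anonipy.constants import ENTITY_TYPES

# a label using the predefined CUSTOM entity type together with a custom regex pattern
labels = [
    {
        "label": "social security number",
        "type": ENTITY_TYPES.CUSTOM,
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
]
```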
Pattern extractor
The PatternExtractor uses custom spacy and regex patterns to extract entities. When documents have a consistent format and structure, the pattern extractor can be useful, as it extracts entities in a consistent way.
The PatternExtractor takes the following parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| labels | List[dict] | The list of labels and patterns to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: | 'ent' |
We must define the labels and their patterns used to extract the relevant entities. The patterns are defined using spacy patterns or regex patterns.
In this example, we will use the following labels and patterns:
```python
labels = [
    # the pattern is defined using a regex pattern, where the parentheses indicate the core entity value
    {"label": "symptoms", "regex": r"\((.*)\)"},
    # the pattern is defined using spacy patterns
    {
        "label": "medicine",
        "pattern": [[{"IS_ALPHA": True}, {"LIKE_NUM": True}, {"LOWER": "mg"}]],
    },
    # the pattern is defined using spacy patterns
    {
        "label": "date",
        "pattern": [
            [
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dddd"},
            ]
        ],
    },
]
```
Let us now initialize the pattern extractor.
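A minimal initialization sketch, assuming the same import paths as before and the labels list with patterns defined above:

```python
from anonipy.constants import LANGUAGES
from anonipy.anonymize.extractors import PatternExtractor

# initialize the pattern extractor with the labels and patterns defined above
pattern_extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
```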
The PatternExtractor receives the original text and returns the enriched text document and the extracted entities.
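A usage sketch, following the description above (the exact call signature is assumed):

```python
# process the text and get the enriched document and the extracted entities
doc, entities = pattern_extractor(original_text)
```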
The entities extracted within the input text are (entity labels shown in brackets):
Patient Name: John Doe
Date of Birth: 15-01-1985 [date]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature [symptoms]), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg [medicine]: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg [medicine]: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]
Multi extractor
The MultiExtractor is an extractor that combines multiple extractors to extract entities.
The motivation behind the multi extractor is the following: depending on the document format, personal information can be located in different locations; some of them can be found at similar places, while others can be found in different places and formats. Because of this, we would need to use the NERExtractor to automatically identify the entities at different locations and the PatternExtractor to extract the entities that appear at the same location.
The MultiExtractor enables the use of both extractors at the same time. Furthermore, if both extractors identify entities at similar locations, the MultiExtractor will also provide a list of joint entities.
The MultiExtractor takes the following parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extractors | List[ExtractorInterface] | The list of extractors to use. | required |
In this example, we will use the previously initialized NER and pattern extractors.
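A hedged initialization sketch, reusing the ner_extractor and pattern_extractor from the previous sections (the import path is an assumption):

```python
from anonipy.anonymize.extractors import MultiExtractor

# combine the previously initialized NER and pattern extractors
multi_extractor = MultiExtractor(extractors=[ner_extractor, pattern_extractor])
```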
Similar to before, the MultiExtractor receives the original text, but returns the outputs of all the extractors, as well as the joint entities from all the extractors.
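A usage sketch consistent with the return structure described below (the exact call signature is assumed):

```python
# run all extractors on the original text
extractor_outputs, joint_entities = multi_extractor(original_text)
```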
In this case, extractor_outputs[0] will contain the (doc, entities) from the NER extractor, and extractor_outputs[1] will contain the (doc, entities) from the pattern extractor. The joint_entities will contain the joint entities from all the extractors.
Conclusion
The extractors are used to extract entities from the text. The anonipy package supports both machine learning-based and pattern-based entity extraction, enabling information identification and extraction from different textual formats.