Extractors overview
In this post, we give an overview of the implemented extractors. The extractors are used to extract relevant named entities from text. These entities can be people's names, organizations, addresses, social security numbers, etc. The extracted entities are then used to anonymize the text.
All extractors and their API references are available in the extractors module. What follows is a presentation of the different extractors anonipy provides.
Pre-requisites
Let us first define the text from which we want to extract the entities.
```python
original_text = """\
Medical Record
Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
```
Language configuration
Each extractor requires a language to be configured. The language is used to determine how to process the text. If the language is not specified, the extractor will use the default language, which is ENGLISH.
To make it easier to switch languages, we can use the LANGUAGES constant. For example, LANGUAGES.ENGLISH returns the ("en", "English") literal tuple, which is the format required by the extractors.
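As an illustration, a minimal sketch of selecting a language (assuming the LANGUAGES constant is exposed in anonipy.constants):

```python
from anonipy.constants import LANGUAGES

# the literal tuple ("en", "English"), the format the extractors expect
lang = LANGUAGES.ENGLISH
```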
Using the language detector
An alternative is to use the language detector available in the language_detector module. The detector utilizes the lingua python package and allows automatic detection of the language of the text.

```python
from anonipy.utils.language_detector import LanguageDetector

# initialize the language detector and detect the language
language_detector = LanguageDetector()
language_detector(original_text)
```

The language_detector returns the literal tuple ("en", "English"), similar to LANGUAGES.ENGLISH, making it compatible with the extractors.
Named Entity
Each extractor extracts named entities from the text. The entities can be people's names, organizations, addresses, social security numbers, etc. The entities are represented using the Entity dataclass, which consists of the following attributes:
Attributes:

| Name | Type | Description |
|---|---|---|
| text | str | The text of the entity. |
| label | str | The label of the entity. |
| start_index | int | The start index of the entity in the text. |
| end_index | int | The end index of the entity in the text. |
| score | float | The prediction score of the entity. The score is returned by the extractor models. |
| type | ENTITY_TYPES | The type of the entity. |
| regex | Union[str, Pattern] | The regular expression the entity must match. |
The Entity dataclass also provides the get_regex_group() method.

Returns:

| Type | Description |
|---|---|
| Union[str, None] | The regex group. |
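For illustration only, an extracted entity might look like the following sketch; the import path and the exact field values are assumptions based on the attribute table above:

```python
from anonipy.definitions import Entity  # assumed import path

# a hypothetical entity for the patient name in the example text
# (character offsets and score are illustrative values)
entity = Entity(
    text="John Doe",
    label="name",
    start_index=29,
    end_index=37,
    score=1.0,
    type="string",
    regex=".*",
)
```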
Extractors
All following extractors are available in the extractors module.
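For reference, a hedged import sketch; the module path anonipy.anonymize.extractors is an assumption based on the package layout:

```python
# import the extractors described below
from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
```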
Named entity recognition (NER) extractor
The NERExtractor uses a span-based NER model to identify the relevant entities in the text. Specifically, it uses the GLiNER span-based NER model finetuned for recognizing Personal Identifiable Information (PII) within text. The model has been finetuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but can also be applied to other languages.
The NERExtractor takes the following input parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| labels | List[dict] | The list of labels to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| score_th | float | The score threshold. Entities with a score below this threshold will be ignored. | 0.5 |
| use_gpu | bool | Whether to use GPU. | False |
| gliner_model | str | The gliner model to use to identify the entities. | 'urchade/gliner_multi_pii-v1' |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: | 'ent' |
We must define the labels to be extracted and their types. In this example, we will extract the following entities:
```python
labels = [
    {"label": "name", "type": "string"},
    {"label": "social security number", "type": "custom", "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}"},
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
```
Let us now initialize the entity extractor.
Initialization warnings
The initialization of NERExtractor will throw some warnings. Ignore them. These are expected due to the use of package dependencies.
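A minimal initialization sketch, assuming the import paths from the previous sections and the labels list defined above:

```python
from anonipy.constants import LANGUAGES
from anonipy.anonymize.extractors import NERExtractor

# initialize the NER extractor with the labels defined above
ner_extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)
```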
The NERExtractor receives the text to be anonymized and returns the enriched text document and the extracted entities.
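A usage sketch, following the description above (the exact call signature is assumed):

```python
# process the text and get the enriched document and the extracted entities
doc, entities = ner_extractor(original_text)
```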
The entities extracted within the input text are (entity labels shown in brackets):
Patient Name: John Doe [name]
Date of Birth: 15-01-1985 [date of birth]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789 [social security number]
Examination Procedure:
John Doe [name] underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]
Advice and suggestions
Use specific label names. In the above example, we used specific label names to extract the entities. If we use a less specific name, the entity extractor might not find any relevant entities.
For instance, when using social security number as the label name, the entity extractor is able to extract the social security number from the text. However, if we use ssn or just number as the label name, the entity extractor might not find any relevant entities.
Tip
Using more specific label names is better.
Use custom regex patterns.
In the anonipy package, we provide some predefined ENTITY_TYPES, which are:

Attributes:

| Name | Type | Description |
|---|---|---|
| CUSTOM | Literal[custom] | The custom entity type. |
| STRING | Literal[string] | The string entity type. |
| INTEGER | Literal[integer] | The integer entity type. |
| FLOAT | Literal[float] | The float entity type. |
| DATE | Literal[date] | The date entity type. |
| EMAIL | Literal[email] | The email entity type. |
| WEBSITE_URL | Literal[website_url] | The website url entity type. |
| PHONE_NUMBER | Literal[phone_number] | The phone number entity type. |
These entity types also have a corresponding regex pattern, as defined in the regex submodule.
If the user wants to use a custom regex pattern, they can define it in the labels variable list. Using a custom regex pattern allows the user to specify a stricter pattern that the entity must match.
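For illustration, a label using a custom regex pattern might be defined as follows; the assumption here is that ENTITY_TYPES is exposed in anonipy.constants:

```python
from anonipy.constants import ENTITY_TYPES

# a label using the predefined CUSTOM entity type together with a custom regex pattern
labels = [
    {
        "label": "social security number",
        "type": ENTITY_TYPES.CUSTOM,
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
]
```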
Pattern extractor
The PatternExtractor uses custom spacy and regex patterns to extract entities. When documents have a consistent format and structure, the pattern extractor can be useful, as it extracts entities in a consistent way.
The PatternExtractor takes the following parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| labels | List[dict] | The list of labels and patterns to extract. | required |
| lang | LANGUAGES | The language of the text to extract. | ENGLISH |
| spacy_style | str | The style the entities should be stored in the spacy doc. Options: | 'ent' |
We must define the labels and their patterns used to extract the relevant entities. The patterns are defined using spacy patterns or regex patterns.
In this example, we will use the following labels and patterns:
```python
labels = [
    # the pattern is defined using a regex pattern, where the parentheses indicate the core entity value
    {"label": "symptoms", "regex": r"\((.*)\)"},
    # the pattern is defined using spacy patterns
    {
        "label": "medicine",
        "pattern": [[{"IS_ALPHA": True}, {"LIKE_NUM": True}, {"LOWER": "mg"}]],
    },
    # the pattern is defined using spacy patterns
    {
        "label": "date",
        "pattern": [
            [
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dd"},
                {"TEXT": "-"},
                {"SHAPE": "dddd"},
            ]
        ],
    },
]
```
Let us now initialize the pattern extractor.
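A minimal initialization sketch, assuming the same import paths as before and the labels list with patterns defined above:

```python
from anonipy.constants import LANGUAGES
from anonipy.anonymize.extractors import PatternExtractor

# initialize the pattern extractor with the labels and patterns defined above
pattern_extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
```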
The PatternExtractor receives the original text and returns the enriched text document and the extracted entities.
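A usage sketch, following the description above (the exact call signature is assumed):

```python
# process the text and get the enriched document and the extracted entities
doc, entities = pattern_extractor(original_text)
```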
The entities extracted within the input text are (entity labels shown in brackets):
Patient Name: John Doe
Date of Birth: 15-01-1985 [date]
Date of Examination: 20-05-2024 [date]
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature [symptoms]), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg [medicine]: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg [medicine]: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 [date]
Multi extractor
The MultiExtractor is an extractor that combines multiple extractors to extract entities.
The motivation behind the multi extractor is the following: depending on the document format, personal information can be located in different locations; some of them can be found at similar places, while others can be found in different places and formats. Because of this, we would need to use the NERExtractor to automatically identify the entities at different locations and the PatternExtractor to extract the entities that appear at the same location.
The MultiExtractor enables the use of both extractors at the same time. Furthermore, if both extractors identify entities at similar locations, the MultiExtractor will also provide a list of joint entities.
The MultiExtractor takes the following parameters:

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| extractors | List[ExtractorInterface] | The list of extractors to use. | required |
In this example, we will use the previously initialized NER and pattern extractors.
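A hedged initialization sketch, reusing the ner_extractor and pattern_extractor from the previous sections (the import path is an assumption):

```python
from anonipy.anonymize.extractors import MultiExtractor

# combine the previously initialized NER and pattern extractors
multi_extractor = MultiExtractor(extractors=[ner_extractor, pattern_extractor])
```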
Similar to before, the MultiExtractor receives the original text, but returns the outputs of all the extractors, as well as the joint entities from all the extractors.
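A usage sketch consistent with the return structure described below (the exact call signature is assumed):

```python
# run all extractors on the original text
extractor_outputs, joint_entities = multi_extractor(original_text)
```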
In this case, extractor_outputs[0] will contain the (doc, entities) from the NER extractor, and extractor_outputs[1] will contain the (doc, entities) from the pattern extractor. The joint_entities will contain the joint entities from all the extractors.
Conclusion
The extractors are used to extract entities from the text. The anonipy package supports both machine learning-based and pattern-based entity extraction, enabling information identification and extraction from different textual formats.