Extractors
This chapter shows how to use the label extractors in the package.
The label extractors extract relevant named entities from text, such as people's names, organizations, addresses, and social security numbers.
The entities are then used to anonymize the text.
# used to hide warnings
import warnings
warnings.filterwarnings("ignore")
Let us first define the text from which we want to extract the entities.
original_text = """\
Medical Record
Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
Language configuration
First, we must specify the language that the text is written in. We can do this manually or by using a language detector.
Manual selection
One option, when all of the texts are in the same language, is to specify the text language manually.
In the anonipy package, we provide a constant called LANGUAGES in the constants submodule, which contains all the supported languages. Please find the format of the language code in the constants module.
Since the original_text is in English, we will use the LANGUAGES.ENGLISH predefined constant.
from anonipy.constants import LANGUAGES
LANGUAGES.ENGLISH
('en', 'English')
Using language detector
An alternative is to use the language detector available in the anonipy package.
The language detector is built on the lingua Python package and automatically detects the language of the text.
from anonipy.utils.language_detector import LanguageDetector
Initialize the language detector and use it to automatically detect the language of the text.
lang_detector = LanguageDetector()
lang_detector(original_text)
('en', 'English')
Using extractors
Initialization
We can now initialize the label extractors. This is done using the EntityExtractor class found in the anonipy.anonymize.extractors submodule.
Info
The EntityExtractor class is built on the GLiNER models, specifically the one fine-tuned for recognizing Personally Identifiable Information (PII) within text. The model has been fine-tuned on six languages (English, French, German, Spanish, Italian, and Portuguese), but can also be applied to other languages.
from anonipy.anonymize.extractors import EntityExtractor
The EntityExtractor class takes the following arguments:
labels: A list of dictionaries containing the labels to be extracted.
lang: The language of the text to be anonymized. Defaults to LANGUAGES.ENGLISH.
score_th: The score threshold used to filter the labels, i.e. the entity has to have a score greater than score_th to be considered. Defaults to 0.5.
use_gpu: Whether to use the GPU. Defaults to False.
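To make the role of score_th concrete, the following is a minimal sketch of threshold filtering, not the anonipy implementation. The Candidate class and the filter_by_score function are hypothetical names introduced only for this illustration, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a model prediction (not an anonipy class)."""

    text: str
    label: str
    score: float


def filter_by_score(candidates: list[Candidate], score_th: float = 0.5) -> list[Candidate]:
    # Keep only candidates whose score is strictly greater than the threshold.
    return [c for c in candidates if c.score > score_th]


# Illustrative candidates with made-up scores.
candidates = [
    Candidate("John Doe", "name", 0.99),
    Candidate("routine", "name", 0.12),
    Candidate("15-11-2024", "date", 0.83),
]

kept = filter_by_score(candidates, score_th=0.5)
print([c.text for c in kept])
```

Raising the threshold trades recall for precision: fewer entities are returned, but those that remain are the ones the model is more certain about.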
We must now define the labels to be extracted. In this example, we will extract people's names, dates, and the social security number from the text.
labels = [
    {"label": "name", "type": "string"},
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
Let us now initialize the entity extractor.
Info
Initializing EntityExtractor will raise some warnings. They are expected, since they come from the package dependencies, and can be ignored.
entity_extractor = EntityExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)
Entity extraction
The EntityExtractor receives the text to be anonymized and returns the enriched text document and the extracted entities.
doc, entities = entity_extractor(original_text)
The entities extracted within the input text are:
entity_extractor.display(doc)
Patient Name: John Doe name
Date of Birth: 15-01-1985 date of birth
Date of Examination: 20-05-2024 date
Social Security Number: 123-45-6789 social security number
Examination Procedure:
John Doe name underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 date
The extracted entities are stored in the entities variable. Each entity contains the following information:
text: The text of the entity.
label: The label of the entity.
start_index: The start index of the entity in the text.
end_index: The end index of the entity in the text.
score: The score of the entity. It shows how certain the model is that the entity is relevant.
type: The type of the entity (taken from the defined labels variable list).
regex: The regular expression the entity must match.
entities
[Entity(text='John Doe', label='name', start_index=30, end_index=38, score=0.9961156845092773, type='string', regex='.*'), Entity(text='15-01-1985', label='date of birth', start_index=54, end_index=64, score=0.9937193393707275, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'), Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=0.9867385625839233, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})'), Entity(text='123-45-6789', label='social security number', start_index=121, end_index=132, score=0.9993416666984558, type='custom', regex='[0-9]{3}-[0-9]{2}-[0-9]{4}'), Entity(text='John Doe', label='name', start_index=157, end_index=165, score=0.994924783706665, type='string', regex='.*'), Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=0.8285622596740723, type='date', regex='(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})')]
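The start_index and end_index fields are what make the entities usable for anonymization: in the output above, end_index minus start_index equals the length of the entity text, so the end index is exclusive and a plain slice recovers the entity. The sketch below illustrates this with a hypothetical Span class and mask_spans helper; it is not how anonipy anonymizes text, only a demonstration of how the indices relate to the input.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in carrying the same index fields as an extracted entity."""

    text: str
    label: str
    start_index: int
    end_index: int


def mask_spans(text: str, spans: list[Span]) -> str:
    # Replace from the end of the text so that earlier indices stay valid.
    for span in sorted(spans, key=lambda s: s.start_index, reverse=True):
        placeholder = f"[{span.label.upper().replace(' ', '_')}]"
        text = text[: span.start_index] + placeholder + text[span.end_index :]
    return text


sample = "Patient Name: John Doe\nDate of Birth: 15-01-1985\n"

# Build the spans by locating each value, so the indices are correct by construction.
spans = []
for value, label in [("John Doe", "name"), ("15-01-1985", "date of birth")]:
    start = sample.find(value)
    spans.append(Span(value, label, start, start + len(value)))

# The end index is exclusive, so slicing recovers the entity text exactly.
assert all(sample[s.start_index : s.end_index] == s.text for s in spans)

print(mask_spans(sample, spans))
```

Replacing from the last span to the first avoids recomputing offsets after each substitution, since a replacement only shifts the text that follows it.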
Advice and suggestions
Use specific label names. In the above example, we used specific label names to extract the entities. If we use a less specific name, the entity extractor might not find any relevant entity.
For instance, when using social security number as the label name, the entity extractor is able to extract the social security number from the text. However, if we use ssn or just number as the label name, the entity extractor might not find any relevant entity.
Tip
Using more specific label names is better.
Use custom regex patterns.
In the anonipy package, we provide some predefined entity types, which are:
string. Extracts a string from the text.
integer. Extracts an integer from the text.
float. Extracts a float from the text.
date. Extracts a date from the text.
email. Extracts an email address from the text.
phone_number. Extracts a phone number from the text.
website_url. Extracts a URL from the text.
These entity types also have a corresponding regex pattern, as defined in the anonipy.anonymize.regex submodule.
from anonipy.anonymize.regex import regex_map
# print the regex pattern associated with each predefined entity type
for entity_type in [
    "string",
    "integer",
    "float",
    "date",
    "email",
    "phone_number",
    "website_url",
]:
    print(f"{entity_type:<13}: {regex_map(entity_type)}")
string       : .*
integer      : \d+
float        : [\d\.,]+
date         : (\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})
email        : [a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
phone_number : [(]?[\+]?[(]?[0-9]{1,3}[)]?[-\s\.]?([0-9]{2,}[-\s\.]?){2,}([0-9]{3,})
website_url  : ((https?|ftp|smtp):\/\/)?(www.)?([a-zA-Z0-9]+\.)+[a-z]{2,}(\/[a-zA-Z0-9#\?\_\.\=\-\&]+|\/?)*
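To get a feel for what these patterns accept, they can be tried directly with Python's re module. The sketch below copies the integer and date patterns from the listing above, together with the social security number pattern used in this chapter. It assumes the whole candidate must match the pattern (re.fullmatch); how anonipy applies the pattern internally may differ, and the matches helper is a hypothetical name.

```python
import re

# Patterns copied from this chapter; the last one is the custom SSN pattern.
PATTERNS = {
    "integer": r"\d+",
    "date": r"(\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})|(\d{2,4}[\/\-\.]\d{1,2}[\/\-\.]\d{1,2})",
    "social security number": r"[0-9]{3}-[0-9]{2}-[0-9]{4}",
}


def matches(entity_type: str, candidate: str) -> bool:
    # Assumption: the whole candidate must match; anonipy may apply the pattern differently.
    return re.fullmatch(PATTERNS[entity_type], candidate) is not None


for entity_type, candidate in [
    ("date", "15-01-1985"),
    ("date", "2024/05/20"),
    ("date", "123-45-6789"),
    ("social security number", "123-45-6789"),
    ("integer", "200 mg"),
]:
    print(f"{entity_type:<22} {candidate:<12} {matches(entity_type, candidate)}")
```

Under this assumption the date pattern rejects the social security number as a whole, while the dedicated pattern accepts it, which is the kind of distinction a stricter regex provides.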
If the user wants to use a custom regex pattern, they can define it in the labels variable list. Using a custom regex pattern allows the user to specify a stricter pattern that the entity must match.
The custom regex can be specified in the following way:
labels = [
    {"label": "name", "type": "string"},
    # using the custom regex pattern: type must be 'custom' and specify the regex pattern in the 'regex' key
    {
        "label": "social security number",
        "type": "custom",
        "regex": "[0-9]{3}-[0-9]{2}-[0-9]{4}",
    },
    {"label": "date of birth", "type": "date"},
    {"label": "date", "type": "date"},
]
Let's rerun the above example:
# ignore the warnings: these are expected due to the use of package dependencies
entity_extractor = EntityExtractor(labels, lang=LANGUAGES.ENGLISH, score_th=0.5)
doc, entities = entity_extractor(original_text)
The extracted entities are the same as before, since the labels are unchanged. Note that the social security number is accepted only because it also matches the custom regex pattern.
entity_extractor.display(doc)
Patient Name: John Doe name
Date of Birth: 15-01-1985 date of birth
Date of Examination: 20-05-2024 date
Social Security Number: 123-45-6789 social security number
Examination Procedure:
John Doe name underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024 date
Creating custom extractors
The user can develop their own custom extractor. To do this, the custom extractor must inherit from the ExtractorInterface class.
The extractor must have two methods defined: __init__ and __call__.
An example of a custom extractor that extracts only a specific regex pattern from the text is shown below:
import re
from anonipy.anonymize.extractors import ExtractorInterface
from anonipy.definitions import Entity
class CustomExtractor(ExtractorInterface):
    def __init__(self):
        # the custom extractor will retrieve entities that follow the regex pattern
        self.regex_pattern = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")

    def __call__(self, text: str) -> tuple[str, list[Entity]]:
        entities = []
        # every non-overlapping match of the pattern becomes a "date" entity
        for match in re.finditer(self.regex_pattern, text):
            entities.append(
                Entity(
                    text=match.group(),
                    label="date",
                    start_index=match.start(),
                    end_index=match.end(),
                    score=1.0,
                    type="date",
                    regex=self.regex_pattern,
                )
            )
        return text, entities
custom_extractor = CustomExtractor()
_, entities = custom_extractor(original_text)
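Let us output the extracted entities. Note that the third entity, 23-45-6789, is a fragment of the social security number: the pattern is not anchored to word boundaries, so it also matches inside longer digit sequences.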
entities
[Entity(text='15-01-1985', label='date', start_index=54, end_index=64, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')), Entity(text='20-05-2024', label='date', start_index=86, end_index=96, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')), Entity(text='23-45-6789', label='date', start_index=122, end_index=132, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}')), Entity(text='15-11-2024', label='date', start_index=717, end_index=727, score=1.0, type='date', regex=re.compile('\\d{1,2}-\\d{1,2}-\\d{2,4}'))]
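One possible way to avoid picking up a fragment of the social security number is to wrap the pattern in lookarounds, so that a match cannot begin or end next to another digit or hyphen. The sketch below compares the two patterns on a shortened sample; the GUARDED pattern is a suggestion for illustration, not something provided by anonipy.

```python
import re

# Same date shape as in CustomExtractor, without and with lookaround guards.
PLAIN = re.compile(r"\d{1,2}-\d{1,2}-\d{2,4}")
# The guards forbid a digit or hyphen immediately before and after the match.
GUARDED = re.compile(r"(?<![\d-])\d{1,2}-\d{1,2}-\d{2,4}(?![\d-])")

sample = "Date of Birth: 15-01-1985\nSocial Security Number: 123-45-6789\n"

plain_hits = [m.group() for m in PLAIN.finditer(sample)]
guarded_hits = [m.group() for m in GUARDED.finditer(sample)]

print(plain_hits)    # includes the fragment of the social security number
print(guarded_hits)  # only the date of birth
```

The guarded pattern could replace the one compiled in __init__ without changing the rest of the extractor, since only the compiled expression differs.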