Generators overview
This post gives an overview of the implemented generators. Generators create new text that serves as a substitute for the extracted named entities. The substitutes can then be used to replace the entities and anonymize the text.
All generators and their API references are available in the generators module. What follows is a presentation of the different generators `anonipy` provides.
Pre-requisites
Let us first define the text from which we want to extract the entities.
original_text = """\
Medical Record

Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
Normally, the entities would be extracted using an extractor. For this example, we manually define the entities.
from anonipy.definitions import Entity

# start_index and end_index are character offsets into original_text
entities = [
    Entity(
        text="John Doe",
        label="name",
        start_index=30,
        end_index=38,
        type="string",
    ),
    Entity(
        text="20-05-2024",
        label="date",
        start_index=86,
        end_index=96,
        type="date",
    ),
    Entity(
        text="123-45-6789",
        label="social security number",
        start_index=121,
        end_index=132,
        type="custom",
        # raw string, so the backslashes reach the regex engine unchanged
        regex=r"\d{3}-\d{2}-\d{4}",
    ),
]
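Because the indices are written by hand, it can be worth checking that every span really points at the entity text. The following is a minimal sketch of such a check. It uses a small stand-in dataclass (`SimpleEntity`, hypothetical, not the `anonipy` `Entity` class), a shortened sample text, and an illustrative helper named `find_misaligned`.

```python
from dataclasses import dataclass


@dataclass
class SimpleEntity:
    """Hypothetical stand-in for an entity; not the anonipy Entity class."""

    text: str
    label: str
    start_index: int
    end_index: int


def find_misaligned(text, entities):
    """Return the entities whose [start_index, end_index) slice differs from their text."""
    return [e for e in entities if text[e.start_index:e.end_index] != e.text]


sample_text = "Medical Record\n\nPatient Name: John Doe\nDate of Birth: 15-01-1985\n"
sample_entities = [
    SimpleEntity("John Doe", "name", 30, 38),    # correct offsets
    SimpleEntity("15-01-1985", "date", 53, 63),  # deliberately off by one
]

misaligned = find_misaligned(sample_text, sample_entities)
print(misaligned)
```

Only the second entity is reported, since its span starts one character too early.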
Generators
All of the generators below are available in the generators module.
The LLM label generator
The LLMLabelGenerator is a one-stop-shop generator that utilizes LLMs to generate replacements for entities. It is implemented to support any entity type.
The `LLMLabelGenerator` currently does not require any input parameters at initialization.
Let us now initialize the LLM label generator.
Initialization warnings
The initialization of `LLMLabelGenerator` will throw some warnings. They can be ignored; they are expected and come from the package dependencies.
To use the generator, we can call the `generate` method, which receives the following parameters:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`entity` | `Entity` | The entity to generate the label from. | required |
`add_entity_attrs` | `str` | Additional entity attribute description to add to the generation. | `''` |
`temperature` | `float` | The temperature to use for the generation. | `1.0` |
`top_p` | `float` | The top p to use for the generation. | `0.95` |
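To give some intuition for `temperature` and `top_p`, here is a small sketch of how these two settings are commonly defined for LLM sampling. It is not taken from the `LLMLabelGenerator` implementation, whose internals are not covered in this post. The scores in `toy_logits` are made up, and `apply_temperature` and `top_p_filter` are illustrative helper names.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the surviving tokens


toy_logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for four candidate tokens
default_probs = apply_temperature(toy_logits, temperature=1.0)
sharper_probs = apply_temperature(toy_logits, temperature=0.5)
nucleus = top_p_filter(default_probs, top_p=0.95)
```

With these toy scores, lowering the temperature concentrates more probability on the top candidate, and the `top_p=0.95` filter drops the least likely candidate before sampling.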
Let us generate the replacement for the first entity from `entities` using the default parameters. The generator receives the `John Doe` name entity and might return the replacement `Ethan Thomson`.
Let us now add an entity attribute description through `add_entity_attrs` and generate the replacement using a different temperature. The generator receives the `John Doe` name entity and, under these generation parameters, might return the replacement `Juan Rodrigez`.
Going through the whole `entities` list, the `LLMLabelGenerator`, using the default parameters, might generate the following replacements:
Entity | Type | Label | Replacement |
---|---|---|---|
John Doe | string | name | Ethan Thomson |
20-05-2024 | date | date | 23-07-2027 |
123-45-6789 | custom | social security number | 987-65-4321 |
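Once each entity has a replacement, the substitutes can be written back into the text. The sketch below shows one simple way to do this with plain string slicing. `apply_replacements` is a hypothetical helper, not part of `anonipy`, and the sample record is shortened for readability.

```python
def apply_replacements(text, replacements):
    """Replace (start, end, substitute) spans in text.

    Spans are applied from the end of the text backwards, so earlier
    offsets stay valid even when a substitute has a different length.
    """
    result = text
    for start, end, substitute in sorted(replacements, reverse=True):
        result = result[:start] + substitute + result[end:]
    return result


record = "Patient Name: John Doe\nSocial Security Number: 123-45-6789\n"
spans = [
    (14, 22, "Ethan Thomson"),  # covers "John Doe"
    (47, 58, "987-65-4321"),    # covers "123-45-6789"
]
anonymized = apply_replacements(record, spans)
print(anonymized)
```

Note that this only rewrites the listed spans; other mentions of the same entity elsewhere in the text would need their own spans.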
Advice and suggestions
Use `LLMLabelGenerator` only for string and custom types.
While the `LLMLabelGenerator` is able to generate alternatives for different entity types, we suggest using it only for string and custom entity types, because LLMs can be quite slow at generating replacements. In addition, `anonipy` has other generators that can be used for other entity types, such as dates and numbers.
Restrict with regex.
LLMs generate the most reliable replacements when the generation is restricted to a specific pattern. The `Entity` object already contains a `regex` field that can be used for this purpose. We recommend making the regex as specific and restrictive as possible, as this helps the `LLMLabelGenerator` generate more accurate replacements.
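To illustrate why a restrictive pattern matters, the following sketch checks candidate strings against the social security number pattern used for the custom entity above, and compares it with a looser pattern. It only uses Python's `re` module; `matches_pattern` and `LOOSE_PATTERN` are illustrative names and say nothing about how `anonipy` applies the regex internally.

```python
import re

SSN_PATTERN = r"\d{3}-\d{2}-\d{4}"  # same pattern as the custom entity above
LOOSE_PATTERN = r"[\d-]+"           # hypothetical, much less specific alternative


def matches_pattern(candidate, pattern):
    """Return True only if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None


candidates = ["987-65-4321", "987654321", "SSN: 987-65-4321", "9-8"]
strict_accepted = [c for c in candidates if matches_pattern(c, SSN_PATTERN)]
loose_accepted = [c for c in candidates if matches_pattern(c, LOOSE_PATTERN)]
```

The strict pattern accepts only the well-formed number, while the loose pattern also lets through strings that no longer look like a social security number.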
The mask label generator
The MaskLabelGenerator is a generator that uses smaller language models, such as XLM-RoBERTa, to generate replacements for entities. It is implemented to support any entity type, but we suggest using it only with string entities. For other entity types, please use other available generators.
The `MaskLabelGenerator` requires the following input parameters at initialization:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`model_name` | `str` | The name of the masking model to use. | `'FacebookAI/xlm-roberta-large'` |
`use_gpu` | `bool` | Whether to use GPU/CUDA, if available. | `False` |
`context_window` | `int` | The context window size. | `100` |
Let us now initialize the mask label generator.
Initialization warnings
The initialization of `MaskLabelGenerator` will throw some warnings. They can be ignored; they are expected and come from the package dependencies.
To use the generator, we can call the `generate` method, which receives the following parameters:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`entity` | `Entity` | The entity used to generate the substitute. | required |
`text` | `str` | The original text in which the entity is located; used to get the entity's context. | required |
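The `text` parameter and the `context_window` setting suggest that the entity is replaced by a mask token inside a slice of the surrounding text before the masking model is queried. The sketch below illustrates that idea only. It assumes the window is measured in characters on each side of the entity, which this post does not confirm, and `build_masked_context` is a hypothetical helper. `<mask>` is the mask token used by XLM-RoBERTa.

```python
def build_masked_context(text, start_index, end_index, context_window=100, mask_token="<mask>"):
    """Keep up to context_window characters on each side of the entity and mask the entity."""
    left = text[max(0, start_index - context_window):start_index]
    right = text[end_index:end_index + context_window]
    return left + mask_token + right


note = "Patient Name: John Doe\nDate of Birth: 15-01-1985\n"
masked = build_masked_context(note, 14, 22, context_window=20)
print(masked)
```

A masked language model would then be asked to fill in the `<mask>` position, and its predictions become the candidate substitutes.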
This generator will create a list of suggestions from which it will select one at random. Therefore, the generator will return different suggestions every time it is called.
mask_generator.generate(entities[0], text=original_text)  # (1)!
mask_generator.generate(entities[0], text=original_text)  # (2)!
mask_generator.generate(entities[0], text=original_text)  # (3)!

1. The first generation for the `John Doe` name entity might return the replacement `James Smith`.
2. The second generation might return the replacement `Michael Smith`.
3. The third generation might return the replacement `David Blane`.
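The random selection step described above can be pictured with a few lines of plain Python. This is a sketch under assumptions: the candidate list is made up, `pick_substitute` is a hypothetical helper, and skipping candidates equal to the original text is a design choice of this example rather than documented `anonipy` behavior.

```python
import random


def pick_substitute(original, suggestions, rng=random):
    """Pick one suggestion at random, skipping any that equals the original text."""
    usable = [s for s in suggestions if s != original]
    if not usable:
        raise ValueError("No usable suggestion available.")
    return rng.choice(usable)


suggestions = ["James Smith", "John Doe", "Michael Smith", "David Blane"]
picks = [pick_substitute("John Doe", suggestions) for _ in range(3)]
print(picks)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which can be useful in tests.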
Advice and suggestions
Use only for string entities.
As seen from the above examples, the `MaskLabelGenerator` is best used with string entities. For number and date entities, it is best to use other generators, such as `NumberGenerator` and `DateGenerator`.
The number generator
The `NumberGenerator` is a generator for generating random numbers. It is implemented to support integers, floats, and phone numbers, but it can also be used to generate values for custom types that include numbers.
The `NumberGenerator` currently does not require any input parameters at initialization.
To use the generator, we can call the `generate` method, which receives the following parameters:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`entity` | `Entity` | The numeric entity for which to generate the numeric substitute. | required |

Raises:

Type | Description |
---|---|
`ValueError` | If the entity type is not `integer`, `float`, `phone_number` or `custom`. |
This generator will create a suggestion by replacing numeric values in the entity text at random. Therefore, the generator will return different suggestions every time it is called.
For the `social security number` entity, the generator will return a replacement such as `143-46-4915`.
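The idea of replacing numeric values at random while keeping the rest of the text can be sketched with `re.sub`. This is an illustration of the technique, not the `NumberGenerator` source; `randomize_digits` is a hypothetical helper.

```python
import random
import re


def randomize_digits(text, rng=random):
    """Replace every digit with a random digit, leaving all other characters untouched."""
    return re.sub(r"\d", lambda _match: str(rng.randint(0, 9)), text)


substitute = randomize_digits("123-45-6789")
print(substitute)
```

Because separators are left in place, the result keeps the `ddd-dd-dddd` shape of the original. A digit may occasionally be replaced by itself, since each position is drawn independently.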
Furthermore, it will throw an error if the entity type is not `integer`, `float`, `phone_number` or `custom`. For example, if the provided entity is a `string`, the generator will raise an error. The exception will state: The entity type must be 'integer', 'float', 'phone_number' or 'custom' to generate numbers.
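The type check can be pictured as a simple guard that runs before any number is generated. The sketch below mirrors the documented behavior and error message, but `ensure_numeric_type` and `SUPPORTED_TYPES` are hypothetical names and this is not the library's code.

```python
SUPPORTED_TYPES = ("integer", "float", "phone_number", "custom")


def ensure_numeric_type(entity_type):
    """Raise ValueError unless the entity type is one the number generator accepts."""
    if entity_type not in SUPPORTED_TYPES:
        raise ValueError(
            "The entity type must be 'integer', 'float', 'phone_number' or 'custom' to generate numbers."
        )


try:
    ensure_numeric_type("string")  # a string entity is not supported
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)
```

Wrapping the call in `try`/`except ValueError` like this is one way to fall back to a different generator for unsupported entity types.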
The date generator
The DateGenerator is a generator for generating dates. It is implemented to support date entities.
The `DateGenerator` requires the following input parameters at initialization:
Let us now initialize the date generator.
To use the generator, we can call the `generate` method, which receives the following parameters:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`entity` | `Entity` | The entity to generate the date substitute from. | required |
`sub_variant` | `DATE_TRANSFORM_VARIANTS` | The substitute function variant to use. | `RANDOM` |
Using the above parameters, this generator will create the appropriate date suggestions:
entities[1]  # (1)!
date_generator.generate(entities[1], sub_variant="RANDOM")  # (2)!
date_generator.generate(entities[1], sub_variant="FIRST_DAY_OF_THE_MONTH")  # (3)!
date_generator.generate(entities[1], sub_variant="LAST_DAY_OF_THE_MONTH")  # (4)!
date_generator.generate(entities[1], sub_variant="MIDDLE_OF_THE_MONTH")  # (5)!
date_generator.generate(entities[1], sub_variant="MIDDLE_OF_THE_YEAR")  # (6)!

1. The entity is a `date` entity with the text `20-05-2024`.
2. The `RANDOM` sub-variant will return a random date within the given date range. A possible generation is `26-05-2024`.
3. The `FIRST_DAY_OF_THE_MONTH` sub-variant will return the first day of the month: `01-05-2024`.
4. The `LAST_DAY_OF_THE_MONTH` sub-variant will return the last day of the month: `31-05-2024`.
5. The `MIDDLE_OF_THE_MONTH` sub-variant will return the middle day of the month: `15-05-2024`.
6. The `MIDDLE_OF_THE_YEAR` sub-variant will return the middle day of the year: `01-07-2024`.
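The date transformations listed above can be reproduced with Python's standard `datetime` and `calendar` modules. The sketch below is an illustration of the variants, not the `DateGenerator` implementation. The function names are hypothetical, and the 30-day range used for the random variant is an assumption made for this example.

```python
import calendar
import random
from datetime import date, timedelta


def first_day_of_month(day):
    return day.replace(day=1)


def last_day_of_month(day):
    # calendar.monthrange returns (weekday of the first day, number of days in the month)
    return day.replace(day=calendar.monthrange(day.year, day.month)[1])


def middle_of_month(day):
    return day.replace(day=15)


def middle_of_year(day):
    return day.replace(month=7, day=1)


def random_nearby(day, max_offset_days=30, rng=random):
    """Shift the date by a random number of days; the 30-day range is an assumption."""
    return day + timedelta(days=rng.randint(-max_offset_days, max_offset_days))


examined = date(2024, 5, 20)
variants = {
    "FIRST_DAY_OF_THE_MONTH": first_day_of_month(examined),
    "LAST_DAY_OF_THE_MONTH": last_day_of_month(examined),
    "MIDDLE_OF_THE_MONTH": middle_of_month(examined),
    "MIDDLE_OF_THE_YEAR": middle_of_year(examined),
}
# Format the results in the same day-month-year style as the original text.
formatted = {name: value.strftime("%d-%m-%Y") for name, value in variants.items()}
print(formatted)
```

The deterministic variants give the same values as in the list above, while `random_nearby` returns a different date on each call.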
Furthermore, it will throw an error if the entity type is not `date`. For example, if the provided entity is a `string`, the generator will raise an error. The exception will state: The entity type must be 'date' to generate dates.
Conclusion
Generators create new text that serves as a substitute for the extracted named entities. The substitutes can then be used to replace the entities and anonymize the text.