Generators
This chapter showcases the generators in the anonipy package.
Generators produce replacements for the extracted entities.
To do this, anonipy implements a number of generators for generating:
- strings
- numbers
- dates
All of the generators are implemented in the anonipy.anonymize.generators module.
In the following sections, we present each generator in detail.
# used to hide warnings
import warnings
warnings.filterwarnings("ignore")
Let us first define the text and the associated entities, as seen in the previous chapter (see Extractors).
original_text = """\
Medical Record
Patient Name: John Doe
Date of Birth: 15-01-1985
Date of Examination: 20-05-2024
Social Security Number: 123-45-6789
Examination Procedure:
John Doe underwent a routine physical examination. The procedure included measuring vital signs (blood pressure, heart rate, temperature), a comprehensive blood panel, and a cardiovascular stress test. The patient also reported occasional headaches and dizziness, prompting a neurological assessment and an MRI scan to rule out any underlying issues.
Medication Prescribed:
Ibuprofen 200 mg: Take one tablet every 6-8 hours as needed for headache and pain relief.
Lisinopril 10 mg: Take one tablet daily to manage high blood pressure.
Next Examination Date:
15-11-2024
"""
Normally, the entities are extracted using the EntityExtractor. For this section, we define the entities manually:
from anonipy.definitions import Entity
entities = [
    Entity(
        text="John Doe",
        label="name",
        start_index=30,
        end_index=38,
        score=1.0,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-01-1985",
        label="date of birth",
        start_index=54,
        end_index=64,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="20-05-2024",
        label="date",
        start_index=86,
        end_index=96,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
    Entity(
        text="123-45-6789",
        label="social security number",
        start_index=121,
        end_index=132,
        score=1.0,
        type="custom",
        regex="[0-9]{3}-[0-9]{2}-[0-9]{4}",
    ),
    Entity(
        text="John Doe",
        label="name",
        start_index=157,
        end_index=165,
        score=1.0,
        type="string",
        regex=".*",
    ),
    Entity(
        text="15-11-2024",
        label="date",
        start_index=717,
        end_index=727,
        score=1.0,
        type="date",
        regex="(\\d{1,2}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{2,4})|(\\d{2,4}[\\/\\-\\.]\\d{1,2}[\\/\\-\\.]\\d{1,2})",
    ),
]
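Counting character offsets by hand is error-prone. As a minimal sketch that is not part of anonipy, the offsets of every occurrence of a phrase can be computed with the standard re module. The find_spans helper and the sample string are hypothetical; the resulting pairs could then be used as the start_index and end_index values of an Entity, assuming the end index is exclusive, as it is in the definitions above.

```python
import re


def find_spans(text: str, phrase: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of every occurrence of phrase."""
    # re.escape ensures the phrase is matched literally, not as a pattern
    return [match.span() for match in re.finditer(re.escape(phrase), text)]


# hypothetical sample text, much shorter than the medical record above
sample = "Patient Name: John Doe\nJohn Doe underwent an examination."
spans = find_spans(sample, "John Doe")

for start, end in spans:
    # slicing the text with the offsets must give back the entity text
    assert sample[start:end] == "John Doe"

print(spans)
```

Slicing the text with the computed offsets, as done in the loop, is a quick sanity check that can also be applied to manually defined entities.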
LLMLabelGenerator
Warning
The LLMLabelGenerator uses open-source LLMs, specifically the Llama 3 model.
Because the model is quite large, it is quantized with the bitsandbytes package to reduce its size.
The LLMLabelGenerator therefore requires a GPU with at least 8 GB of memory and CUDA drivers.
If these resources are not available on your machine, you can use the MaskLabelGenerator instead.
The LLMLabelGenerator is a one-stop-shop generator that uses LLMs to generate replacements for entities. It is implemented to support any entity type.
For more details, please check the LLMLabelGenerator class implementation.
Let us first import the generator and initialize it.
Info
The initialization of LLMLabelGenerator emits some warnings, which can be ignored.
They are expected and originate from package dependencies.
from anonipy.anonymize.generators import LLMLabelGenerator
llm_generator = LLMLabelGenerator()
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
To use the generator, call the generate method. It accepts the following parameters:
- entity: The entity to generate a replacement for.
- entity_prefix: The prefix to use for the replacement (Default: "").
- temperature: The temperature to use when generating the replacement. This value should be between 0 and 1, where 0 is the least random and 1 is the most random generation (Default: 0).
Let us generate the replacement for the first entity using the default parameters.
llm_generator.generate(entities[0])
'Ethan Thompson'
Let us now set the entity prefix and generate the replacement using a higher temperature.
llm_generator.generate(entities[0], entity_prefix="Spanish", temperature=0.7)
'Juan Rodrigez'
Let us now generate a replacement for each entity using the default parameters.
for entity in entities:
    print(f"{entity.text:<12} | {entity.label:<22} | {llm_generator.generate(entity)}")
John Doe     | name                   | Ethan Thompson
15-01-1985   | date of birth          | 24-07-1992
20-05-2024   | date                   | 23-07-2027
123-45-6789  | social security number | 987-65-4321
John Doe     | name                   | Ethan Thompson
15-11-2024   | date                   | 23-02-2027
Advice and suggestions
Use the LLMLabelGenerator only for string and custom types.
While the LLMLabelGenerator can generate replacements for different entity types, we suggest using it only for string and custom entity types, because LLMs can be quite slow at generating replacements.
In addition, anonipy has other generators that can be used for other entity types, such as dates and numbers.
Restrict the generation with regex.
LLM text generation works best when it is restricted to a specific pattern.
The Entity object already contains a regex field that can be used to restrict the generation to a specific pattern. We recommend making the regular expressions as specific and restrictive as possible.
This helps the LLMLabelGenerator generate more accurate replacements.
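To illustrate why a restrictive pattern matters, the following sketch checks candidate replacements against an entity regex using only the standard re module. The matches_pattern helper is hypothetical and does not reflect how anonipy applies the regex internally; the social security number pattern is the one used in the entity definitions above.

```python
import re

# same pattern as the "social security number" entity defined earlier
SSN_PATTERN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
# the permissive pattern used for the "name" entities
ANY_PATTERN = r".*"


def matches_pattern(candidate: str, pattern: str) -> bool:
    """Return True only if the whole candidate conforms to the pattern."""
    return re.fullmatch(pattern, candidate) is not None


print(matches_pattern("987-65-4321", SSN_PATTERN))     # well-formed number
print(matches_pattern("Ethan Thompson", SSN_PATTERN))  # rejected by the strict pattern
print(matches_pattern("Ethan Thompson", ANY_PATTERN))  # ".*" accepts any single-line text
```

The permissive ".*" pattern accepts any single-line candidate, so it places no real constraint on the generation, whereas the social security number pattern rules out malformed output.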
MaskLabelGenerator
The MaskLabelGenerator is a generator that uses smaller language models, such as XLM-RoBERTa, to generate replacements for entities. It is implemented to support any entity type, but we suggest using it with string entities. For other entity types, please use other generators.
For more details, please check the MaskLabelGenerator class implementation.
Let us first import the generator and initialize it. At initialization, the generator accepts the following parameters:
- model_name: The model to use for the generation (Default: "FacebookAI/xlm-roberta-large").
- use_gpu: Whether to use the GPU for the generation (Default: False).
- context_window: The size of the context window on both sides of the entity to use for the generation. If the context window is set to 100, the context will be the 100 characters before and after the entity (Default: 100).
Info
The initialization of MaskLabelGenerator emits some warnings, which can be ignored.
They are expected and originate from package dependencies.
from anonipy.anonymize.generators import MaskLabelGenerator
# initialization using default parameters
mask_generator = MaskLabelGenerator()
Some weights of the model checkpoint at FacebookAI/xlm-roberta-large were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
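To make the context_window parameter more concrete, here is a rough sketch of how a masked context could be cut out of a text. It is an illustration under assumptions, not the MaskLabelGenerator implementation: the masked_context helper and the sample string are hypothetical, and "<mask>" is the mask token used by XLM-RoBERTa tokenizers.

```python
def masked_context(
    text: str, start: int, end: int, window: int = 100, mask: str = "<mask>"
) -> str:
    """Replace text[start:end] with a mask token and keep `window` characters on each side."""
    left = text[max(0, start - window):start]  # clamp at the beginning of the text
    right = text[end:end + window]             # slicing past the end is safe in Python
    return f"{left}{mask}{right}"


# hypothetical sample; "John Doe" occupies characters 14 to 22 (end exclusive)
sample = "Patient Name: John Doe\nDate of Birth: 15-01-1985"
context = masked_context(sample, 14, 22, window=10)
print(repr(context))
```

A masked language model would then be asked to fill in the mask token, which is why a larger window gives the model more information about what kind of value fits.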
To use the generator, call the generate method. It accepts the following parameters:
- entity: The entity to generate a replacement for.
- text: The original text from which the generator will retrieve the context of the entity text.
This generator creates a list of suggestions and selects one of them at random. Therefore, it can return a different suggestion every time it is called.
mask_generator.generate(entities[0], text=original_text)
'James Smith'
mask_generator.generate(entities[0], text=original_text)
'Michael Smith'
mask_generator.generate(entities[0], text=original_text)
'David Smith'
for entity in entities:
    print(
        f"{entity.text:<12} | {entity.label:<22} | {mask_generator.generate(entity, text=original_text)}"
    )
John Doe     | name                   | Thomas David
15-01-1985   | date of birth          | None
20-05-2024   | date                   | None
123-45-6789  | social security number | None
John Doe     | name                   | Officer first
15-11-2024   | date                   | None
Advice and suggestions
Use the MaskLabelGenerator only for string entities.
As the examples above show, the MaskLabelGenerator is best used with string entities.
For number and date entities, it is best to use other generators, such as NumberGenerator and DateGenerator.
NumberGenerator
The NumberGenerator is a generator for generating random numbers. It is implemented to support integers, floats, and phone numbers, but it can also be used to generate values for custom types that include numbers.
For more details, please check the NumberGenerator class implementation.
Let us first import the generator and initialize it. The generator at initialization does not need any parameters.
from anonipy.anonymize.generators import NumberGenerator
number_generator = NumberGenerator()
To use the generator, call the generate method. It accepts the following parameters:
- entity: The number entity to generate a replacement for.
This generator creates a suggestion by replacing the numeric values in the entity text at random. Therefore, it can return a different suggestion every time it is called.
number_generator.generate(entities[3])
'143-46-4915'
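The idea of replacing numeric values while keeping the surrounding format can be sketched with the standard library alone. The randomize_digits function below is a hypothetical illustration and may differ from what NumberGenerator does internally; it simply swaps every digit for a random one and leaves separators untouched.

```python
import random
import re
from typing import Optional


def randomize_digits(text: str, rng: Optional[random.Random] = None) -> str:
    """Replace each digit in text with a random digit, keeping all other characters."""
    rng = rng or random.Random()
    # the lambda ignores the matched digit and returns a fresh random one
    return re.sub(r"\d", lambda _match: str(rng.randint(0, 9)), text)


replacement = randomize_digits("123-45-6789")
print(replacement)  # output differs between runs
```

Because only the digits change, the result still matches the entity pattern [0-9]{3}-[0-9]{2}-[0-9]{4}.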
Furthermore, it will throw an error if the entity type is not integer, float, phone_number or custom.
try:
    number_generator.generate(entities[0])
except Exception as e:
    print(e)
The entity type must be `integer`, `float`, `phone_number` or `custom` to generate numbers.
DateGenerator
The DateGenerator is a generator for generating dates. It is implemented to support date entities.
For more details, please check the DateGenerator class implementation.
Let us first import the generator and initialize it. At initialization, the generator accepts the following parameters:
- date_format: The format in which the dates will be provided and generated (Default: "%d-%m-%Y").
- day_sigma: The number of days to add or subtract from the date when using the random generator method (see below) (Default: 30).
from anonipy.anonymize.generators import DateGenerator
date_generator = DateGenerator()
To use the generator, call the generate method. It accepts the following parameters:
- entity: The date entity to generate a replacement for.
- output_gen: The method used to generate the date (Default: "random"). It can be one of:
  - random: generates a random date within day_sigma days before or after the entity date.
  - first_day_of_the_month: returns the first day of the month of the entity date.
  - last_day_of_the_month: returns the last day of the month of the entity date.
  - middle_of_the_month: returns the middle day of the month of the entity date.
  - middle_of_the_year: returns the middle day of the year of the entity date.
Using the above parameters, this generator will create the appropriate date suggestions:
entities[2].text
'20-05-2024'
date_generator.generate(entities[2], output_gen="random")
'26-05-2024'
date_generator.generate(entities[2], output_gen="first_day_of_the_month")
'01-05-2024'
date_generator.generate(entities[2], output_gen="last_day_of_the_month")
'31-05-2024'
date_generator.generate(entities[2], output_gen="middle_of_the_month")
'15-05-2024'
date_generator.generate(entities[2], output_gen="middle_of_the_year")
'01-07-2024'
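The output_gen options can be mirrored with the standard datetime and calendar modules. The shift_date function below is a hypothetical sketch, not the DateGenerator implementation; in particular, treating day 15 as the middle of the month and 1 July as the middle of the year are assumptions inferred from the example outputs above.

```python
import calendar
import random
from datetime import datetime, timedelta

DATE_FORMAT = "%d-%m-%Y"  # same as the default date_format described above


def shift_date(text: str, output_gen: str = "random", day_sigma: int = 30) -> str:
    """Return a replacement date string for text according to output_gen."""
    date = datetime.strptime(text, DATE_FORMAT)
    if output_gen == "random":
        # shift by a uniformly chosen number of days in [-day_sigma, day_sigma]
        date += timedelta(days=random.randint(-day_sigma, day_sigma))
    elif output_gen == "first_day_of_the_month":
        date = date.replace(day=1)
    elif output_gen == "last_day_of_the_month":
        # monthrange returns (weekday of first day, number of days in month)
        date = date.replace(day=calendar.monthrange(date.year, date.month)[1])
    elif output_gen == "middle_of_the_month":
        date = date.replace(day=15)  # assumption: day 15 counts as the middle
    elif output_gen == "middle_of_the_year":
        date = date.replace(month=7, day=1)  # assumption: 1 July counts as the middle
    else:
        raise ValueError(f"Unknown output_gen: {output_gen}")
    return date.strftime(DATE_FORMAT)


for method in ("first_day_of_the_month", "last_day_of_the_month",
               "middle_of_the_month", "middle_of_the_year"):
    print(method, shift_date("20-05-2024", method))
```

Using calendar.monthrange avoids hard-coding month lengths and handles leap years.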
Furthermore, it will throw an error if the entity type is not date.
try:
    date_generator.generate(entities[0])
except Exception as e:
    print(e)
The entity type must be `date` to generate dates.
Creating a custom generator
Users can develop their own custom generators. To do this, the custom generator must inherit from the GeneratorInterface class.
The generator must define two methods, __init__ and generate, where the generate method must accept at least the entity.
An example of a custom generator that will generate only emojis is shown below:
import random

from anonipy.anonymize.generators import GeneratorInterface
from anonipy.definitions import Entity


class CustomGenerator(GeneratorInterface):
    def __init__(self):
        # the pool of replacements to choose from
        self.emojis = ["😄", "🤗", "😢"]

    def generate(self, entity: Entity) -> str:
        # ignore the entity content and return a random emoji
        return random.choice(self.emojis)
custom_generator = CustomGenerator()
custom_generator.generate(entities[0])
'😄'
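A custom generator can also keep state. As a self-contained sketch, the class below remembers the replacement it chose for each entity, so repeated mentions (such as the two "John Doe" entities) receive the same value. SimpleEntity is a hypothetical stand-in for anonipy's Entity with only the fields used here, and the class does not inherit from GeneratorInterface so that the example runs without anonipy installed; a real custom generator would inherit from it as described above.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleEntity:
    """Hypothetical stand-in for anonipy's Entity (only text and label)."""
    text: str
    label: str


class ConsistentEmojiGenerator:
    """Returns a random emoji per entity, but always the same one for the same entity."""

    def __init__(self, seed: Optional[int] = None):
        self.emojis = ["😄", "🤗", "😢"]
        self._rng = random.Random(seed)
        self._cache: dict[tuple[str, str], str] = {}

    def generate(self, entity: SimpleEntity) -> str:
        key = (entity.label, entity.text)
        if key not in self._cache:
            # first time this entity is seen: pick and remember a replacement
            self._cache[key] = self._rng.choice(self.emojis)
        return self._cache[key]


consistent_generator = ConsistentEmojiGenerator()
first = consistent_generator.generate(SimpleEntity(text="John Doe", label="name"))
second = consistent_generator.generate(SimpleEntity(text="John Doe", label="name"))
print(first == second)
```

Keying the cache on both the label and the text keeps identical strings with different labels independent of each other.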