
2024

Document anonymization using pipeline

In this blog post, we show how to anonymize documents using the new Pipeline module. The module streamlines the process: the user defines how the anonymization should be performed, where the documents to be anonymized are located, and where the anonymized documents should be stored.

The pipeline will automatically extract the text from the documents, anonymize the text, and store the anonymized text in the output folder.
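The read, anonymize, and store flow described above can be sketched with the standard library alone. This is only an illustration of the idea, not the Pipeline module's API: the names `anonymize_text` and `run_folder_pipeline` are hypothetical, the anonymization step is a simple e-mail mask, and only plain `.txt` files are handled.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical stand-in for a real anonymization step: masks e-mail addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def anonymize_text(text: str) -> str:
    """Replace every e-mail address in the text with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def run_folder_pipeline(input_dir: Path, output_dir: Path) -> list[Path]:
    """Read each .txt file in input_dir, anonymize it, and write it to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_dir.glob("*.txt")):
        text = source.read_text(encoding="utf-8")
        target = output_dir / source.name  # keep the original file name
        target.write_text(anonymize_text(text), encoding="utf-8")
        written.append(target)
    return written
```

Calling `run_folder_pipeline(Path("input"), Path("output"))` would process every text file in one pass and leave the originals untouched, since the results go to a separate folder.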

Anonymizing collections of documents

In the previous blog post, we showed how one can anonymize text in document form. While the code is useful for processing a single document, anonymizing a collection of documents can take time if we run the script for each document separately.

In this blog post, we show how one can anonymize collections of documents. The process is similar to the previous blog post, but loads all required components only once, and anonymizes all documents in one go.

Anonymizing documents

The anonipy package was designed for anonymizing text. However, a lot of text data comes in document form, such as PDFs, Word documents, and other formats. Copying the text out of such documents by hand can be cumbersome, so the anonipy package provides utility functions that extract the text from the documents.

In this blog post, we explain how anonipy can be used to anonymize texts in document form.

Extractors overview

In this post, we give an overview of the implemented extractors. The extractors are used to extract relevant named entities from text. These entities can be people's names, organizations, addresses, social security numbers, etc. The entities are then used to anonymize the text.
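To make the idea of an extractor concrete, here is a minimal sketch that finds entities with regular expressions and records where each one sits in the text. It is an assumption-laden stand-in, not one of anonipy's extractors: the `Entity` class, the `PATTERNS` table, and `extract_entities` are hypothetical names, and the two patterns cover only ISO-style dates and US-style social security numbers.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A piece of text to anonymize, with its label and character span."""
    text: str
    label: str
    start: int
    end: int


# Hypothetical patterns chosen for illustration only.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def extract_entities(text: str) -> list[Entity]:
    """Return all pattern matches as entities, ordered by position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.group(), label, match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity.start)
```

Keeping the start and end offsets with each entity is what lets a later step replace exactly that span in the original text.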

All extractors and their API references are available in the extractors module. What follows is the presentation of the different extractors anonipy provides.

Generators overview

In this post, we give an overview of the implemented generators. The generators are used to create new texts that serve as substitutes for the extracted named entities. The substitutes can then be used to replace the entities and anonymize the text.
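As a rough illustration of what a generator does, the sketch below produces a substitute for an entity based on its label. None of this is anonipy's API: `generate_substitute`, `scramble_digits`, and the name list are hypothetical, and the digit scrambling preserves only the format of the value, so a scrambled date is not guaranteed to be a valid calendar date.

```python
import random
import string

# Hypothetical pool of replacement names, for illustration only.
HYPOTHETICAL_NAMES = ["Alex Morgan", "Sam Taylor", "Jamie Lee"]


def scramble_digits(text: str, rng: random.Random) -> str:
    """Replace each ASCII digit with a random digit and keep all other characters."""
    return "".join(
        str(rng.randrange(10)) if ch in string.digits else ch for ch in text
    )


def generate_substitute(text: str, label: str, rng: random.Random) -> str:
    """Create a substitute for an entity, chosen according to its label."""
    if label == "name":
        # Avoid returning the original name if it happens to be in the pool.
        candidates = [name for name in HYPOTHETICAL_NAMES if name != text]
        return rng.choice(candidates)
    if label in {"date", "ssn"}:
        return scramble_digits(text, rng)
    # Fallback: a generic placeholder built from the label.
    return f"[{label.upper()}]"
```

Passing an explicit `random.Random` instance keeps the sketch reproducible when a seed is supplied, which is convenient for testing.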

All generators and their API references are available in the generators module. What follows is the presentation of the different generators anonipy provides.

Strategies overview

In this post, we give an overview of the implemented strategies. The strategies determine how the original text will be anonymized given the extracted named entities. They output the anonymized text and the list of replacements that were made to the original text.
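The input and output of a strategy can be sketched as follows. This is a simplified illustration under our own assumptions, not anonipy's implementation: the function `redact` is hypothetical, entities are plain `(start, end, label)` tuples, and they are assumed not to overlap.

```python
def redact(text: str, entities: list[tuple[int, int, str]]) -> tuple[str, list[dict]]:
    """Replace each entity span with a [LABEL] placeholder.

    Returns the anonymized text and the list of replacements, where the
    start and end offsets refer to positions in the original text.
    """
    anonymized = text
    replacements = []
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        substitute = f"[{label.upper()}]"
        replacements.append(
            {
                "original": text[start:end],
                "label": label,
                "start": start,
                "end": end,
                "substitute": substitute,
            }
        )
        anonymized = anonymized[:start] + substitute + anonymized[end:]
    replacements.reverse()  # report replacements in reading order
    return anonymized, replacements
```

Returning the replacements alongside the text makes it possible to audit what was changed, or to apply the same substitutions consistently elsewhere.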

All strategies and their API references are available in the strategies module. What follows is the presentation of the different strategies anonipy provides.