Skip to content

Tutorial

Document anonymization using pipeline

In this blog post, we show how one can anonymize documents using the new Pipeline module. The module allows for a streamlined process of anonymizing documents, where the user defines how the anonymization should be performed and the locations, where the documents to be anonymized are located and where the anonymized documents should be stored.

The pipeline will automatically extract the text from the documents, anonymize the text, and store the anonymized text in the output folder.

Anonymizing collections of documents

In the previous blog post, we showed how one can anonymize text in document form. While the code is useful for processing a single document, anonymizing a collection of documents can take time if we run the script for each document separately.

In this blog post, we show how one can anonymize collections of documents. The process is similar to the previous blog post, but loads all required components only once, and anonymizes all documents in one go.

Anonymizing documents

The anonipy package was designed for anonymizing text. However, a lot of text data can be found in document form, such as PDFs, word documents, and other. Copying the text from the documents to be anonymized can be cumbersome. The anonipy package provides utility functions that extracts the text from the documents.

In this blog post, we explain how anonipy can be used to anonymize texts in document form.