Extractors Module

anonipy.anonymize.extractors

Module containing the extractors.

The extractors module provides a set of extractors used to identify relevant information within a document.

Classes:

Name Description
NERExtractor

The class representing the named entity recognition (NER) extractor.

PatternExtractor

The class representing the pattern extractor.

MultiExtractor

The class representing the multi extractor.

anonipy.anonymize.extractors.NERExtractor

Bases: ExtractorInterface

The class representing the named entity recognition (NER) extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import NERExtractor
>>> labels = [{"label": "PERSON", "type": "string"}]
>>> extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH)
>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
Doc, [Entity]

Attributes:

Name Type Description
labels List[dict]

The list of labels to extract.

lang str

The language of the text to extract.

score_th float

The score threshold.

use_gpu bool

Whether to use GPU.

gliner_model str

The gliner model to use.

pipeline Language

The spacy pipeline for extracting entities.

spacy_style str

The style the entities should be stored in the spacy doc.

Methods:

Name Description
__call__

Extract the entities from the text.

display

Display the entities in the text.

Source code in anonipy/anonymize/extractors/ner_extractor.py
class NERExtractor(ExtractorInterface):
    """The class representing the named entity recognition (NER) extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import NERExtractor
        >>> labels = [{"label": "PERSON", "type": "string"}]
        >>> extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH)
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        Doc, [Entity]

    Attributes:
        labels (List[dict]): The list of labels to extract.
        lang (str): The language of the text to extract.
        score_th (float): The score threshold.
        use_gpu (bool): Whether to use GPU.
        gliner_model (str): The gliner model to use.
        pipeline (Language): The spacy pipeline for extracting entities.
        spacy_style (str): The style the entities should be stored in the spacy doc.

    Methods:
        __call__(self, text):
            Extract the entities from the text.
        display(self, doc):
            Display the entities in the text.

    """

    def __init__(
        self,
        labels: List[dict],
        *args,
        lang: LANGUAGES = LANGUAGES.ENGLISH,
        score_th: float = 0.5,
        use_gpu: bool = False,
        gliner_model: str = "urchade/gliner_multi_pii-v1",
        spacy_style: str = "ent",
        **kwargs,
    ):
        """Initialize the named entity recognition (NER) extractor.

        Examples:
            >>> from anonipy.constants import LANGUAGES
            >>> from anonipy.anonymize.extractors import NERExtractor
            >>> labels = [{"label": "PERSON", "type": "string"}]
            >>> extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH)
            NERExtractor()

        Args:
            labels: The list of labels to extract.
            lang: The language of the text to extract.
            score_th: The score threshold. Entities with a score below this threshold will be ignored.
            use_gpu: Whether to use GPU.
            gliner_model: The gliner model to use to identify the entities.
            spacy_style: The style the entities should be stored in the spacy doc. Options: `ent` or `span`.

        """

        super().__init__(labels, *args, **kwargs)
        self.lang = lang
        self.score_th = score_th
        self.use_gpu = use_gpu
        self.gliner_model = gliner_model
        self.spacy_style = spacy_style
        self.labels = self._prepare_labels(labels)

        with warnings.catch_warnings():
            # TODO: remove once the GLiNER package includes the fix (improper file closing)
            warnings.filterwarnings("ignore", category=ResourceWarning)
            self.pipeline = self._prepare_pipeline()

    def __call__(self, text: str, detect_repeats: bool = False, *args, **kwargs) -> Tuple[Doc, List[Entity]]:
        """Extract the entities from the text.

        Examples:
            >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
            Doc, [Entity]

        Args:
            text: The text to extract entities from.
            detect_repeats: Whether to check text again for repeated entities.

        Returns:
            The spacy document.
            The list of extracted entities.

        """

        doc = self.pipeline(text)
        anoni_entities, spacy_entities = self._prepare_entities(doc)

        if detect_repeats:
            anoni_entities = detect_repeated_entities(doc, anoni_entities, self.spacy_style)

        create_spacy_entities(doc, anoni_entities, self.spacy_style)

        return doc, anoni_entities

    def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
        """Display the entities in the text.

        Examples:
            >>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
            >>> extractor.display(doc)
            HTML

        Args:
            doc: The spacy doc to display.
            page: Whether to display the doc in a web browser.
            jupyter: Whether to display the doc in a jupyter notebook.

        Returns:
            The HTML representation of the document and the extracted entities.

        """

        options = {
            "colors": {l["label"]: get_label_color(l["label"]) for l in self.labels}
        }
        return displacy.render(
            doc, style=self.spacy_style, options=options, page=page, jupyter=jupyter
        )

    # ===========================================
    # Private methods
    # ===========================================

    def _prepare_labels(self, labels: List[dict]) -> List[dict]:
        """Prepare the labels for the extractor.

        The provided labels are enriched with the corresponding regex
        definitions, if the `regex` key was not provided.

        Args:
            labels: The list of labels to prepare.

        Returns:
            The enriched labels.

        """
        for l in labels:
            if "regex" in l:
                continue
            regex = regex_mapping[l["type"]]
            if regex is not None:
                l["regex"] = regex
        return labels

    def _create_gliner_config(self) -> dict:
        """Create the config for the GLINER model.

        Returns:
            The configuration dictionary for the GLINER model.

        """

        map_location = "cpu"
        if self.use_gpu and not torch.cuda.is_available():
            # warn and keep going so that a config dict is still returned
            warnings.warn(
                "The user requested GPU use, but no available GPU was found. Reverting back to CPU use."
            )
        if self.use_gpu and torch.cuda.is_available():
            map_location = "cuda"

        return {
            # the model is specialized for extracting PII data
            "gliner_model": self.gliner_model,
            "labels": [l["label"] for l in self.labels],
            "threshold": self.score_th,
            "chunk_size": 384,
            "style": self.spacy_style,
            "map_location": map_location,
        }

    def _prepare_pipeline(self) -> Language:
        """Prepare the spacy pipeline.

        Prepares the pipeline for processing the text in the corresponding
        provided language.

        Returns:
            The spacy text processing and extraction pipeline.

        """

        # load the appropriate parser for the language
        module_lang, class_lang = self.lang[0].lower(), self.lang[1].lower().title()
        language_module = importlib.import_module(f"spacy.lang.{module_lang}")
        language_class = getattr(language_module, class_lang)
        # initialize the language parser
        nlp = language_class()
        nlp.add_pipe("sentencizer")
        gliner_config = self._create_gliner_config()
        nlp.add_pipe("gliner_spacy", config=gliner_config)
        return nlp

    def _prepare_entities(self, doc: Doc) -> Tuple[List[Entity], List[Span]]:
        """Prepares the anonipy and spacy entities.

        Args:
            doc: The spacy doc to prepare.

        Returns:
            The list of anonipy entities.
            The list of spacy entities.

        """

        # TODO: make this part more generic
        anoni_entities = []
        spacy_entities = []
        for s in get_doc_entity_spans(doc, self.spacy_style):
            label = list(filter(lambda x: x["label"] == s.label_, self.labels))[0]
            if re.match(label["regex"], s.text):
                anoni_entities.append(convert_spacy_to_entity(s, **label))
                spacy_entities.append(s)
        return anoni_entities, spacy_entities
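
The label handling in `_prepare_labels` and the regex filter in `_prepare_entities` can be tried out without spaCy or GLiNER. The following is a minimal sketch, not the library's implementation: `REGEX_MAPPING` is a hypothetical stand-in for anonipy's `regex_mapping` (its real patterns may differ), and candidates are plain `(label, text)` tuples instead of spaCy spans.

```python
import copy
import re

# Hypothetical stand-in for anonipy's `regex_mapping`; the real patterns may differ.
REGEX_MAPPING = {
    "string": r".*",
    "integer": r"\d+",
}


def prepare_labels(labels):
    """Add a default `regex` to every label that does not define one."""
    # The method shown above edits the dicts in place; this sketch works on a copy.
    enriched = copy.deepcopy(labels)
    for label in enriched:
        if "regex" in label:
            continue
        regex = REGEX_MAPPING[label["type"]]
        if regex is not None:
            label["regex"] = regex
    return enriched


def keep_matching(candidates, labels):
    """Keep the (label, text) candidates whose text matches the label's regex."""
    by_name = {label["label"]: label for label in labels}
    kept = []
    for name, text in candidates:
        # re.match only anchors at the start of the text
        if re.match(by_name[name]["regex"], text):
            kept.append((name, text))
    return kept


labels = prepare_labels([
    {"label": "PERSON", "type": "string"},
    {"label": "AGE", "type": "integer"},
])
candidates = [("PERSON", "John Doe"), ("AGE", "19"), ("AGE", "nineteen")]
print(keep_matching(candidates, labels))
```

Because the filter reads `label["regex"]` directly, a label that ends up without a `regex` key (for example, if the mapping yields `None` for its type) would raise a `KeyError` at extraction time.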

__init__(labels, *args, lang=LANGUAGES.ENGLISH, score_th=0.5, use_gpu=False, gliner_model='urchade/gliner_multi_pii-v1', spacy_style='ent', **kwargs)

Initialize the named entity recognition (NER) extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import NERExtractor
>>> labels = [{"label": "PERSON", "type": "string"}]
>>> extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH)
NERExtractor()

Parameters:

Name Type Description Default
labels List[dict]

The list of labels to extract.

required
lang LANGUAGES

The language of the text to extract.

ENGLISH
score_th float

The score threshold. Entities with a score below this threshold will be ignored.

0.5
use_gpu bool

Whether to use GPU.

False
gliner_model str

The gliner model to use to identify the entities.

'urchade/gliner_multi_pii-v1'
spacy_style str

The style the entities should be stored in the spacy doc. Options: ent or span.

'ent'
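
The `lang` value is indexed as a pair when the pipeline is prepared: the first item selects the `spacy.lang` submodule and the second the language class name. The sketch below only reproduces that string handling. The tuple `("en", "English")` is an assumed shape for `LANGUAGES.ENGLISH`, and `resolve_spacy_language` is a hypothetical helper name.

```python
def resolve_spacy_language(lang):
    """Return the module path and class name built from a (code, name) pair."""
    module_lang = lang[0].lower()
    class_lang = lang[1].lower().title()
    return f"spacy.lang.{module_lang}", class_lang


# Assumed shape of LANGUAGES.ENGLISH; the real constant may differ.
ENGLISH = ("en", "English")
print(resolve_spacy_language(ENGLISH))
print(resolve_spacy_language(("SL", "SLOVENIAN")))
```

The casing of the input does not matter, since both parts are normalized before the module is imported.
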
Source code in anonipy/anonymize/extractors/ner_extractor.py
def __init__(
    self,
    labels: List[dict],
    *args,
    lang: LANGUAGES = LANGUAGES.ENGLISH,
    score_th: float = 0.5,
    use_gpu: bool = False,
    gliner_model: str = "urchade/gliner_multi_pii-v1",
    spacy_style: str = "ent",
    **kwargs,
):
    """Initialize the named entity recognition (NER) extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import NERExtractor
        >>> labels = [{"label": "PERSON", "type": "string"}]
        >>> extractor = NERExtractor(labels, lang=LANGUAGES.ENGLISH)
        NERExtractor()

    Args:
        labels: The list of labels to extract.
        lang: The language of the text to extract.
        score_th: The score threshold. Entities with a score below this threshold will be ignored.
        use_gpu: Whether to use GPU.
        gliner_model: The gliner model to use to identify the entities.
        spacy_style: The style the entities should be stored in the spacy doc. Options: `ent` or `span`.

    """

    super().__init__(labels, *args, **kwargs)
    self.lang = lang
    self.score_th = score_th
    self.use_gpu = use_gpu
    self.gliner_model = gliner_model
    self.spacy_style = spacy_style
    self.labels = self._prepare_labels(labels)

    with warnings.catch_warnings():
        # TODO: remove once the GLiNER package includes the fix (improper file closing)
        warnings.filterwarnings("ignore", category=ResourceWarning)
        self.pipeline = self._prepare_pipeline()
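
One way to handle `use_gpu` when no GPU is present is to warn and fall back to CPU while still returning a usable device string. This is a minimal sketch under that assumption: `select_map_location` is a hypothetical helper, and the `cuda_available` argument stands in for `torch.cuda.is_available()` so the example runs without PyTorch.

```python
import warnings


def select_map_location(use_gpu, cuda_available):
    """Pick the device string, warning when a requested GPU is missing."""
    if use_gpu and not cuda_available:
        warnings.warn("GPU requested but none found; falling back to CPU.")
        return "cpu"
    return "cuda" if use_gpu else "cpu"


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    device = select_map_location(use_gpu=True, cuda_available=False)

print(device, len(caught))
```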

__call__(text, detect_repeats=False, *args, **kwargs)

Extract the entities from the text.

Examples:

>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
Doc, [Entity]

Parameters:

Name Type Description Default
text str

The text to extract entities from.

required
detect_repeats bool

Whether to check text again for repeated entities.

False

Returns:

Type Description
Doc

The spacy document.

List[Entity]

The list of extracted entities.

Source code in anonipy/anonymize/extractors/ner_extractor.py
def __call__(self, text: str, detect_repeats: bool = False, *args, **kwargs) -> Tuple[Doc, List[Entity]]:
    """Extract the entities from the text.

    Examples:
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        Doc, [Entity]

    Args:
        text: The text to extract entities from.
        detect_repeats: Whether to check text again for repeated entities.

    Returns:
        The spacy document.
        The list of extracted entities.

    """

    doc = self.pipeline(text)
    anoni_entities, spacy_entities = self._prepare_entities(doc)

    if detect_repeats:
        anoni_entities = detect_repeated_entities(doc, anoni_entities, self.spacy_style)

    create_spacy_entities(doc, anoni_entities, self.spacy_style)

    return doc, anoni_entities
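
The `detect_repeats` flag asks the extractor to look through the text again for repeated entities. The helper that does this, `detect_repeated_entities`, is not shown on this page, so the following is only a sketch of one plausible strategy: search for every other literal occurrence of an already extracted entity text. `FoundEntity` and `add_repeats` are hypothetical names, not part of anonipy.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FoundEntity:  # hypothetical stand-in for anonipy's Entity
    text: str
    label: str
    start: int
    end: int


def add_repeats(text, entities):
    """Return the entities plus every other occurrence of their text."""
    seen = {(e.start, e.end) for e in entities}
    result = list(entities)
    for entity in entities:
        # re.escape treats the entity text as a literal, not as a pattern
        for match in re.finditer(re.escape(entity.text), text):
            if match.span() in seen:
                continue
            seen.add(match.span())
            result.append(FoundEntity(entity.text, entity.label, *match.span()))
    return sorted(result, key=lambda e: e.start)


text = "John Doe met Jane. John Doe left."
found = [FoundEntity("John Doe", "PERSON", 0, 8)]
for entity in add_repeats(text, found):
    print(entity.start, entity.end, entity.text)
```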

display(doc, page=False, jupyter=None)

Display the entities in the text.

Examples:

>>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
>>> extractor.display(doc)
HTML

Parameters:

Name Type Description Default
doc Doc

The spacy doc to display.

required
page bool

Whether to display the doc in a web browser.

False
jupyter bool

Whether to display the doc in a jupyter notebook.

None

Returns:

Type Description
str

The HTML representation of the document and the extracted entities.

Source code in anonipy/anonymize/extractors/ner_extractor.py
def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
    """Display the entities in the text.

    Examples:
        >>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
        >>> extractor.display(doc)
        HTML

    Args:
        doc: The spacy doc to display.
        page: Whether to display the doc in a web browser.
        jupyter: Whether to display the doc in a jupyter notebook.

    Returns:
        The HTML representation of the document and the extracted entities.

    """

    options = {
        "colors": {l["label"]: get_label_color(l["label"]) for l in self.labels}
    }
    return displacy.render(
        doc, style=self.spacy_style, options=options, page=page, jupyter=jupyter
    )
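
`display` passes displaCy an `options` dict whose `colors` entry maps each label name to a color. The sketch below builds a dict of that shape without spaCy. `label_color` is a hypothetical stand-in for `get_label_color`, whose actual color scheme is not shown here.

```python
import hashlib


def label_color(label):
    """Hypothetical stand-in for get_label_color: a stable hex color per label."""
    digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
    return f"#{digest[:6]}"


def build_display_options(labels):
    """Build an options dict in the shape passed to displacy.render."""
    return {"colors": {l["label"]: label_color(l["label"]) for l in labels}}


labels = [{"label": "PERSON", "type": "string"}, {"label": "AGE", "type": "integer"}]
options = build_display_options(labels)
print(sorted(options["colors"]))
```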

anonipy.anonymize.extractors.PatternExtractor

Bases: ExtractorInterface

The class representing the pattern extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import PatternExtractor
>>> labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
>>> extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
Doc, [Entity]

Attributes:

Name Type Description
labels List[dict]

The list of labels and patterns to extract.

lang str

The language of the text to extract.

pipeline Language

The spacy pipeline for extracting entities.

token_matchers Matcher

The spacy token pattern matcher.

global_matchers function

The global pattern matcher.

Methods:

Name Description
__call__

Extract the entities from the text.

display

Display the entities in the text.

Source code in anonipy/anonymize/extractors/pattern_extractor.py
class PatternExtractor(ExtractorInterface):
    """The class representing the pattern extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import PatternExtractor
        >>> labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
        >>> extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        Doc, [Entity]

    Attributes:
        labels (List[dict]): The list of labels and patterns to extract.
        lang (str): The language of the text to extract.
        pipeline (Language): The spacy pipeline for extracting entities.
        token_matchers (Matcher): The spacy token pattern matcher.
        global_matchers (function): The global pattern matcher.

    Methods:
        __call__(self, text):
            Extract the entities from the text.
        display(self, doc):
            Display the entities in the text.

    """

    def __init__(
        self,
        labels: List[dict],
        *args,
        lang: LANGUAGES = LANGUAGES.ENGLISH,
        spacy_style: str = "ent",
        **kwargs,
    ):
        """Initialize the pattern extractor.

        Examples:
            >>> from anonipy.constants import LANGUAGES
            >>> from anonipy.anonymize.extractors import PatternExtractor
            >>> labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
            >>> extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
            PatternExtractor()

        Args:
            labels: The list of labels and patterns to extract.
            lang: The language of the text to extract.
            spacy_style: The style the entities should be stored in the spacy doc. Options: `ent` or `span`.

        """

        super().__init__(labels, *args, **kwargs)
        self.lang = lang
        self.labels = labels
        self.spacy_style = spacy_style
        self.pipeline = self._prepare_pipeline()
        self.token_matchers = self._prepare_token_matchers()
        self.global_matchers = self._prepare_global_matchers()

    def __call__(self, text: str, detect_repeats: bool = False, *args, **kwargs) -> Tuple[Doc, List[Entity]]:
        """Extract the entities from the text.

        Examples:
            >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
            Doc, [Entity]

        Args:
            text: The text to extract entities from.
            detect_repeats: Whether to check text again for repeated entities.

        Returns:
            The spacy document.
            The list of extracted entities.

        """

        doc = self.pipeline(text)
        if self.token_matchers:
            self.token_matchers(doc)
        if self.global_matchers:
            self.global_matchers(doc)
        anoni_entities, spacy_entities = self._prepare_entities(doc)

        if detect_repeats:
            anoni_entities = detect_repeated_entities(doc, anoni_entities, self.spacy_style)

        create_spacy_entities(doc, anoni_entities, self.spacy_style)

        return doc, anoni_entities

    def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
        """Display the entities in the text.

        Examples:
            >>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
            >>> extractor.display(doc)
            HTML

        Args:
            doc: The spacy doc to display.
            page: Whether to display the doc in a web browser.
            jupyter: Whether to display the doc in a jupyter notebook.

        Returns:
            The HTML representation of the document and the extracted entities.

        """

        options = {
            "colors": {l["label"]: get_label_color(l["label"]) for l in self.labels}
        }
        return displacy.render(
            doc, style=self.spacy_style, options=options, page=page, jupyter=jupyter
        )

    # ===========================================
    # Private methods
    # ===========================================

    def _prepare_pipeline(self) -> Language:
        """Prepare the spacy pipeline.

        Prepares the pipeline for processing the text in the corresponding
        provided language.

        Returns:
            The spacy text processing and extraction pipeline.

        """

        # load the appropriate parser for the language
        module_lang, class_lang = self.lang[0].lower(), self.lang[1].lower().title()
        language_module = importlib.import_module(f"spacy.lang.{module_lang}")
        language_class = getattr(language_module, class_lang)
        # initialize the language parser
        nlp = language_class()
        nlp.add_pipe("sentencizer")
        return nlp

    def _prepare_token_matchers(self) -> Optional[Matcher]:
        """Prepare the token pattern matchers.

        Prepares the token pattern matchers for the provided labels.

        Returns:
            The spacy matcher object or None if no relevant labels are provided.

        """

        relevant_labels = list(filter(lambda l: "pattern" in l, self.labels))
        if len(relevant_labels) == 0:
            return None

        matcher = Matcher(self.pipeline.vocab)
        for label in relevant_labels:
            if isinstance(label["pattern"], list):
                on_match = self._create_add_event_ent(label["label"])
                matcher.add(label["label"], label["pattern"], on_match=on_match)
        return matcher

    def _prepare_global_matchers(self) -> Optional[Callable]:
        """Prepares the global pattern matchers.

        Prepares the global pattern matchers for the provided labels.

        Returns:
            The function used to match the patterns or None if no relevant labels are provided.

        """

        relevant_labels = list(filter(lambda l: "regex" in l, self.labels))
        if len(relevant_labels) == 0:
            return None

        def global_matchers(doc: Doc) -> None:
            for label in relevant_labels:
                for match in re.finditer(label["regex"], doc.text):
                    # define the entity span
                    start, end = match.span(1)
                    entity = doc.char_span(start, end, label=label["label"])
                    if not entity:
                        continue
                    entity._.score = 1.0
                    entities = [convert_spacy_to_entity(entity)]
                    # add the entity to the previous entity list
                    create_spacy_entities(doc, entities, self.spacy_style)

        return global_matchers

    def _prepare_entities(self, doc: Doc) -> Tuple[List[Entity], List[Span]]:
        """Prepares the anonipy and spacy entities.

        Args:
            doc: The spacy doc to prepare.

        Returns:
            The list of anonipy entities.
            The list of spacy entities.

        """

        # TODO: make this part more generic
        anoni_entities = []
        spacy_entities = []
        for e in get_doc_entity_spans(doc, self.spacy_style):
            label = list(filter(lambda x: x["label"] == e.label_, self.labels))[0]
            anoni_entities.append(convert_spacy_to_entity(e, **label))
            spacy_entities.append(e)
        return anoni_entities, spacy_entities

    def _create_add_event_ent(self, label: str) -> Callable:
        """Create the add event entity function.

        Args:
            label: The identified label entity.

        Returns:
            The function used to add the entity to the spacy doc.

        """

        def add_event_ent(matcher, doc, i, matches):
            # define the entity span
            _, start, end = matches[i]
            entity = Span(doc, start, end, label=label)
            if not entity:
                return
            entity._.score = 1.0
            entities = [convert_spacy_to_entity(entity)]
            create_spacy_entities(doc, entities, self.spacy_style)

        return add_event_ent
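
The global matcher reads `match.span(1)`, so each `regex` label is expected to contain at least one capturing group, and only the first group becomes the entity. The sketch below reproduces that step with the standard library alone. `find_regex_entities` is a hypothetical helper that returns plain tuples instead of spaCy spans.

```python
import re


def find_regex_entities(text, labels):
    """Collect (label, start, end, text) for group 1 of every regex match."""
    found = []
    for label in labels:
        if "regex" not in label:
            continue
        for match in re.finditer(label["regex"], text):
            start, end = match.span(1)  # needs a capturing group in the regex
            found.append((label["label"], start, end, text[start:end]))
    return found


labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
text = "John Doe is a 19 year old software engineer."
print(find_regex_entities(text, labels))

# A pattern without a capturing group has no group 1 to read.
try:
    find_regex_entities(text, [{"label": "AGE", "regex": r"\d+"}])
except IndexError as error:
    print("IndexError:", error)
```

In the extractor itself the offsets are then passed to `doc.char_span`, which returns `None` when they do not line up with token boundaries, so such matches are skipped.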

__init__(labels, *args, lang=LANGUAGES.ENGLISH, spacy_style='ent', **kwargs)

Initialize the pattern extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import PatternExtractor
>>> labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
>>> extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
PatternExtractor()

Parameters:

Name Type Description Default
labels List[dict]

The list of labels and patterns to extract.

required
lang LANGUAGES

The language of the text to extract.

ENGLISH
spacy_style str

The style the entities should be stored in the spacy doc. Options: ent or span.

'ent'
Source code in anonipy/anonymize/extractors/pattern_extractor.py
def __init__(
    self,
    labels: List[dict],
    *args,
    lang: LANGUAGES = LANGUAGES.ENGLISH,
    spacy_style: str = "ent",
    **kwargs,
):
    """Initialize the pattern extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import PatternExtractor
        >>> labels = [{"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"}]
        >>> extractor = PatternExtractor(labels, lang=LANGUAGES.ENGLISH)
        PatternExtractor()

    Args:
        labels: The list of labels and patterns to extract.
        lang: The language of the text to extract.
        spacy_style: The style the entities should be stored in the spacy doc. Options: `ent` or `span`.

    """

    super().__init__(labels, *args, **kwargs)
    self.lang = lang
    self.labels = labels
    self.spacy_style = spacy_style
    self.pipeline = self._prepare_pipeline()
    self.token_matchers = self._prepare_token_matchers()
    self.global_matchers = self._prepare_global_matchers()
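
Each label can feed one or both matchers: a `pattern` key (a list of spaCy token patterns) goes to the token matcher, and a `regex` key goes to the global matcher. The sketch below shows that routing on plain dicts. `split_labels` is a hypothetical helper, and the `AGE` token pattern is an illustrative assumption.

```python
def split_labels(labels):
    """Split labels into token-pattern labels and regex labels."""
    token_labels = [
        l for l in labels if "pattern" in l and isinstance(l["pattern"], list)
    ]
    regex_labels = [l for l in labels if "regex" in l]
    return token_labels, regex_labels


labels = [
    {"label": "PERSON", "type": "string", "regex": "([A-Z][a-z]+ [A-Z][a-z]+)"},
    # spaCy Matcher format: a list of patterns, each a list of token specs
    {"label": "AGE", "type": "integer", "pattern": [[{"IS_DIGIT": True}, {"LOWER": "year"}]]},
    # a pattern that is not a list is skipped by the isinstance check
    {"label": "BAD", "type": "string", "pattern": "not-a-list"},
]
token_labels, regex_labels = split_labels(labels)
print([l["label"] for l in token_labels], [l["label"] for l in regex_labels])
```

A label whose `pattern` is not a list is silently ignored rather than rejected, so a malformed pattern produces no entities and no error.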

__call__(text, detect_repeats=False, *args, **kwargs)

Extract the entities from the text.

Examples:

>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
Doc, [Entity]

Parameters:

Name Type Description Default
text str

The text to extract entities from.

required
detect_repeats bool

Whether to check text again for repeated entities.

False

Returns:

Type Description
Doc

The spacy document.

List[Entity]

The list of extracted entities.

Source code in anonipy/anonymize/extractors/pattern_extractor.py
def __call__(self, text: str, detect_repeats: bool = False, *args, **kwargs) -> Tuple[Doc, List[Entity]]:
    """Extract the entities from the text.

    Examples:
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        Doc, [Entity]

    Args:
        text: The text to extract entities from.
        detect_repeats: Whether to check text again for repeated entities.

    Returns:
        The spacy document.
        The list of extracted entities.

    """

    doc = self.pipeline(text)
    if self.token_matchers:
        self.token_matchers(doc)
    if self.global_matchers:
        self.global_matchers(doc)
    anoni_entities, spacy_entities = self._prepare_entities(doc)

    if detect_repeats:
        anoni_entities = detect_repeated_entities(doc, anoni_entities, self.spacy_style)

    create_spacy_entities(doc, anoni_entities, self.spacy_style)

    return doc, anoni_entities

display(doc, page=False, jupyter=None)

Display the entities in the text.

Examples:

>>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
>>> extractor.display(doc)
HTML

Parameters:

Name Type Description Default
doc Doc

The spacy doc to display.

required
page bool

Whether to display the doc in a web browser.

False
jupyter bool

Whether to display the doc in a jupyter notebook.

None

Returns:

Type Description
str

The HTML representation of the document and the extracted entities.

Source code in anonipy/anonymize/extractors/pattern_extractor.py
def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
    """Display the entities in the text.

    Examples:
        >>> doc, entities = extractor("John Doe is a 19 year old software engineer.")
        >>> extractor.display(doc)
        HTML

    Args:
        doc: The spacy doc to display.
        page: Whether to display the doc in a web browser.
        jupyter: Whether to display the doc in a jupyter notebook.

    Returns:
        The HTML representation of the document and the extracted entities.

    """

    options = {
        "colors": {l["label"]: get_label_color(l["label"]) for l in self.labels}
    }
    return displacy.render(
        doc, style=self.spacy_style, options=options, page=page, jupyter=jupyter
    )

anonipy.anonymize.extractors.MultiExtractor

The class representing the multi extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
>>> extractors = [
...     NERExtractor(ner_labels, lang=LANGUAGES.ENGLISH),
...     PatternExtractor(pattern_labels, lang=LANGUAGES.ENGLISH),
... ]
>>> extractor = MultiExtractor(extractors)
>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
[(Doc, [Entity]), (Doc, [Entity])], [Entity]

Attributes:

Name Type Description
extractors List[ExtractorInterface]

The list of extractors to use.

Methods:

Name Description
__call__

Extract the entities from the text using the provided extractors.

display

Display the entities extracted from the text document.

Source code in anonipy/anonymize/extractors/multi_extractor.py
class MultiExtractor:
    """The class representing the multi extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
        >>> extractors = [
        ...     NERExtractor(ner_labels, lang=LANGUAGES.ENGLISH),
        ...     PatternExtractor(pattern_labels, lang=LANGUAGES.ENGLISH),
        ... ]
        >>> extractor = MultiExtractor(extractors)
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        [(Doc, [Entity]), (Doc, [Entity])], [Entity]

    Attributes:
        extractors (List[ExtractorInterface]):
            The list of extractors to use.

    Methods:
        __call__(self, text):
            Extract the entities from the text using the provided extractors.
        display(self, doc):
            Display the entities extracted from the text document.

    """

    def __init__(self, extractors: List[ExtractorInterface]):
        """Initialize the multi extractor.

        Examples:
            >>> from anonipy.constants import LANGUAGES
            >>> from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
            >>> extractors = [
            ...     NERExtractor(ner_labels, lang=LANGUAGES.ENGLISH),
            ...     PatternExtractor(pattern_labels, lang=LANGUAGES.ENGLISH),
            ... ]
            >>> extractor = MultiExtractor(extractors)
            MultiExtractor()

        Args:
            extractors: The list of extractors to use.

        """
        if len(extractors) == 0:
            raise ValueError("At least one extractor must be provided.")
        if not all(isinstance(e, ExtractorInterface) for e in extractors):
            raise ValueError("All extractors must be instances of ExtractorInterface.")

        self.extractors = extractors

    def __call__(
        self, text: str, detect_repeats: bool = False
    ) -> Tuple[List[Tuple[Doc, List[Entity]]], List[Entity]]:
        """Extract the entities from the text using the provided extractors.

        Examples:
            >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
            [(Doc, [Entity]), (Doc, [Entity])], [Entity]

        Args:
            text: The text to extract entities from.
            detect_repeats: Whether to check text again for repeated entities.

        Returns:
            The list of extractor outputs containing the tuple (spacy document, extracted entities).
            The list of joint entities.

        """

        extractor_outputs = [e(text, detect_repeats) for e in self.extractors]
        joint_entities = merge_entities(extractor_outputs)

        return extractor_outputs, joint_entities

    def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
        """Display the entities in the text.

        Examples:
            >>> extractor_outputs, entities = extractor("John Doe is a 19 year old software engineer.")
            >>> extractor.display(extractor_outputs[0][0])
            HTML

        Args:
            doc: The spacy doc to display.
            page: Whether to display the doc in a web browser.
            jupyter: Whether to display the doc in a jupyter notebook.

        Returns:
            The HTML representation of the document and the extracted entities.

        """

        labels = list(
            itertools.chain.from_iterable([e.labels for e in self.extractors])
        )
        options = {"colors": {l["label"]: get_label_color(l["label"]) for l in labels}}
        return displacy.render(
            doc, style="ent", options=options, page=page, jupyter=jupyter
        )

__init__(extractors)

Initialize the multi extractor.

Examples:

>>> from anonipy.constants import LANGUAGES
>>> from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
>>> extractors = [
...     NERExtractor(ner_labels, lang=LANGUAGES.ENGLISH),
...     PatternExtractor(pattern_labels, lang=LANGUAGES.ENGLISH),
... ]
>>> extractor = MultiExtractor(extractors)
MultiExtractor()

Parameters:

Name Type Description Default
extractors List[ExtractorInterface]

The list of extractors to use.

required
Source code in anonipy/anonymize/extractors/multi_extractor.py
def __init__(self, extractors: List[ExtractorInterface]):
    """Initialize the multi extractor.

    Examples:
        >>> from anonipy.constants import LANGUAGES
        >>> from anonipy.anonymize.extractors import NERExtractor, PatternExtractor, MultiExtractor
        >>> extractors = [
        ...     NERExtractor(ner_labels, lang=LANGUAGES.ENGLISH),
        ...     PatternExtractor(pattern_labels, lang=LANGUAGES.ENGLISH),
        ... ]
        >>> extractor = MultiExtractor(extractors)
        MultiExtractor()

    Args:
        extractors: The list of extractors to use.

    """
    if len(extractors) == 0:
        raise ValueError("At least one extractor must be provided.")
    if not all(isinstance(e, ExtractorInterface) for e in extractors):
        raise ValueError("All extractors must be instances of ExtractorInterface.")

    self.extractors = extractors
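
The two checks in the constructor can be tried without loading any models. The sketch below is only an illustration: `ExtractorInterface`, `DummyExtractor`, `MultiExtractorSketch`, and `try_build` are stand-ins defined here, not the anonipy classes, and only the validation logic shown above is copied.

```python
from typing import List, Optional


class ExtractorInterface:
    """Stand-in for anonipy's ExtractorInterface (illustration only)."""


class DummyExtractor(ExtractorInterface):
    """Hypothetical extractor that does nothing."""


class MultiExtractorSketch:
    """Mirrors only the constructor validation shown in the source above."""

    def __init__(self, extractors: List[ExtractorInterface]):
        if len(extractors) == 0:
            raise ValueError("At least one extractor must be provided.")
        if not all(isinstance(e, ExtractorInterface) for e in extractors):
            raise ValueError("All extractors must be instances of ExtractorInterface.")
        self.extractors = extractors


def try_build(extractors) -> Optional[str]:
    """Return the error message, or None when construction succeeds."""
    try:
        MultiExtractorSketch(extractors)
    except ValueError as err:
        return str(err)
    return None


empty_error = try_build([])
type_error = try_build([DummyExtractor(), "not an extractor"])
ok_error = try_build([DummyExtractor()])
```

Both failure modes raise `ValueError`, so a caller that wants to tell them apart has to inspect the message.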

__call__(text, detect_repeats=False)

Extract the entities from the text using the provided extractors.

Examples:

>>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
[(Doc, [Entity]), (Doc, [Entity])], [Entity]

Parameters:

Name Type Description Default
text str

The text to extract entities from.

required
detect_repeats bool

Whether to check text again for repeated entities.

False

Returns:

Type Description
List[Tuple[Doc, List[Entity]]]

The list of extractor outputs containing the tuple (spacy document, extracted entities).

List[Entity]

The list of joint entities.

Source code in anonipy/anonymize/extractors/multi_extractor.py
def __call__(
    self, text: str, detect_repeats: bool = False
) -> Tuple[List[Tuple[Doc, List[Entity]]], List[Entity]]:
    """Extract the entities from the text using the provided extractors.

    Examples:
        >>> extractor("John Doe is a 19 year old software engineer.", detect_repeats=False)
        [(Doc, [Entity]), (Doc, [Entity])], [Entity]

    Args:
        text: The text to extract entities from.
        detect_repeats: Whether to check text again for repeated entities.

    Returns:
        The list of extractor outputs containing the tuple (spacy document, extracted entities).
        The list of joint entities.

    """

    extractor_outputs = [e(text, detect_repeats) for e in self.extractors]
    joint_entities = merge_entities(extractor_outputs)

    return extractor_outputs, joint_entities
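
The shape of the return value may be easier to see with plain data. The sketch below imitates the fan-out-then-merge flow using stand-ins: entities are dictionaries, the "doc" is the raw text, and `merge_entities_sketch` is a hypothetical merge that drops exact duplicate spans and sorts by start offset. How anonipy's own `merge_entities` resolves overlaps is not shown in this section, so that rule is an assumption. The `AGE` label and both extractor functions are invented for the example.

```python
import re
from typing import Callable, List, Tuple

# Stand-in types (assumption): a "doc" is the raw text, an "entity" is a dict.
Entity = dict
Output = Tuple[str, List[Entity]]


def merge_entities_sketch(outputs: List[Output]) -> List[Entity]:
    """Hypothetical merge: drop exact duplicate spans, then sort by start."""
    seen = set()
    merged = []
    for _, entities in outputs:
        for ent in entities:
            key = (ent["start"], ent["end"], ent["label"])
            if key not in seen:
                seen.add(key)
                merged.append(ent)
    return sorted(merged, key=lambda e: e["start"])


def name_extractor(text: str, detect_repeats: bool = False) -> Output:
    """Hypothetical extractor that only knows one name."""
    start = text.find("John Doe")
    if start < 0:
        return text, []
    return text, [{"text": "John Doe", "label": "PERSON", "start": start, "end": start + 8}]


def age_extractor(text: str, detect_repeats: bool = False) -> Output:
    """Hypothetical extractor that finds a number followed by ' year old'."""
    match = re.search(r"\b\d{1,3}(?= year old)", text)
    if match is None:
        return text, []
    return text, [{"text": match.group(), "label": "AGE", "start": match.start(), "end": match.end()}]


def multi_call(extractors: List[Callable], text: str, detect_repeats: bool = False):
    """Same two steps as the method above: run every extractor, then merge."""
    extractor_outputs = [e(text, detect_repeats) for e in extractors]
    joint_entities = merge_entities_sketch(extractor_outputs)
    return extractor_outputs, joint_entities


outputs, joint = multi_call(
    [age_extractor, name_extractor],
    "John Doe is a 19 year old software engineer.",
)
```

`outputs` keeps one `(doc, entities)` pair per extractor in the order the extractors were given, while `joint` is a single flat list.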

display(doc, page=False, jupyter=None)

Display the entities in the text.

Examples:

>>> extractor_outputs, entities = extractor("John Doe is a 19 year old software engineer.")
>>> extractor.display(extractor_outputs[0][0])
HTML

Parameters:

Name Type Description Default
doc Doc

The spacy doc to display.

required
page bool

Whether to display the doc in a web browser.

False
jupyter bool

Whether to display the doc in a jupyter notebook.

None

Returns:

Type Description
str

The HTML representation of the document and the extracted entities.

Source code in anonipy/anonymize/extractors/multi_extractor.py
def display(self, doc: Doc, page: bool = False, jupyter: bool = None) -> str:
    """Display the entities in the text.

    Examples:
        >>> extractor_outputs, entities = extractor("John Doe is a 19 year old software engineer.")
        >>> extractor.display(extractor_outputs[0][0])
        HTML

    Args:
        doc: The spacy doc to display.
        page: Whether to display the doc in a web browser.
        jupyter: Whether to display the doc in a jupyter notebook.

    Returns:
        The HTML representation of the document and the extracted entities.

    """

    labels = list(
        itertools.chain.from_iterable([e.labels for e in self.extractors])
    )
    options = {"colors": {l["label"]: get_label_color(l["label"]) for l in labels}}
    return displacy.render(
        doc, style="ent", options=options, page=page, jupyter=jupyter
    )
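
The `options` dictionary is built by flattening the label lists of all extractors and mapping each label name to a color. The sketch below reproduces only that step with plain lists. `get_label_color_sketch` is a hypothetical stand-in (the real `get_label_color` is not shown in this section), and the two label lists are invented for the example.

```python
import hashlib
import itertools


def get_label_color_sketch(label: str) -> str:
    """Hypothetical stand-in: derive a stable hex color from the label text."""
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    return "#" + digest[:6]


# Label lists as two extractors might hold them (example values only).
ner_labels = [{"label": "PERSON", "type": "string"}]
pattern_labels = [{"label": "DATE", "type": "date"}]

# Flatten the per-extractor label lists into one list, as display() does.
labels = list(itertools.chain.from_iterable([ner_labels, pattern_labels]))

# One color per label name, in the shape passed to displacy as `options`.
options = {"colors": {l["label"]: get_label_color_sketch(l["label"]) for l in labels}}
```

If two extractors declare the same label name, the dictionary comprehension keeps a single entry for it, so the label is drawn in one color regardless of which extractor found it.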

anonipy.anonymize.extractors.ExtractorInterface

The class representing the extractor interface.

All extractors should inherit from this class.

Methods:

Name Description
__call__

Extract entities from the text.

Source code in anonipy/anonymize/extractors/interface.py
class ExtractorInterface:
    """The class representing the extractor interface.

    All extractors should inherit from this class.

    Methods:
        __call__(text):
            Extract entities from the text.

    """

    def __init__(self, labels: List[dict], *args, **kwargs):
        pass

    def __call__(self, text: str, *args, **kwargs) -> Tuple[Doc, List[Entity]]:
        pass
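
A custom extractor would subclass the interface and implement `__call__`. The following is a minimal sketch under stated assumptions: `ExtractorInterface` is redefined here as a stand-in with the signatures shown above, `EmailExtractorSketch` is hypothetical, and the return value uses the raw text and plain dictionaries in place of a spacy `Doc` and anonipy `Entity` objects.

```python
import re
from typing import List, Tuple


class ExtractorInterface:
    """Stand-in copying the signatures shown above (not the anonipy class)."""

    def __init__(self, labels: List[dict], *args, **kwargs):
        pass

    def __call__(self, text: str, *args, **kwargs):
        pass


class EmailExtractorSketch(ExtractorInterface):
    """Hypothetical extractor that finds e-mail-like strings with a regex."""

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def __init__(self, labels: List[dict], *args, **kwargs):
        super().__init__(labels, *args, **kwargs)
        self.labels = labels

    def __call__(self, text: str, *args, **kwargs) -> Tuple[str, List[dict]]:
        # Assumption: the first label is the one assigned to every match.
        label = self.labels[0]["label"]
        entities = [
            {"text": m.group(), "label": label, "start": m.start(), "end": m.end()}
            for m in self.EMAIL_RE.finditer(text)
        ]
        return text, entities


extractor = EmailExtractorSketch([{"label": "EMAIL", "type": "string"}])
doc, entities = extractor("Contact jane.doe@example.com for details.")
```

Because the sketch inherits from the stand-in interface and stores `labels`, it would pass the `isinstance` check in `MultiExtractor.__init__` and supply the attribute that `display` reads, provided the real base class is used instead.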