Built-in preprocessing tools

Hooks

Moses Normalizer

class podium.preproc.MosesNormalizer(language='en')[source]

Pretokenization hook that normalizes the raw textual data.

Uses sacremoses.MosesPunctNormalizer to perform normalization.

MosesNormalizer constructor.

Parameters

language (str) – Language argument for the normalizer. Default: “en”.

Raises

ImportError – If sacremoses is not installed.

__call__(raw)[source]

Applies normalization to the raw textual data.

Parameters

raw (str) – Raw textual data.

Returns

Normalized textual data.

Return type

str
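
Example

A minimal usage sketch, assuming sacremoses is installed; the exact output may vary with the sacremoses version:

>>> from podium.preproc import MosesNormalizer
>>> normalize = MosesNormalizer(language="en")
>>> normalize("«Hello»   world!")
'"Hello" world!'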

Regex Replace

class podium.preproc.RegexReplace(replace_patterns)[source]

Pretokenization hook that applies a sequence of regex substitutions to the raw textual data.

Each substitution corresponds to a 2-tuple consisting of a regex pattern and a string that will replace that pattern.

RegexReplace constructor.

Parameters

replace_patterns (iterable of tuple(Union[re.Pattern, str], str)) – Iterable of 2-tuples where the first element is either a regex pattern or a string and the second element is a string that will replace each occurrence of the pattern specified by the first element.

__call__(raw)[source]

Applies a sequence of regex substitutions to the raw textual data.

Parameters

raw (str) – Raw textual data.

Returns

Resulting textual data after applying the regex substitutions.

Return type

str
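
Example

A short sketch that masks digits and then collapses runs of whitespace; substitutions are applied in the order given, and the first element of each tuple may be either a compiled pattern or a string:

>>> import re
>>> from podium.preproc import RegexReplace
>>> clean = RegexReplace([
...     (re.compile(r"\d+"), "<NUM>"),
...     (r"\s+", " "),
... ])
>>> clean("Order  42  shipped")
'Order <NUM> shipped'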

Keyword Extractor

class podium.preproc.KeywordExtractor(algorithm, **kwargs)[source]

Posttokenization hook that extracts keywords from the raw textual data.

The tokenized data is ignored during this process.

KeywordExtractor constructor.

Parameters
  • algorithm (str) – The algorithm used to extract keywords. Supported algorithms: rake and yake.

  • **kwargs – Keyword arguments passed to the keyword extraction algorithm.

Raises
  • ImportError – If the keyword extraction algorithm is not installed.

  • ValueError – If the specified extraction algorithm is not supported.

__call__(raw, tokenized)[source]

Extracts keywords from the raw data.

Returns

2-tuple where the first element is left unchanged and the second element contains the extracted keywords.

Return type

tuple(str, list of str)
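
Example

A minimal sketch, assuming the package backing the chosen algorithm (rake or yake, as listed above) is installed; the extracted keywords depend on that package:

>>> from podium.preproc import KeywordExtractor
>>> extract = KeywordExtractor("rake")
>>> raw = "Podium is a framework-agnostic Python NLP library for data loading."
>>> raw, keywords = extract(raw, raw.split())
>>> # keywords is a list of keyword strings; the tokenized input is ignored.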

Text Cleanup

class podium.preproc.TextCleanUp(language='en', fix_unicode=True, to_ascii=True, remove_line_breaks=False, remove_punct=False, replace_url=None, replace_email=None, replace_phone_number=None, replace_number=None, replace_digit=None, replace_currency_symbol=None)[source]

Pretokenization hook that cleans up the raw textual data.

Additionally, it supports replacement of urls, emails, phone numbers, numbers, digits, and currency symbols with arbitrary tokens. During the clean up, whitespace is normalized.

TextCleanUp constructor.

Parameters
  • language (str) – Language argument for the text clean up. Default: “en”.

  • fix_unicode (bool) – Fix various unicode errors. Default: True.

  • to_ascii (bool) – Transliterate to closest ASCII representation. Default: True.

  • remove_line_breaks (bool) – Fully strip line breaks as opposed to only normalizing them. Default: False.

  • remove_punct (bool) – Fully remove punctuation. Default: False.

  • replace_url (str, optional) – If defined, token used to replace urls in the input data. Default: None.

  • replace_email (str, optional) – If defined, token used to replace emails in the input data. Default: None.

  • replace_phone_number (str, optional) – If defined, token used to replace phone numbers in the input data. Default: None.

  • replace_number (str, optional) – If defined, token used to replace numbers in the input data. Default: None.

  • replace_digit (str, optional) – If defined, token used to replace digits in the input data. Default: None.

  • replace_currency_symbol (str, optional) – If defined, token used to replace currency symbols in the input data. Default: None.

Raises

ValueError – If the given language is not supported.

__call__(raw)[source]

Cleans up the raw textual data.

Parameters

raw (str) – Raw textual data.

Returns

Cleaned up textual data.

Return type

str
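
Example

A minimal sketch; the replacement tokens <EMAIL> and <URL> are arbitrary strings chosen here for illustration:

>>> from podium.preproc import TextCleanUp
>>> cleanup = TextCleanUp(replace_url="<URL>", replace_email="<EMAIL>")
>>> cleanup("Contact us at support@example.com or visit https://example.com")
'Contact us at <EMAIL> or visit <URL>'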

NLTK Stemmer

class podium.preproc.NLTKStemmer(language='en', ignore_stopwords=False)[source]

Posttokenization hook that applies stemming to the tokenized textual data.

Uses nltk.stem.SnowballStemmer to perform stemming.

NLTKStemmer constructor.

Parameters
  • language (str) – The language argument for the stemmer. Default: “en”. For the list of supported languages, see: https://www.nltk.org/api/nltk.stem.html.

  • ignore_stopwords (bool) – If True, stemming is not applied to stopwords. Default: False.

Raises

ValueError – If the given language is not supported.

__call__(raw, tokenized)[source]

Stems the tokenized textual data. The raw part is left unchanged.

Returns

2-tuple where the first element is left unchanged and the second element contains the stemmed tokens.

Return type

tuple(str, list of str)
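
Example

A minimal sketch, assuming nltk is installed; the English Snowball stemmer is used for language=”en”:

>>> from podium.preproc import NLTKStemmer
>>> stem = NLTKStemmer(language="en", ignore_stopwords=False)
>>> raw = "the cats were running"
>>> raw, stemmed = stem(raw, raw.split())
>>> stemmed
['the', 'cat', 'were', 'run']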

Spacy Lemmatizer

class podium.preproc.SpacyLemmatizer(language='en', mode='lookup')[source]

Posttokenization hook that applies SpaCy Lemmatizer to the tokenized textual data.

If the language model is not installed, an attempt is made to install it.

SpacyLemmatizer constructor.

Parameters
  • language (str) – Language argument for the lemmatizer. For the list of supported languages, see https://spacy.io/usage/models#languages. Default: “en”.

  • mode (str) – The lemmatizer mode. By default, the following modes are available: “lookup” and “rule”. Default: “lookup”.

__call__(raw, tokenized)[source]

Applies lemmatization to the tokenized textual data. The raw part is left unchanged.

Returns

2-tuple where the first element is left unchanged and the second element contains the lemmatized tokens.

Return type

tuple(str, list of str)
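
Example

A minimal sketch, assuming spacy, its ‘en’ model, and the lookup tables are available; the exact lemmas depend on the model’s lookup data:

>>> from podium.preproc import SpacyLemmatizer
>>> lemmatize = SpacyLemmatizer(language="en", mode="lookup")
>>> raw = "the mice were running"
>>> raw, lemmas = lemmatize(raw, raw.split())
>>> # lemmas is roughly ['the', 'mouse', 'be', 'run']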

Truecasing

podium.preproc.truecase(oov='title')[source]

Returns a pretokenization hook that applies truecasing to the raw textual data.

To use this hook, the truecase library has to be installed.

Parameters

oov (str) –

Defines how to handle out-of-vocabulary tokens not seen while training the truecasing model. Three options are supported:
  • ’title’ - returns OOV tokens in title format

  • ’lower’ - returns OOV tokens in lower case

  • ’as-is’ - returns OOV tokens as they are

Default: ‘title’.

Returns

Function that truecases the raw data.

Return type

callable

Raises

ImportError – If the truecase library is not installed.
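
Example

A minimal sketch, assuming the truecase library is installed; the exact casing depends on the model shipped with that library:

>>> from podium.preproc import truecase
>>> apply_truecase = truecase(oov="lower")
>>> apply_truecase("hey, what is the weather in new york?")
'Hey, what is the weather in New York?'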

Stopwords removal

podium.preproc.remove_stopwords(language='en')[source]

Returns a posttokenization hook that removes stop words from the tokenized textual data. The raw part is left unchanged.

Stop words are obtained from the corresponding SpaCy language model.

Parameters

language (str) – Language whose stop words will be removed. Default is ‘en’.

Returns

Function that removes stop words from the tokenized part of the input data.

Return type

callable

Notes

This function does not lowercase the tokenized data prior to stopword removal.
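
Example

A minimal sketch; per the note above, tokens should already be lowercased, since e.g. ‘This’ would not match the lowercase stop word ‘this’:

>>> from podium.preproc import remove_stopwords
>>> hook = remove_stopwords("en")
>>> raw = "this is a sample sentence"
>>> raw, filtered = hook(raw, raw.split())
>>> filtered
['sample', 'sentence']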

Tokenizers

podium.preproc.get_tokenizer(tokenizer)[source]

Returns a tokenizer according to the parameters given.

Parameters

tokenizer (str | callable) –

If a callable object is given, it will simply be returned. Otherwise, a string can be given to create one of the premade tokenizers. The string must be of the format ‘tokenizer’ or ‘tokenizer-args’.

The available premade tokenizers are:
  • ’split’ - default str.split(). Custom separator can be provided as split-sep where sep is the separator string.

  • ’spacy’ - the spacy tokenizer, using the ‘en’ language model by default. A different language model can be provided as ‘spacy-lang’, where lang is the language model name (e.g. ‘spacy-en’). If a spacy model is used for the first time, an attempt to install it will be made. If that fails, the user should download it manually using a command similar to python -m spacy download en. More details can be found in the spacy documentation: https://spacy.io/usage/models.

  • ’toktok’ - NLTK’s toktok tokenizer. For more details, see https://www.nltk.org/_modules/nltk/tokenize/toktok.html.

  • ’moses’ - sacremoses’ moses tokenizer. For more details, see https://github.com/alvations/sacremoses.

Returns

The created (or given) tokenizer.

Return type

callable

Raises
  • ImportError – If the required package for the specified tokenizer is not installed.

  • ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the supported tokenizers.
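
Example

A short sketch of the string-based constructors described in the list above:

>>> from podium.preproc import get_tokenizer
>>> tokenize = get_tokenizer("split")
>>> tokenize("a b c")
['a', 'b', 'c']
>>> csv_tokenize = get_tokenizer("split-,")
>>> csv_tokenize("a,b,c")
['a', 'b', 'c']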

class podium.preproc.SpacySentencizer(language='en')[source]

Detects sentence boundaries and splits the input data on them.

If the language model is not installed, an attempt is made to install it.

Sentencizer constructor.

Parameters

language (str) – Language argument for the sentencizer. For the list of supported languages, see https://spacy.io/usage/models#languages. Default: “en”.

__call__(data)[source]

Splits the input data on sentence boundaries.

Parameters

data – The input data.

Returns

The input data split on sentence boundaries.

Return type

list of str
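
Example

A minimal sketch, assuming the ‘en’ spacy model is available; the detected boundaries depend on the model:

>>> from podium.preproc import SpacySentencizer
>>> sentencize = SpacySentencizer(language="en")
>>> sentencize("Podium is simple. It is also flexible.")
['Podium is simple.', 'It is also flexible.']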

Utilities

podium.preproc.as_posttokenize_hook(hook)[source]

Transforms a pretokenization hook into a posttokenization hook.

This function supports only the built-in hooks and raises a TypeError for user-provided hooks.
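
Example

A minimal sketch converting a built-in pretokenization hook for use in the posttokenization phase:

>>> from podium.preproc import MosesNormalizer, as_posttokenize_hook
>>> post_hook = as_posttokenize_hook(MosesNormalizer())
>>> # post_hook now follows the (raw, tokenized) -> (raw, tokenized)
>>> # signature expected of posttokenization hooks.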