Built-in preprocessing tools¶
Hooks¶
Moses Normalizer¶
class podium.preproc.MosesNormalizer(language='en')[source]¶
Pretokenization hook that normalizes the raw textual data.
Uses sacremoses.MosesPunctNormalizer to perform normalization.
MosesNormalizer constructor.
- Parameters
language (str) – Language argument for the normalizer. Default: “en”.
- Raises
ImportError – If sacremoses is not installed.
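As a rough illustration of what punctuation normalization does (sacremoses.MosesPunctNormalizer applies a much larger set of language-specific rules; the translation table below is a simplified stand-in, not the hook's actual behavior):

```python
# Illustrative only: sacremoses applies many more language-specific
# rules than this small translation table.
PUNCT_TABLE = str.maketrans({
    "\u201c": '"',  # curly double quotes -> straight
    "\u201d": '"',
    "\u2018": "'",  # curly single quotes -> straight
    "\u2019": "'",
})

def normalize(text: str) -> str:
    """Replace curly punctuation with its ASCII equivalent."""
    return text.translate(PUNCT_TABLE)

print(normalize("\u201cHello\u201d, it\u2019s me"))  # "Hello", it's me
```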
Regex Replace¶
class podium.preproc.RegexReplace(replace_patterns)[source]¶
Pretokenization hook that applies a sequence of regex substitutions to the raw textual data.
Each substitution corresponds to a 2-tuple consisting of a regex pattern and a string that will replace that pattern.
RegexReplace constructor.
- Parameters
replace_patterns (iterable of tuple(Union[re.Pattern, str], str)) – Iterable of 2-tuples where the first element is either a regex pattern or a string and the second element is a string that will replace each occurrence of the pattern specified as the first element.
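The substitution sequence can be sketched with plain re (the names below are illustrative; RegexReplace wraps this behavior as a pretokenization hook):

```python
import re

# Each 2-tuple is (pattern, replacement); substitutions run in order.
# Both plain strings and precompiled patterns are accepted by re.sub.
replace_patterns = [
    (r"<[^>]+>", ""),           # strip HTML-like tags
    (re.compile(r"\s+"), " "),  # collapse runs of whitespace
]

def apply_patterns(text, patterns):
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text

print(apply_patterns("<b>Hello</b>   world", replace_patterns))  # Hello world
```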
Keyword Extractor¶
class podium.preproc.KeywordExtractor(algorithm, **kwargs)[source]¶
Posttokenization hook that extracts keywords from the raw textual data.
The tokenized data is ignored during this process.
Keyword Extractor constructor.
- Parameters
algorithm (str) – The algorithm used to extract keywords. Supported algorithms: rake and yake.
**kwargs – Keyword arguments passed to the keyword extraction algorithm.
- Raises
ImportError – If the keyword extraction algorithm is not installed.
ValueError – If the specified extraction algorithm is not supported.
Text Cleanup¶
class podium.preproc.TextCleanUp(language='en', fix_unicode=True, to_ascii=True, remove_line_breaks=False, remove_punct=False, replace_url=None, replace_email=None, replace_phone_number=None, replace_number=None, replace_digit=None, replace_currency_symbol=None)[source]¶
Pretokenization hook that cleans up the raw textual data.
Additionally, it supports replacement of urls, emails, phone numbers, numbers, digits, and currency symbols with arbitrary tokens. During the clean up, whitespace is normalized.
TextCleanUp constructor.
- Parameters
language (str) – Language argument for the text clean up. Default: “en”.
fix_unicode (bool) – Fix various unicode errors. Default: True.
to_ascii (bool) – Transliterate to closest ASCII representation. Default: True.
remove_line_breaks (bool) – Fully strip line breaks as opposed to only normalizing them. Default: False.
remove_punct (bool) – Fully remove punctuation. Default: False.
replace_url (str, optional) – If defined, token used to replace urls in the input data. Default: None.
replace_email (str, optional) – If defined, token used to replace emails in the input data. Default: None.
replace_phone_number (str, optional) – If defined, token used to replace phone numbers in the input data. Default: None.
replace_number (str, optional) – If defined, token used to replace numbers in the input data. Default: None.
replace_digit (str, optional) – If defined, token used to replace digits in the input data. Default: None.
replace_currency_symbol (str, optional) – If defined, token used to replace currency symbols in the input data. Default: None.
- Raises
ValueError – If the given language is not supported.
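A minimal sketch of the replacement behavior, using simplified stand-in regexes (not the patterns TextCleanUp actually uses): when a replacement token is given, every match is substituted, and whitespace is normalized afterwards.

```python
import re

# Simplified stand-in patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def clean_up(text, replace_url=None, replace_email=None):
    if replace_url is not None:
        text = URL_RE.sub(replace_url, text)
    if replace_email is not None:
        text = EMAIL_RE.sub(replace_email, text)
    return " ".join(text.split())  # normalize whitespace

print(clean_up("Visit https://example.com  or mail me@example.com",
               replace_url="<URL>", replace_email="<EMAIL>"))
# Visit <URL> or mail <EMAIL>
```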
NLTK Stemmer¶
class podium.preproc.NLTKStemmer(language='en', ignore_stopwords=False)[source]¶
Posttokenization hook that applies stemming to the tokenized textual data.
Uses nltk.stem.SnowballStemmer to perform stemming.
NLTKStemmer constructor.
- Parameters
language (str) – The language argument for the stemmer. Default: “en”. For the list of supported languages, see: https://www.nltk.org/api/nltk.stem.html.
ignore_stopwords (bool) – If True, stemming is not applied to stopwords. Default: False.
- Raises
ValueError – If the given language is not supported.
Spacy Lemmatizer¶
class podium.preproc.SpacyLemmatizer(language='en', mode='lookup')[source]¶
Posttokenization hook that applies the spaCy lemmatizer to the tokenized textual data.
If the language model is not installed, an attempt is made to install it.
SpacyLemmatizer constructor.
- Parameters
language (str) – Language argument for the lemmatizer. For the list of supported languages, see https://spacy.io/usage/models#languages. Default: “en”.
mode (str) – The lemmatizer mode. By default, the following modes are available: “lookup” and “rule”. Default: “lookup”.
Stop words¶
Module contains sets of stop words and stop words removal hook.
Sentencizers¶
Module contains text sentencizer.
class podium.preproc.sentencizers.SpacySentencizer(language='en')[source]¶
Bases: object
Detects sentence boundaries and splits the input data on them.
If the language model is not installed, an attempt is made to install it.
Sentencizer constructor.
- Parameters
language (str) – Language argument for the sentencizer. For the list of supported languages, see https://spacy.io/usage/models#languages. Default: “en”.
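SpacySentencizer relies on spaCy's model-based sentence boundary detection; the naive regex split below only illustrates the input/output shape (one string in, a list of sentence strings out):

```python
import re

# Naive illustration: split after sentence-final punctuation followed
# by whitespace. The real sentencizer uses a spaCy language model.
def naive_sentencize(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentencize("First sentence. Second one! Third?"))
# ['First sentence.', 'Second one!', 'Third?']
```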
Tokenizers¶
Module contains text tokenizers.
podium.preproc.tokenizers.get_tokenizer(tokenizer)[source]¶
Returns a tokenizer according to the parameters given.
- Parameters
tokenizer (str | callable) –
If a callable object is given, it will just be returned. Otherwise, a string can be given to create one of the premade tokenizers. The string must be of the format ‘tokenizer’ or ‘tokenizer-args’.
- The available premade tokenizers are:
‘split’ - the default str.split(). A custom separator can be provided as ‘split-sep’, where sep is the separator string.
‘spacy’ - the spacy tokenizer, using the ‘en’ language model by default. A different language model can be provided as ‘spacy-lang’, where lang is the language model name (e.g. ‘spacy-en’). If the spacy model is used for the first time, an attempt will be made to install it. If that fails, the user should download it manually with a command similar to python -m spacy download en. More details can be found in the spacy documentation: https://spacy.io/usage/models.
‘toktok’ - NLTK’s toktok tokenizer. For more details see https://www.nltk.org/_modules/nltk/tokenize/toktok.html.
‘moses’ - Sacremoses’ moses tokenizer. For more details see https://github.com/alvations/sacremoses.
- Returns
The created (or given) tokenizer.
- Raises
ImportError – If the required package for the specified tokenizer is not installed.
ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the supported tokenizers.
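The dispatch logic described above can be sketched as follows, limited to the ‘split’ tokenizer for brevity (the real get_tokenizer also wires up spacy, toktok, and moses; the function name here is illustrative):

```python
# Sketch of the described dispatch: callables pass through, strings of
# the form 'tokenizer' or 'tokenizer-args' select a premade tokenizer.
def get_tokenizer_sketch(tokenizer):
    if callable(tokenizer):
        return tokenizer  # callables are returned as-is
    if not isinstance(tokenizer, str):
        raise ValueError(f"Unsupported tokenizer: {tokenizer!r}")
    name, _, arg = tokenizer.partition("-")  # 'split-,' -> ('split', ',')
    if name == "split":
        return (lambda s: s.split(arg)) if arg else str.split
    raise ValueError(f"Unknown tokenizer: {tokenizer!r}")

tok = get_tokenizer_sketch("split")
print(tok("a b c"))      # ['a', 'b', 'c']
csv_tok = get_tokenizer_sketch("split-,")
print(csv_tok("a,b,c"))  # ['a', 'b', 'c']
```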
Utilities¶
Module contains utility functions to preprocess text data.
podium.preproc.utils.capitalize_target_like_source(func)[source]¶
Capitalization decorator for a method that processes a word. The wrapper invokes the decorated function with a lowercased input, then capitalizes the return value so that its capitalization corresponds to the original input.
- Parameters
func (function) – The function which gets called. It MUST be a class member with one positional argument (like def func(self, word)), but may accept additional keyword arguments (like def func(self, word, my_arg='my_value')).
- Returns
wrapper – the wrapped version of func
- Return type
function
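A minimal sketch of a decorator with this contract (the actual utility may restore capitalization more precisely; the names here are illustrative): the method sees a lowercased word, and the result is capitalized when the original input was.

```python
from functools import wraps

# Sketch: lowercase the input before calling func, then re-capitalize
# the result if the original word started with an uppercase letter.
def capitalize_like_source(func):
    @wraps(func)
    def wrapper(self, word, **kwargs):
        result = func(self, word.lower(), **kwargs)
        return result.capitalize() if word[:1].isupper() else result
    return wrapper

class Stemmer:
    @capitalize_like_source
    def stem(self, word):
        return word.rstrip("s")  # toy stemming rule for illustration

print(Stemmer().stem("Cats"))  # Cat
print(Stemmer().stem("cats"))  # cat
```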
podium.preproc.utils.find_word_by_prefix(trie, word)[source]¶
Searches through a trie data structure and returns an element of the trie if the word is a prefix or an exact match of one of the trie elements. Otherwise returns None.
- Parameters
trie (dict) – Nested dict trie data structure
word (str) – String being searched for in the trie data structure
- Returns
found_word – The string found, which is either the exact word or its prefix; None if the word is not found in the trie
- Return type
str
podium.preproc.utils.make_trie(words)[source]¶
Creates a prefix trie data structure given a list of strings. Strings are split into chars and a char-nested trie dict is returned.
- Parameters
words (list(str)) – List of strings to create a trie structure from
- Returns
trie – Nested dict trie data structure
- Return type
dict
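The two trie utilities above can be sketched together (the end-of-word marker and the exact return-value semantics of the real implementations may differ; this shows the nested-dict structure and the prefix walk):

```python
_END = "_end_"  # end-of-word marker key; the real marker is an implementation detail

def make_trie(words):
    """Build a nested-dict trie, one level per character."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})
        node[_END] = word  # mark a complete stored word
    return trie

def find_word_by_prefix(trie, word):
    """Return `word` if it is a prefix (or exact match) of a stored word, else None."""
    node = trie
    for char in word:
        if char not in node:
            return None  # word diverges from every stored word
        node = node[char]
    return word

trie = make_trie(["apple", "app"])
print(find_word_by_prefix(trie, "app"))  # app
print(find_word_by_prefix(trie, "apx"))  # None
```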
podium.preproc.utils.uppercase_target_like_source(source, target)[source]¶
Uppercases target at the same positions where source is uppercased.
- Parameters
source (str) – source string from which uppercasing is transferred
target (str) – target string that needs to be uppercased
- Returns
uppercased_target – uppercased target string
- Return type
str
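A sketch of the described behavior, simplified to an aligned position-by-position transfer (the real utility may handle length mismatches differently; the function name is illustrative):

```python
# Uppercase each character of `target` whose position is uppercase in `source`.
def uppercase_like_source(source, target):
    return "".join(
        t.upper() if i < len(source) and source[i].isupper() else t
        for i, t in enumerate(target)
    )

print(uppercase_like_source("McDonald", "mcdonalds"))  # McDonalds
```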
Module contents¶
Package contains modules for preprocessing.