Fields and Vocab¶
Field¶
class podium.Field(name, tokenizer='split', keep_raw=False, numericalizer=None, is_target=False, include_lengths=False, fixed_length=None, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, padding_token=-999, missing_data_token=-1, pretokenize_hooks=None, posttokenize_hooks=None)[source]¶
Holds the preprocessing and numericalization logic for a single field of a dataset.
Create a Field from arguments.
- Parameters
name (str) – Field name, used for referencing data in the dataset.
tokenizer (str | callable | optional) – The tokenizer used when preprocessing raw data. Users can provide their own tokenizer as a callable, or specify one of the registered tokenizers by passing a string keyword. Available pre-registered tokenizers are:
'split' - default, str.split(). A custom separator can be provided as split-sep, where sep is the separator string.
'spacy-lang' - the spacy tokenizer. The language model can be defined by replacing lang with the language model name, e.g. spacy-en_core_web_sm.
If None, the data will not be tokenized. Raw input data will be stored as tokenized.
keep_raw (bool) – Flag determining whether to store raw data. If True, raw data will be stored in the 'raw' part of the Example tuple.
numericalizer (callable) – Object used to numericalize tokens. Can be a Vocab, a custom numericalizer callable or None. If the numericalizer is a Vocab instance, the Vocab's padding token will be used instead of the Field's. If a Callable is passed, it will be used to numericalize data token by token. If None, numericalization won't be attempted for this Field and batches will be created as lists instead of numpy matrices.
is_target (bool) – Flag indicating whether this field contains a target (response) variable. Affects iteration over batches by separating target and non-target Fields.
include_lengths (bool) – Flag indicating whether the batch representation of this field should include the length of every instance in the batch. If True, the batch element under the name of this Field will be a tuple of (numericalized values, lengths).
fixed_length (int, optional) – Number indicating the length to which the field should be fixed. If set, every example in the field will be truncated or padded to the given length during batching. If the batched data is not a sequence, this parameter is ignored. If None, the batch data will be padded to the length of the longest instance in each minibatch.
allow_missing_data (bool) – A flag determining whether the Field allows missing data. If allow_missing_data=False and a None value is present in the raw data, a ValueError will be raised. If allow_missing_data=True and a None value is present in the raw data, it will be stored and numericalized properly. Defaults to False.
disable_batch_matrix (bool) – Flag indicating whether the data contained in this Field should be packed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a padded matrix of numericalizations for all examples. If True, a list of unpadded numericalizations will be returned instead. For missing data, the value in the list will be None. Defaults to False.
disable_numericalize_caching (bool) – Flag which determines whether the numericalization of this field should be cached. Should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached upon first computation and reused each subsequent time. The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) to not be masked, while using the same Vocab. Defaults to False.
padding_token (int) – Padding token used when the numericalizer is a callable. If the numericalizer is None or a Vocab instance, this value is ignored. Defaults to -999.
missing_data_token (Union[int, float]) – Token used to mark instance data as missing. For non-numericalizable fields, this parameter is ignored and their value will be None. Defaults to -1.
pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
posttokenize_hooks (Iterable[Callable[[Any, List[str]], Tuple[Any, List[str]]]]) – Iterable containing posttokenization hooks. Providing hooks in this way is identical to calling add_posttokenize_hook.
- Raises
ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the registered tokenizers.
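A minimal construction sketch (the parameter values and the lowercasing hook are illustrative, not canonical):

from podium import Field, Vocab

# A text field tokenized with str.split(), numericalized through a
# shared Vocab, and lowercased by a pretokenization hook.
vocab = Vocab(max_size=10000, min_freq=2)
text = Field(
    name="text",
    tokenizer="split",
    numericalizer=vocab,
    keep_raw=False,
    pretokenize_hooks=[str.lower],
)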
__getstate__()[source]¶ Method obtains the field state. It is used for pickling dataset data to file.
- Returns
state – dataset state dictionary
- Return type
dict
__setstate__(state)[source]¶ Method sets the field state. It is used for unpickling dataset data from file.
- Parameters
state (dict) – dataset state dictionary
add_posttokenize_hook(hook)[source]¶ Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The output of the final post-tokenization hook is the raw and tokenized data that the preprocess function will use to produce its result.
Post-tokenization hooks have the following outline:

def post_tok_hook(raw_data, tokenized_data):
    raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
    return raw_out, tokenized_out

where 'tokenized_data' is, and 'tokenized_out' should be, an iterable.
- Parameters
hook (callable) – The post-tokenization hook that we want to add to the field.
- Raises
If the field is declared as non-numericalizable.
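A minimal hook sketch (the hook name and filtering rule are illustrative), attached to the text Field from the earlier sketch:

# Drop single-character tokens after tokenization.
def remove_short_tokens(raw, tokenized):
    return raw, [token for token in tokenized if len(token) > 1]

text.add_posttokenize_hook(remove_short_tokens)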
add_pretokenize_hook(hook)[source]¶ Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
Pre-tokenization hooks have the following signature:

def pre_tok_hook(raw_data):
    raw_data_out = do_stuff(raw_data)
    return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (callable) – The pre-tokenization hook that we want to add to the field.
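A minimal hook sketch (the regex and hook name are illustrative), attached to the text Field from the earlier sketch:

import re

# Strip HTML tags from raw text before tokenization.
def strip_html(raw_data):
    return re.sub(r"<[^>]+>", " ", raw_data)

text.add_pretokenize_hook(strip_html)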
property eager¶ A flag that tells whether this field has a Vocab and whether that Vocab is marked as eager.
- Returns
Whether this field has a Vocab and whether that Vocab is marked as eager
- Return type
bool
get_default_value()[source]¶ Method obtains the default field value for missing data.
- Returns
The index of the missing data token, if this field is numericalizable. None value otherwise.
- Return type
missing_symbol index or None
- Raises
ValueError – If missing data is not allowed in this field.
get_numericalization_for_example(example, cache=True)[source]¶ Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if 'cache' is true and the cached data is not already present. If already cached, the cached data is returned.
- Parameters
example (Example) – example to get numericalized data for.
cache (bool) – whether to cache the calculated numericalization if not already cached.
- Returns
The numericalized data.
- Return type
Union[numpy.ndarray, Any]
get_output_fields()[source]¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable
property is_finalized¶ Returns whether the field's Vocab was finalized. If the field has no vocab, returns True.
- Returns
Whether the field's Vocab was finalized. If the field has no vocab, returns True.
- Return type
bool
property name¶ The name of this field.
numericalize(data)[source]¶ Numericalize the already preprocessed data point, based either on the vocab that was previously built or on a custom numericalization function if the field doesn't use a vocab.
- Parameters
data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, 'raw' is ignored and can be None. Otherwise, 'tokenized' is ignored and can be None.
- Returns
Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.
- Return type
numpy array
- Raises
ValueError – If data is None and missing data is not allowed in this field.
preprocess(data)[source]¶ Preprocesses raw data, tokenizing it if required, updating the vocab if the vocab is eager, and preserving the raw data if the field's 'keep_raw' is true.
- Parameters
data (str or iterable(hashable)) – The raw data that needs to be preprocessed.
- Returns
A tuple containing one tuple of the format (field_name, (raw, tokenized)). Raw is set to None if keep_raw is disabled. Both raw and tokenized will be set to None if None is passed as data and allow_missing_data is enabled.
- Return type
((str, Iterable(hashable)), )
- Raises
ValueError – If data is None and missing data is not allowed.
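A minimal end-to-end sketch of the preprocess/numericalize flow for a standalone Field (names are illustrative; an eager Vocab is assumed so that preprocess updates frequencies immediately):

from podium import Field, Vocab

vocab = Vocab(eager=True)
field = Field(name="text", tokenizer="split", numericalizer=vocab)

# preprocess returns a tuple containing one (name, (raw, tokenized)) tuple.
((name, (raw, tokenized)),) = field.preprocess("the quick brown fox")
vocab.finalize()  # the Vocab must be finalized before numericalization
indices = field.numericalize((raw, tokenized))  # numpy array of stoi indexes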
remove_posttokenize_hooks()[source]¶ Remove all the post-tokenization hooks that were added to the Field.
remove_pretokenize_hooks()[source]¶ Remove all the pre-tokenization hooks that were added to the Field.
update_vocab(tokenized)[source]¶ Updates the vocab with a data point in its tokenized form. If the field does not do tokenization, the data point is added to the vocab as a single token.
- Parameters
tokenized (Union[Any, List(str)]) – The tokenized form of the data point that the vocab is to be updated with.
property use_vocab¶ A flag that tells whether the field uses a vocab or not.
- Returns
Whether the field uses a vocab or not.
- Return type
bool
MultioutputField¶
class podium.MultioutputField(output_fields, tokenizer='split', pretokenize_hooks=None)[source]¶
Field that does pretokenization and tokenization once and passes the result to its output fields.
Output fields can be any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).
- Parameters
output_fields (List[Field]) – List containing the output fields. The pretokenization hooks and tokenizers of these fields are ignored and only their posttokenization hooks are used.
tokenizer (Optional[Union[str, Callable]]) – The tokenizer used when preprocessing raw data. Users can provide their own tokenizer as a callable object or specify one of the premade tokenizers by a string. The available premade tokenizers are:
'split' - default, str.split()
'spacy-lang' - the spacy tokenizer. The language model can be defined by replacing lang with the language model name, e.g. spacy-en_core_web_sm.
pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
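A minimal usage sketch (field names and the lowercasing hook are illustrative):

from podium import Field, MultioutputField, Vocab

# Tokenize once, then derive two output fields from the same tokens:
# one lowercased, one keeping the original casing.
lowercased = Field(
    name="text_lower",
    numericalizer=Vocab(),
    posttokenize_hooks=[lambda raw, tok: (raw, [t.lower() for t in tok])],
)
cased = Field(name="text_cased", numericalizer=Vocab())
multi_text = MultioutputField([lowercased, cased], tokenizer="split")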
add_output_field(field)[source]¶ Adds the passed field to this field's output fields.
- Parameters
field (Field) – Field to add to output fields.
add_pretokenize_hook(hook)[source]¶ Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.
Pre-tokenization hooks have the following signature:

def pre_tok_hook(raw_data):
    raw_data_out = do_stuff(raw_data)
    return raw_data_out
This can be used to eliminate encoding errors in data, replace numbers and names, etc.
- Parameters
hook (Callable[[Any], Any]) – The pre-tokenization hook that we want to add to the field.
get_output_fields()[source]¶ Returns an Iterable of the contained output fields.
- Returns
an Iterable of the contained output fields.
- Return type
Iterable[Field]
preprocess(data)[source]¶ Preprocesses raw data, tokenizing it if required. The output fields update their vocabs if required and preserve the raw data if the output field's 'keep_raw' is true.
- Parameters
data (Any) – The raw data that needs to be preprocessed.
- Returns
An Iterable containing the raw and tokenized data of all the output fields. The structure of the returned tuples is (name, (raw, tokenized)), where ‘name’ is the name of the output field and raw and tokenized are processed data.
- Return type
Iterable[Tuple[str, Tuple[Optional[Any], Any]]]
- Raises
ValueError – If data is None and missing data is not allowed.
LabelField¶
class podium.LabelField(name, numericalizer=None, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, is_target=True, missing_data_token=-1, pretokenize_hooks=None)[source]¶
Field subclass used when no tokenization is required. For example, with a field that has a single value denoting a label.
- Parameters
name (str) – Field name, used for referencing data in the dataset.
numericalizer (callable) – Object used to numericalize tokens. Can either be a Vocab, a custom numericalization callable or None. If it's a Vocab, this field will update it after preprocessing (or during finalization if eager is False) and use it to numericalize data. Also, the Vocab's padding token will be used instead of the Field's. If it's a Callable, it will be used to numericalize data token by token. If None, numericalization won't be attempted and batches will be created as lists instead of numpy matrices.
allow_missing_data (bool) – Whether the field allows missing data. If 'allow_missing_data' is False and None is sent to be preprocessed, a ValueError will be raised. If 'allow_missing_data' is True and a None is sent to be preprocessed, it will be stored and later numericalized properly.
disable_batch_matrix (bool) – Whether the batch created for this field will be compressed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a matrix of numericalizations for all examples (if possible). If True, a list of unpadded vectors (or other data types) will be returned instead. For missing data, the value in the list will be None.
disable_numericalize_caching (bool) – The flag which determines whether the numericalization of this field should be cached. This flag should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached and reused each time the instance is used as part of a batch. The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) to not be masked, while using the same vocabulary.
is_target (bool) – Whether this field is a target variable. Affects iteration over batches.
missing_data_token (Union[int, float]) – Token to use to mark batch rows as missing. If data for a field is missing, its matrix row will be filled with this value. For non-numericalizable fields, this parameter is ignored and the value will be None.
pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
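A minimal usage sketch (passing an empty specials collection is an assumption here, since labels typically need no PAD or UNK symbols):

from podium import LabelField, Vocab

# A single-value target field backed by a Vocab without special symbols.
label = LabelField(name="label", numericalizer=Vocab(specials=()))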
MultilabelField¶
class podium.MultilabelField(name, tokenizer=None, numericalizer=None, num_of_classes=None, is_target=True, include_lengths=False, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, missing_data_token=-1, pretokenize_hooks=None, posttokenize_hooks=None)[source]¶
Field subclass used to get multihot encoded vectors in batches.
Used in cases when a field can have multiple classes active at a time.
Create a MultilabelField from arguments.
- Parameters
name (str) – Field name, used for referencing data in the dataset.
tokenizer (str | callable | optional) – The tokenizer used when preprocessing raw data. Users can provide their own tokenizer as a callable object or specify one of the registered tokenizers by a string. The available pre-registered tokenizers are:
'split' - default, str.split(). A custom separator can be provided as split-sep, where sep is the separator string.
'spacy-lang' - the spacy tokenizer. The language model can be defined by replacing lang with the language model name, e.g. spacy-en_core_web_sm.
If None, the data will not be tokenized and post-tokenization hooks won't be called. The provided data will be stored in the tokenized data field as-is.
numericalizer (callable) – Object used to numericalize tokens. Can either be a Vocab, a custom numericalization callable or None. If it's a Vocab, this field will update it after preprocessing (or during finalization if eager is False) and use it to numericalize data. The Vocab must not contain any special symbols (like PAD or UNK). If it's a Callable, it will be used to numericalize data token by token. If None, numericalization won't be attempted and batches will be created as lists instead of numpy matrices.
num_of_classes (int, optional) – Number of valid classes. Also defines the size of the numericalized vector. If None, the size of the vocabulary is used.
is_target (bool) – Whether this field is a target variable. Affects iteration over batches.
include_lengths (bool) – Whether the batch representation of this field should include the length of every instance in the batch. If True, the batch element under the name of this Field will be a tuple of (numericalized values, lengths).
allow_missing_data (bool) – Whether the field allows missing data. If 'allow_missing_data' is False and None is sent to be preprocessed, a ValueError will be raised. If 'allow_missing_data' is True and a None is sent to be preprocessed, it will be stored and later numericalized properly.
disable_batch_matrix (bool) – Whether the batch created for this field will be compressed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a matrix of numericalizations for all examples (if possible). If True, a list of unpadded vectors (or other data types) will be returned instead. For missing data, the value in the list will be None.
disable_numericalize_caching (bool) – The flag which determines whether the numericalization of this field should be cached. This flag should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached and reused each time the instance is used as part of a batch. The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) to not be masked, while using the same vocabulary.
missing_data_token (Union[int, float]) – Token used to mark batch rows as missing. If data for a field is missing, its matrix row will be filled with this value. For non-numericalizable fields, this parameter is ignored and the value will be None.
pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
posttokenize_hooks (Iterable[Callable[[Any, List[str]], Tuple[Any, List[str]]]]) – Iterable containing posttokenization hooks. Providing hooks in this way is identical to calling add_posttokenize_hook.
- Raises
ValueError – If the provided Vocab contains special symbols.
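A minimal usage sketch (the field name and class count are illustrative):

from podium import MultilabelField, Vocab

# A multilabel target over a fixed set of classes. The Vocab must not
# contain special symbols, per the constructor documentation.
genres = MultilabelField(
    name="genres",
    numericalizer=Vocab(specials=()),
    num_of_classes=5,
)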
numericalize(data)[source]¶ Numericalize the already preprocessed data point, based either on the vocab that was previously built or on a custom numericalization function if the field doesn't use a vocab. Returns a numpy array containing a multihot encoded vector of num_of_classes length.
- Parameters
data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) preprocessed input data. If the field is sequential, 'raw' is ignored and can be None. Otherwise, 'tokenized' is ignored and can be None.
- Returns
Multihot encoded vector of num_of_classes length.
- Return type
numpy array
- Raises
ValueError – If data is None and missing data is not allowed in this field.
Vocab¶
class podium.Vocab(max_size=None, min_freq=1, specials=('<UNK>', '<PAD>'), keep_freqs=False, eager=False)[source]¶
Class for storing vocabulary. It supports frequency counting and size limiting.
is_finalized¶ True if the vocab is finalized, False otherwise.
- Type
bool
itos¶ list of words
- Type
list
stoi¶ mapping from word string to index
- Type
dict
Vocab constructor. Specials are first in the vocabulary.
- Parameters
max_size (int) – maximal vocab size
min_freq (int) – words with frequency lower than this will be removed
specials (Special | Tuple(Special) | None) – collection of special symbols. Can be None.
keep_freqs (bool) – if True, word frequencies will be saved for later use during finalization
eager (bool) – if True, the frequencies will be built immediately upon dataset loading. The main effect of setting this to True is that the frequencies of the vocabulary will be built based on all datasets that use this vocabulary, while if set to False, the vocabulary will be built by iterating again over the datasets passed as arguments to the finalize_fields function. If you are using multiple datasets and wish to manually control on which subset of dataset splits the vocab is built, eager should be False. If you are using one or multiple large datasets and/or want to build the vocabulary on all of the splits, eager should be set to True for performance optimization (one loop over the datasets instead of two).
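A minimal standalone sketch (tokens are illustrative; the default specials are assumed to include UNK and PAD):

from podium import Vocab

# Build a Vocab from token iterables, then finalize it before use.
vocab = Vocab(max_size=100, min_freq=1)
vocab += ["hello", "world", "hello"]
vocab.finalize()
print(len(vocab))                               # size, specials included
print(vocab.numericalize(["hello", "unseen"]))  # "unseen" maps to the UNK index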
__add__(values)[source]¶ Method allows a vocabulary to be added to the current vocabulary, or a set of values to be added to the vocabulary.
If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of their max_sizes.
- Parameters
values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. When adding two Vocabs with different string values for a special token, only the special token instance with the value from the first operand will be used. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterable's tokens added.
- Returns
Returns a new Vocab.
- Return type
Vocab
- Raises
RuntimeError – If this vocab is finalized and an attempt is made to add values to it, or if the two Vocabs are not both finalized or both unfinalized.
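A minimal sketch of combining vocabularies (tokens are illustrative; both operands are left unfinalized, so their finalization statuses match):

from podium import Vocab

vocab_a = Vocab()
vocab_a += ["apple", "banana"]
vocab_b = Vocab()
vocab_b += ["banana", "cherry"]

# A new Vocab with the token frequencies of both operands; neither
# operand is modified.
merged = vocab_a + vocab_b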
__eq__(other)[source]¶ Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.
- Parameters
other (object) – The object whose equality with this Vocab is to be checked.
- Returns
equal – True if the two vocabs are equal, False otherwise
- Return type
bool
__getitem__(idx_or_token: int) → int[source]¶
__getitem__(idx_or_token: str) → str
Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned. Otherwise, a KeyError is raised.
- Parameters
token (str) – token whose index is to be returned.
- Returns
stoi index of the token.
- Return type
int
- Raises
KeyError – If the passed token has no index and vocab has no UNK special token.
__iadd__(values)[source]¶ Adds additional values or another Vocab to this Vocab.
- Parameters
values (Iterable or Vocab) –
Values to be added to this Vocab. If a Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab. When adding two Vocabs with different string values for a special token, only the special token instance with the value from the LHS operand will be used.
If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.
- Returns
vocab – Returns current Vocab instance to enable chaining
- Return type
Vocab
- Raises
RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.
TypeError – If the values cannot be iterated over.
__iter__()[source]¶ Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.
- Returns
iterator over vocab tokens
- Return type
iter
__len__()[source]¶ Method calculates the vocab length, including special symbols.
- Returns
length – vocab size including special symbols
- Return type
int
finalize()[source]¶ Method finalizes vocab building. It also releases the frequency counter if the user has chosen not to keep it.
- Raises
RuntimeError – If the vocab is already finalized.
classmethod from_itos(itos)[source]¶ Method constructs a vocab from a predefined index-to-string mapping.
- Parameters
itos (list | tuple) – The index-to-string mapping for tokens in the vocabulary
classmethod from_stoi(stoi)[source]¶ Method constructs a vocab from a predefined string-to-index mapping.
- Parameters
stoi (dict) – The string-to-index mapping for the vocabulary
get_freqs()[source]¶ Method obtains vocabulary frequencies.
- Returns
freq – frequency for every word
- Return type
Counter
- Raises
RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.
get_padding_index()[source]¶ Method returns the padding symbol index.
- Returns
pad_symbol_index – padding symbol index in the vocabulary
- Return type
int
- Raises
ValueError – If the padding symbol is not present in the vocabulary.
numericalize(data)[source]¶ Method numericalizes given tokens.
- Parameters
data (str | iter(str)) – a single token or iterable collection of tokens
- Returns
numericalized_vector – numpy array of numericalized tokens
- Return type
array-like
- Raises
RuntimeError – If the vocabulary is not finalized.
reverse_numericalize(numericalized_data, include_unk=False)[source]¶ Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab's itos and no additional processing is done.
- Parameters
numericalized_data (Iterable) – data to be reverse numericalized
- Returns
a list of tokens
- Return type
list
- Raises
RuntimeError – If the vocabulary is not finalized.
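A minimal round-trip sketch, reusing the finalized vocab from the earlier sketch:

ids = vocab.numericalize(["hello", "world"])
tokens = vocab.reverse_numericalize(ids)  # -> ["hello", "world"]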
Special tokens¶
class podium.vocab.Special(token=None)[source]¶
Base class for a special token.
Every special token is a subclass of string (this way one can easily modify the concrete string representation of the special). The functionality of the special token, which acts the same as a post-tokenization hook, should be implemented in the apply instance method of each subclass. We ensure that each special token will be present in the Vocab.
Provides default value initialization for subclasses.
If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.
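A minimal subclass sketch following the convention described above (the token string and apply behavior are illustrative):

from podium.vocab import Special

class BOS(Special):
    token = "<BOS>"

    def apply(self, sequence):
        # Acts as a post-tokenization hook: prepend the
        # beginning-of-sequence token to every tokenized instance.
        return [self] + sequence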
__eq__(other)[source]¶ Check equality via class instead of value.
The motivation behind this is that we want to be able to match the special token by class and not by value, as it is the type of the special token that determines its functionality. This way we allow for the concrete string representation of the special to be easily changed, while retaining simple existence checks for vocab functionality.