Fields and Vocab¶

Field¶

class podium.Field(name, tokenizer='split', keep_raw=False, numericalizer=None, is_target=False, include_lengths=False, fixed_length=None, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, padding_token=-999, missing_data_token=-1, pretokenize_hooks=None, posttokenize_hooks=None)[source]¶

Holds the preprocessing and numericalization logic for a single field of a dataset.

Create a Field from arguments.

Parameters
  • name (str) – Field name, used for referencing data in the dataset.

  • tokenizer (str | callable | optional) –

    The tokenizer used when preprocessing raw data.

    Users can provide their own tokenizers as a callable or specify one of the registered tokenizers by passing a string keyword. Available pre-registered tokenizers are:

    • ’split’ - default, str.split(). A custom separator can be provided as split-sep, where sep is the separator string.

    • ’spacy-lang’ - the spacy tokenizer. The language model can be defined by replacing lang with the language model name. For example, spacy-en_core_web_sm.

    If None, the data will not be tokenized; the provided raw data will be stored in the tokenized data slot as-is.

  • keep_raw (bool) –

    Flag determining whether to store raw data.

    If True, raw data will be stored in the ‘raw’ part of the Example tuple.

  • numericalizer (callable) –

    Object used to numericalize tokens.

    Can be a Vocab, a custom numericalizer callable or None. If the numericalizer is a Vocab instance, the Vocab’s padding token will be used instead of the Field’s. If a callable is passed, it will be used to numericalize data token by token. If None, numericalization won’t be attempted for this Field and batches will be created as lists instead of numpy matrices.

  • is_target (bool) –

    Flag indicating whether this field contains a target (response) variable.

    Affects iteration over batches by separating target and non-target Fields.

  • include_lengths (bool) –

    Flag indicating whether the batch representation of this field should include the length of every instance in the batch.

    If True, the batch element under the name of this Field will be a tuple of (numericalized values, lengths).

  • fixed_length (int, optional) – The length to which the field will be fixed. If set, every example in the field will be truncated or padded to the given length during batching. If the batched data is not a sequence, this parameter is ignored. If None, the batch data will be padded to the length of the longest instance in each minibatch.

  • allow_missing_data (bool) – A flag determining if the Field allows missing data. If allow_missing_data=False and a None value is present in the raw data, a ValueError will be raised. If allow_missing_data=True, and a None value is present in the raw data, it will be stored and numericalized properly. Defaults to False.

  • disable_batch_matrix (bool) – Flag indicating whether the data contained in this Field should be packed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a padded matrix of numericalizations for all examples. If True, a list of unpadded numericalizations will be returned instead. For missing data, the value in the list will be None. Defaults to False.

  • disable_numericalize_caching (bool) –

    Flag which determines whether the numericalization of this field should be cached.

    Should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached upon first computation and reused on each subsequent use.

    The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) not to be masked while using the same Vocab. Defaults to False.

  • padding_token (int) – Padding token used when numericalizer is a callable. If the numericalizer is None or a Vocab instance, this value is ignored. Defaults to -999.

  • missing_data_token (Union[int, float]) – Token used to mark instance data as missing. For non-numericalizable fields, this parameter is ignored and their value will be None. Defaults to -1.

  • pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.

  • posttokenize_hooks (Iterable[Callable[[Any, List[str]], Tuple[Any, List[str]]]]) – Iterable containing posttokenization hooks. Providing hooks in this way is identical to calling add_posttokenize_hook.

Raises

ValueError – If the given tokenizer is not a callable or a string, or is a string that doesn’t correspond to any of the registered tokenizers.
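
A minimal construction sketch (the field names and tokenizers below are illustrative, not prescribed by the API):

    from podium import Field, Vocab

    # A tokenized text field numericalized through a Vocab.
    text = Field(name="text", tokenizer="split", numericalizer=Vocab())

    # Any callable can serve as a tokenizer, e.g. splitting into characters.
    chars = Field(name="chars", tokenizer=list, numericalizer=Vocab())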

__getstate__()[source]¶

Method obtains field state. It is used for pickling dataset data to file.

Returns

state – field state dictionary

Return type

dict

__setstate__(state)[source]¶

Method sets field state. It is used for unpickling dataset data from file.

Parameters

state (dict) – field state dictionary

add_posttokenize_hook(hook)[source]¶

Add a post-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. Post-tokenization hooks are called only if the Field is sequential (in non-sequential fields there is no tokenization and only pre-tokenization hooks are called). The outputs of the final post-tokenization hook are the raw and tokenized data that the preprocess function will use to produce its result.

Posttokenize hooks have the following signature:

def post_tok_hook(raw_data, tokenized_data):
    raw_out, tokenized_out = do_stuff(raw_data, tokenized_data)
    return raw_out, tokenized_out

where ‘tokenized_data’ is an iterable and ‘tokenized_out’ should also be an iterable.

Parameters

hook (callable) – The post-tokenization hook that we want to add to the field.

Raises

ValueError – If the field is declared as non-numericalizable.
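
For instance, a hook that lowercases tokens might look like this (a sketch; the field and hook names are illustrative):

    from podium import Field, Vocab

    text = Field(name="text", numericalizer=Vocab())

    def lowercase_hook(raw, tokenized):
        # Leave the raw data untouched, lowercase every token.
        return raw, [token.lower() for token in tokenized]

    text.add_posttokenize_hook(lowercase_hook)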

add_pretokenize_hook(hook)[source]¶

Add a pre-tokenization hook to the Field. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:

def pre_tok_hook(raw_data):
    raw_data_out = do_stuff(raw_data)
    return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (callable) – The pre-tokenization hook that we want to add to the field.
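
For example, a whitespace-normalizing hook could be added like this (a sketch; the names are illustrative):

    from podium import Field

    text = Field(name="text")

    def normalize_whitespace(raw):
        # Collapse runs of whitespace before tokenization.
        return " ".join(raw.split())

    text.add_pretokenize_hook(normalize_whitespace)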

property eager¶

A flag that tells whether this field has a Vocab and whether that Vocab is marked as eager.

Returns

Whether this field has a Vocab and whether that Vocab is marked as eager

Return type

bool

finalize()[source]¶

Signals that this field’s vocab can be built.

get_default_value()[source]¶

Method obtains default field value for missing data.

Returns

The index of the missing data token, if this field is numericalizable. None value otherwise.

Return type

missing_symbol index or None

Raises

ValueError – If missing data is not allowed in this field.

get_numericalization_for_example(example, cache=True)[source]¶

Returns the numericalized data of this field for the provided example. The numericalized data is generated and cached in the example if ‘cache’ is true and the cached data is not already present. If already cached, the cached data is returned.

Parameters
  • example (Example) – example to get numericalized data for.

  • cache (bool) – whether to cache the calculated numericalization if it is not already cached

Returns

The numericalized data.

Return type

Union[numpy.ndarray, Any]

get_output_fields()[source]¶

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable

property is_finalized¶

Returns whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Returns

Whether the field’s Vocab was finalized. If the field has no vocab, returns True.

Return type

bool

property name¶

The name of this field.

numericalize(data)[source]¶

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) of preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Array of stoi indexes of the tokens, if data exists. None, if data is missing and missing data is allowed.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

preprocess(data)[source]¶

Preprocesses raw data, tokenizing it if required, updating the vocab if the vocab is eager and preserving the raw data if the field’s ‘keep_raw’ is True.

Parameters

data (str or iterable(hashable)) – The raw data that needs to be preprocessed.

Returns

A tuple containing one tuple of the format (field_name, (raw, tokenized)). Raw is set to None if keep_raw is disabled. Both raw and tokenized will be set to None if None is passed as data and allow_missing_data is enabled.

Return type

((str, Iterable(hashable)), )

Raises

ValueError – If data is None and missing data is not allowed.
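
A sketch of the returned structure, assuming a split-tokenized field with keep_raw enabled (the sample sentence is illustrative):

    from podium import Field, Vocab

    text = Field(name="text", numericalizer=Vocab(), keep_raw=True)
    print(text.preprocess("a sample sentence"))
    # Expected shape per the description above:
    # (('text', ('a sample sentence', ['a', 'sample', 'sentence'])),)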

remove_posttokenize_hooks()[source]¶

Remove all the post-tokenization hooks that were added to the Field.

remove_pretokenize_hooks()[source]¶

Remove all the pre-tokenization hooks that were added to the Field.

update_vocab(tokenized)[source]¶

Updates the vocab with a data point in its tokenized form. If the field does not do tokenization, the data point is passed to the vocab as-is.

Parameters

tokenized (Union[Any, List(str)]) – The tokenized form of the data point that the vocab is to be updated with.

property use_vocab¶

A flag that tells whether the field uses a vocab or not.

Returns

Whether the field uses a vocab or not.

Return type

bool

property vocab¶

The field’s Vocab or None.

Returns

Returns the field’s Vocab if defined or None.

Return type

Vocab, optional

MultioutputField¶

class podium.MultioutputField(output_fields, tokenizer='split', pretokenize_hooks=None)[source]¶

Field that performs pretokenization and tokenization once and passes the result to its output fields.

Output fields can be any type of field. The output fields are used only for posttokenization processing (posttokenization hooks and vocab updating).

Parameters
  • output_fields (List[Field]) – List containing the output fields. The pretokenization hooks and tokenizer in these fields are ignored and only posttokenization hooks are used.

  • tokenizer (Optional[Union[str, Callable]]) –

    The tokenizer used when preprocessing raw data. Users can provide their own tokenizer as a callable object or specify one of the premade tokenizers by a string. The available premade tokenizers are:

    • ’split’ - default str.split()

    • ’spacy-lang’ - the spacy tokenizer. The language model can be defined by replacing lang with the language model name. For example spacy-en_core_web_sm

  • pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
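
A sketch of tokenizing once and post-processing twice (the field names and hooks are illustrative):

    from podium import Field, MultioutputField, Vocab

    lower = Field(name="lower", numericalizer=Vocab())
    lower.add_posttokenize_hook(lambda raw, tok: (raw, [t.lower() for t in tok]))

    upper = Field(name="upper", numericalizer=Vocab())
    upper.add_posttokenize_hook(lambda raw, tok: (raw, [t.upper() for t in tok]))

    # Both output fields receive the same tokenizer output.
    text = MultioutputField([lower, upper], tokenizer="split")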

add_output_field(field)[source]¶

Adds the passed field to this field’s output fields.

Parameters

field (Field) – Field to add to output fields.

add_pretokenize_hook(hook)[source]¶

Add a pre-tokenization hook to the MultioutputField. If multiple hooks are added to the field, the order of their execution will be the same as the order in which they were added to the field, each subsequent hook taking the output of the previous hook as its input. If the same function is added to the Field as a hook multiple times, it will be executed that many times. The output of the final pre-tokenization hook is the raw data that the tokenizer will get as its input.

Pretokenize hooks have the following signature:
func pre_tok_hook(raw_data):

raw_data_out = do_stuff(raw_data) return raw_data_out

This can be used to eliminate encoding errors in data, replace numbers and names, etc.

Parameters

hook (Callable[[Any], Any]) – The pre-tokenization hook that we want to add to the field.

get_output_fields()[source]¶

Returns an Iterable of the contained output fields.

Returns

an Iterable of the contained output fields.

Return type

Iterable[Field]

preprocess(data)[source]¶

Preprocesses raw data, tokenizing it if required. The output fields update their vocabs if required and preserve the raw data if the output field’s ‘keep_raw’ is True.

Parameters

data (Any) – The raw data that needs to be preprocessed.

Returns

An Iterable containing the raw and tokenized data of all the output fields. The structure of the returned tuples is (name, (raw, tokenized)), where ‘name’ is the name of the output field and raw and tokenized are processed data.

Return type

Iterable[Tuple[str, Tuple[Optional[Any], Any]]]

Raises

ValueError – If data is None and missing data is not allowed.

remove_pretokenize_hooks()[source]¶

Remove all the pre-tokenization hooks that were added to the MultioutputField.

LabelField¶

class podium.LabelField(name, numericalizer=None, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, is_target=True, missing_data_token=-1, pretokenize_hooks=None)[source]¶

Field subclass used when no tokenization is required.

For example, with a field that has a single value denoting a label.

Parameters
  • name (str) – Field name, used for referencing data in the dataset.

  • numericalizer (callable) – Object used to numericalize tokens. Can either be a Vocab, a custom numericalization callable or None. If it’s a Vocab, this field will update it after preprocessing (or during finalization if eager is False) and use it to numericalize data. Also, the Vocab’s padding token will be used instead of the Field’s. If it’s a callable, it will be used to numericalize data token by token. If None, numericalization won’t be attempted and batches will be created as lists instead of numpy matrices.

  • allow_missing_data (bool) – Whether the field allows missing data. If ‘allow_missing_data’ is False and None is sent to be preprocessed, a ValueError will be raised. If ‘allow_missing_data’ is True and a None is sent to be preprocessed, it will be stored and later numericalized properly.

  • disable_batch_matrix (bool) – Whether the batch created for this field will be compressed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a matrix of numericalizations for all examples (if possible). If True, a list of unpadded vectors (or other data types) will be returned instead. For missing data, the value in the list will be None.

  • disable_numericalize_caching (bool) – The flag which determines whether the numericalization of this field should be cached. This flag should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached and reused each time the instance is used as part of a batch. The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) not to be masked while using the same vocabulary.

  • is_target (bool) – Whether this field is a target variable. Affects iteration over batches.

  • missing_data_token (Union[int, float]) – Token to use to mark batch rows as missing. If data for a field is missing, its matrix row will be filled with this value. For non-numericalizable fields, this parameter is ignored and the value will be None.

  • pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.
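
A typical usage sketch (the field name is illustrative; specials are disabled because labels need no padding or unknown tokens):

    from podium import LabelField, Vocab

    label = LabelField(name="label", numericalizer=Vocab(specials=None))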

MultilabelField¶

class podium.MultilabelField(name, tokenizer=None, numericalizer=None, num_of_classes=None, is_target=True, include_lengths=False, allow_missing_data=False, disable_batch_matrix=False, disable_numericalize_caching=False, missing_data_token=-1, pretokenize_hooks=None, posttokenize_hooks=None)[source]¶

Field subclass used to get multihot encoded vectors in batches.

Used in cases when a field can have multiple classes active at a time.

Create a MultilabelField from arguments.

Parameters
  • name (str) – Field name, used for referencing data in the dataset.

  • tokenizer (str | callable | optional) –

    The tokenizer used when preprocessing raw data. Users can provide their own tokenizer as a callable object or specify one of the registered tokenizers by a string. The available pre-registered tokenizers are:

    • ’split’ - default str.split(). A custom separator can be provided as split-sep, where sep is the separator string.

    • ’spacy-lang’ - the spacy tokenizer. The language model can be defined by replacing lang with the language model name. For example, spacy-en_core_web_sm.

    If None, the data will not be tokenized and post-tokenization hooks won’t be called. The provided data will be stored in the tokenized data field as-is.

  • numericalizer (callable) – Object used to numericalize tokens. Can either be a Vocab, a custom numericalization callable or None. If it’s a Vocab, this field will update it after preprocessing (or during finalization if eager is False) and use it to numericalize data. The Vocab must not contain any special symbols (like PAD or UNK). If it’s a callable, it will be used to numericalize data token by token. If None, numericalization won’t be attempted and batches will be created as lists instead of numpy matrices.

  • num_of_classes (int, optional) – Number of valid classes. Also defines the size of the numericalized vector. If None, the size of the vocabulary is used.

  • is_target (bool) – Whether this field is a target variable. Affects iteration over batches.

  • include_lengths (bool) – Whether the batch representation of this field should include the length of every instance in the batch. If True, the batch element under the name of this Field will be a tuple of (numericalized values, lengths).

  • allow_missing_data (bool) – Whether the field allows missing data. If ‘allow_missing_data’ is False and None is sent to be preprocessed, a ValueError will be raised. If ‘allow_missing_data’ is True and a None is sent to be preprocessed, it will be stored and later numericalized properly.

  • disable_batch_matrix (bool) – Whether the batch created for this field will be compressed into a matrix. If False, the batch returned by an Iterator or Dataset.batch() will contain a matrix of numericalizations for all examples (if possible). If True, a list of unpadded vectors (or other data types) will be returned instead. For missing data, the value in the list will be None.

  • disable_numericalize_caching (bool) – The flag which determines whether the numericalization of this field should be cached. This flag should be set to True if the numericalization can differ between numericalize function calls for the same instance. When set to False, the numericalization values will be cached and reused each time the instance is used as part of a batch. The flag is passed to the numericalizer to indicate use of its nondeterministic setting. This flag is mainly intended to be used in the case of masked language modelling, where we wish the inputs to be masked (nondeterministic) and the outputs (labels) not to be masked while using the same vocabulary.

  • missing_data_token (Union[int, float]) – Token to use to mark batch rows as missing. If data for a field is missing, its matrix row will be filled with this value. For non-numericalizable fields, this parameter is ignored and the value will be None.

  • pretokenize_hooks (Iterable[Callable[[Any], Any]]) – Iterable containing pretokenization hooks. Providing hooks in this way is identical to calling add_pretokenize_hook.

  • posttokenize_hooks (Iterable[Callable[[Any, List[str]], Tuple[Any, List[str]]]]) – Iterable containing posttokenization hooks. Providing hooks in this way is identical to calling add_posttokenize_hook.

Raises

ValueError – If the provided Vocab contains special symbols.
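
A sketch with comma-separated label strings (the names and class count are illustrative; note the Vocab carries no special symbols):

    from podium import MultilabelField, Vocab

    genres = MultilabelField(
        name="genres",
        tokenizer=lambda raw: raw.split(","),
        numericalizer=Vocab(specials=None),
        num_of_classes=20,
    )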

finalize()[source]¶

Signals that this field’s vocab can be built.

numericalize(data)[source]¶

Numericalize the already preprocessed data point based either on the vocab that was previously built, or on a custom numericalization function, if the field doesn’t use a vocab. Returns a numpy array containing a multihot encoded vector of num_of_classes length.

Parameters

data ((hashable, iterable(hashable))) – Tuple of (raw, tokenized) of preprocessed input data. If the field is sequential, ‘raw’ is ignored and can be None. Otherwise, ‘tokenized’ is ignored and can be None.

Returns

Multihot encoded vector of num_of_classes length.

Return type

numpy array

Raises

ValueError – If data is None and missing data is not allowed in this field.

Vocab¶

class podium.Vocab(max_size=None, min_freq=1, specials=('<UNK>', '<PAD>'), keep_freqs=False, eager=False)[source]¶

Class for storing vocabulary. It supports frequency counting and size limiting.

is_finalized¶

true if the vocab is finalized, false otherwise

Type

bool

itos¶

list of words

Type

list

stoi¶

mapping from word string to index

Type

dict

Vocab constructor. Specials are first in the vocabulary.

Parameters
  • max_size (int) – maximal vocab size

  • min_freq (int) – words with frequency lower than this will be removed

  • specials (Special | Tuple(Special) | None) – collection of special symbols. Can be None.

  • keep_freqs (bool) – if True, word frequencies will be saved for later use after finalization

  • eager (bool) – if True, the frequencies will be built immediately upon dataset loading. The main effect of setting this argument to True is that the frequencies of the vocabulary will be built based on all datasets that use this vocabulary; if set to False, the vocabulary will be built by iterating again over the datasets passed as arguments to the finalize_fields function. If you are using multiple datasets and wish to manually control which subset of dataset splits the vocab is built on, eager should be False. If you are using one or multiple large datasets and/or want to build the vocabulary on all of the splits, eager should be set to True for performance (one loop over the datasets instead of two).
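
A build-and-use sketch (the tokens are illustrative):

    from podium import Vocab

    vocab = Vocab(max_size=10000, min_freq=1, keep_freqs=True)
    vocab += ["the", "cat", "sat", "the"]  # update frequencies via __iadd__
    vocab.finalize()                       # build the itos/stoi mappings

    indices = vocab.numericalize(["the", "cat"])  # numpy array of stoi indices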

__add__(values)[source]¶

Method creates a new Vocab by adding another vocabulary or a set of values to the current vocabulary.

If max_size is None for either of the two Vocabs, the max_size of the resulting Vocab will also be None. If both are defined, the max_size of the resulting Vocab will be the sum of the max_sizes.

Parameters

values (Iterable or Vocab) – If Vocab, a new Vocab will be created containing all of the special symbols and tokens from both Vocabs. When adding two Vocabs with different string values for a special token, only the special token instance with the value from the first operand will be used. If Iterable, a new Vocab will be returned containing a copy of this Vocab with the iterable’s tokens added.

Returns

Returns a new Vocab

Return type

Vocab

Raises

RuntimeError – If this vocab is finalized and values are added to it, or if one of the Vocabs is finalized and the other is not.

__eq__(other)[source]¶

Two vocabs are equal if they have the same finalization status, the same stoi and itos mappings, and the same frequency counters.

Parameters

other (object) – object whose equality with this Vocab is checked

Returns

equal – True if the two vocabs are equal, False otherwise

Return type

bool

__getitem__(idx_or_token: int) → str[source]¶
__getitem__(idx_or_token: str) → int

Returns the token index of the passed token. If the passed token has no index and the vocab contains the UNK special token, the UNK token index is returned. Otherwise, a KeyError is raised.

Parameters

token (str) – token whose index is to be returned.

Returns

stoi index of the token.

Return type

int

Raises

KeyError – If the passed token has no index and vocab has no UNK special token.
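
A lookup sketch (the tokens are illustrative; UNK is one of the default specials):

    from podium import Vocab

    vocab = Vocab()
    vocab += ["cat", "dog"]
    vocab.finalize()

    known = vocab["cat"]    # stoi index of a known token
    unknown = vocab["yak"]  # falls back to the UNK token index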

__iadd__(values)[source]¶

Adds additional values or another Vocab to this Vocab.

Parameters

values (Iterable or Vocab) –

Values to be added to this Vocab. If Vocab, all of the token frequencies and specials from that Vocab will be added to this Vocab. When adding two Vocabs with different string values for a special token, only the special token instance with the value from the LHS operand will be used.

If Iterable, all of the tokens from the Iterable will be added to this Vocab, increasing the frequencies of those tokens.

Returns

vocab – Returns current Vocab instance to enable chaining

Return type

Vocab

Raises
  • RuntimeError – If the current vocab is finalized, if ‘values’ is a string or if the RHS Vocab doesn’t contain token frequencies.

  • TypeError – If the values cannot be iterated over.

__iter__()[source]¶

Method returns an iterator over the vocabulary. If the vocabulary is not finalized, iteration is done over the frequency counter and special symbols are not included; otherwise, it is performed over itos and special symbols are included.

Returns

iterator over vocab tokens

Return type

iter

__len__()[source]¶

Method calculates the vocab length, including special symbols.

Returns

length – vocab size including special symbols

Return type

int

finalize()[source]¶

Method finalizes vocab building. It also releases the frequency counter if the user chose not to keep frequencies.

Raises

RuntimeError – If the vocab is already finalized.

classmethod from_itos(itos)[source]¶

Method constructs a vocab from a predefined index-to-string mapping.

Parameters

itos (list | tuple) – The index-to-string mapping for tokens in the vocabulary

classmethod from_stoi(stoi)[source]¶

Method constructs a vocab from a predefined string-to-index mapping.

Parameters

stoi (dict) – The string-to-index mapping for the vocabulary
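
A construction sketch from predefined mappings (the token strings are illustrative):

    from podium import Vocab

    v1 = Vocab.from_itos(["<UNK>", "<PAD>", "cat", "dog"])
    v2 = Vocab.from_stoi({"<UNK>": 0, "<PAD>": 1, "cat": 2, "dog": 3})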

get_freqs()[source]¶

Method obtains vocabulary frequencies.

Returns

freq – mapping of frequency for every word

Return type

Counter

Raises

RuntimeError – If the user chose not to keep frequencies and the vocab is finalized.

get_padding_index()[source]¶

Method returns padding symbol index.

Returns

pad_symbol_index – padding symbol index in the vocabulary

Return type

int

Raises

ValueError – If the padding symbol is not present in the vocabulary.

numericalize(data)[source]¶

Method numericalizes given tokens.

Parameters

data (str | iter(str)) – a single token or iterable collection of tokens

Returns

numericalized_vector – numpy array of numericalized tokens

Return type

array-like

Raises

RuntimeError – If the vocabulary is not finalized.

reverse_numericalize(numericalized_data, include_unk=False)[source]¶

Transforms an iterable containing numericalized data into a list of tokens. The tokens are read from this Vocab’s itos and no additional processing is done.

Parameters

numericalized_data (Iterable) – data to be reverse numericalized

Returns

a list of tokens

Return type

list

Raises

RuntimeError – If the vocabulary is not finalized.
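
A round-trip sketch (the tokens are illustrative):

    from podium import Vocab

    vocab = Vocab()
    vocab += ["the", "cat", "sat"]
    vocab.finalize()

    indices = vocab.numericalize(["the", "cat"])
    tokens = vocab.reverse_numericalize(indices)  # ["the", "cat"]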

Special tokens¶

class podium.vocab.Special(token=None)[source]¶

Base class for a special token.

Every special token is a subclass of string (this way one can easily modify the concrete string representation of the special). The functionality of the special token, which acts the same as a post-tokenization hook, should be implemented in the apply instance method for each subclass. We ensure that each special token will be present in the Vocab.

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

__eq__(other)[source]¶

Check equals via class instead of value.

The motivation behind this is that we want to be able to match the special token by class and not by value, as it is the type of the special token that determines its functionality. This way we allow for the concrete string representation of the special to be easily changed, while retaining simple existence checks for vocab functionality.

__hash__()[source]¶

Overrides hash.

Check docs of __eq__ for motivation.

static __new__(cls, token=None)[source]¶

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

apply(sequence)[source]¶

Apply (insert) the special token in the adequate place in the sequence.

By default, returns the unchanged sequence.
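
A subclassing sketch following the rules above (the MASK special is hypothetical, not part of podium):

    from podium.vocab import Special

    class MASK(Special):
        # Concrete string representation of this (hypothetical) special.
        token = "<MASK>"

        def apply(self, sequence):
            # Default behaviour: return the sequence unchanged. A real
            # masking special would replace chosen tokens here.
            return sequence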

The unknown token¶

class podium.vocab.UNK(token=None)[source]¶

The unknown core special token.

Functionality handled by Vocab.

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

The padding token¶

class podium.vocab.PAD(token=None)[source]¶

The padding core special token.

Functionality handled by Vocab.

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

The beginning-of-sequence token¶

class podium.vocab.BOS(token=None)[source]¶

The beginning-of-sequence special token.

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

apply(sequence)[source]¶

Apply the BOS token, adding it to the start of the sequence.

The end-of-sequence token¶

class podium.vocab.EOS(token=None)[source]¶

The end-of-sequence special token.

Provides default value initialization for subclasses.

If creating a new instance without a string argument, the token class attribute must be set in the subclass implementation.

apply(sequence)[source]¶

Apply the EOS token, adding it to the end of the sequence.
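
A sketch applying both sequence-boundary specials (the default string forms "<BOS>" and "<EOS>" are assumed here):

    from podium.vocab import BOS, EOS

    sequence = ["hello", "world"]
    sequence = BOS().apply(sequence)  # ["<BOS>", "hello", "world"]
    sequence = EOS().apply(sequence)  # ["<BOS>", "hello", "world", "<EOS>"]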