Iterators¶

Iterator¶

class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, disable_batch_matrix=False, internal_random_state=None)[source]¶

An iterator that batches data from a dataset after numericalization.

Creates an iterator for the given dataset and batch size.

Parameters
  • dataset (DatasetBase) – The dataset to iterate over.

  • batch_size (int) – Batch size for batched iteration. If the dataset size is not a multiple of batch_size the last returned batch will be smaller (len(dataset) % batch_size).

  • sort_key (callable) – A callable used to sort instances within a batch. If None, batch instances won’t be sorted. Default is None.

  • shuffle (bool) – Flag denoting whether examples should be shuffled prior to each epoch. Default is True.

  • seed (int) – The initial random seed. Only used if shuffle=True. Raises ValueError if shuffle=True, internal_random_state=None and seed=None. Default is 1.

  • matrix_class (callable) – The constructor for the return batch datatype. Defaults to np.array. When working with deep learning frameworks such as TensorFlow and PyTorch, setting this argument allows customization of the batch datatype.

  • internal_random_state (tuple) –

    The random state that the iterator will be initialized with. Obtained by calling getstate() on a random.Random instance, and exposed through the Iterator.get_internal_random_state method.

    For most use-cases, setting the random seed will suffice. This argument is useful when we want to stop iteration at a certain batch of the dataset and later continue exactly where we left off.

    If None, the Iterator will create its own random state from the given seed. Only relevant if shuffle=True. A ValueError is raised if shuffle=True, internal_random_state=None and seed=None. Default is None.

Raises

ValueError – If shuffle=True and both seed and internal_random_state are None.
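The sort_key callable receives a single dataset instance and returns a comparison key. A minimal sketch of the idea, using plain dicts as stand-ins for dataset examples (the instance structure here is an illustrative assumption, not podium's actual Example type):

```python
# Hypothetical sort key: order instances within a batch by text length.
# Plain dicts stand in for numericalized dataset examples.
def sort_key(instance):
    return len(instance["text"])

batch = [
    {"text": ["a", "b", "c"]},
    {"text": ["a"]},
    {"text": ["a", "b"]},
]
batch.sort(key=sort_key)
lengths = [len(inst["text"]) for inst in batch]
print(lengths)  # [1, 2, 3]
```

Sorting instances by length within a batch keeps similarly-sized sequences together, which reduces the padding needed when they are stacked into a matrix.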

__iter__()[source]¶

Returns an iterator over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch are dict subclasses which unpack to values instead of keys:

>>> batch = Batch({
...    'a': np.array([0]),
...    'b': np.array([1])
... })
>>> a, b = batch
>>> a
array([0])
>>> b
array([1])

Batch keys correspond to dataset Field names. Batch values are by default numpy ndarrays, although the data type can be changed through the matrix_class argument. Rows correspond to dataset instances, while each element is a numericalized value of the input.

Returns

Iterator over batches of examples in the dataset.

Return type

iter
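The value-unpacking behavior shown above can be mimicked with a dict subclass whose __iter__ yields values instead of keys (a sketch of the idea, not podium's actual Batch implementation):

```python
class ValueDict(dict):
    """Dict subclass that unpacks to values instead of keys."""

    def __iter__(self):
        # Plain dicts iterate over keys; yield values instead so that
        # tuple unpacking binds the stored arrays directly.
        return iter(self.values())

batch = ValueDict({"a": [0], "b": [1]})
a, b = batch          # unpacks to values, not keys
print(a, b)           # [0] [1]
```

Because dicts preserve insertion order (Python 3.7+), the unpacked values arrive in the order the Fields were added.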

__len__()[source]¶

Returns the number of batches this iterator provides in one epoch.

Returns

Number of batches provided in one epoch.

Return type

int

get_internal_random_state()[source]¶

Returns the internal random state of the iterator.

Useful if we want to stop iteration at a certain batch and later continue exactly at that batch.

Only used if shuffle=True.

Returns

The internal random state of the iterator.

Return type

tuple

Raises

RuntimeError – If shuffle=False.
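The save/restore round trip can be illustrated with Python's standard random module, on which the iterator's shuffling is presumably built (podium's exact internals are an assumption here):

```python
import random

rng = random.Random(1)                 # analogous to Iterator(seed=1)
state = rng.getstate()                 # ~ get_internal_random_state()

first = [rng.random() for _ in range(3)]

rng2 = random.Random()
rng2.setstate(state)                   # ~ set_internal_random_state(state)
resumed = [rng2.random() for _ in range(3)]

print(first == resumed)  # True: the sequence resumes exactly where saved
```

Persisting the state tuple (e.g. by pickling it alongside a model checkpoint) lets a later run shuffle in exactly the same order.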

reset()[source]¶

Reset the epoch and iteration counter of the Iterator.

set_dataset(dataset)[source]¶

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (DatasetBase) – Dataset to iterate over.

set_internal_random_state(state)[source]¶

Sets the internal random state of the iterator.

Useful if we want to stop iteration at a certain batch and later continue exactly at that batch.

Only used if shuffle=True.

Raises

RuntimeError – If shuffle=False.

property batch_size¶

The batch size of the iterator.

property epoch¶

The current epoch of the Iterator.

property iterations¶

The number of batches returned so far in the current epoch.

property matrix_class¶

The class constructor of the batch matrix.

BucketIterator¶

class podium.datasets.BucketIterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, internal_random_state=None, look_ahead_multiplier=100, bucket_sort_key=None)[source]¶

Creates a bucket iterator which uses a look-ahead heuristic to batch examples in a way that minimizes the amount of necessary padding.

Uses a bucket of size look_ahead_multiplier x batch_size, and sorts instances within the bucket before splitting into batches, minimizing necessary padding.

Creates a BucketIterator with the given bucket sort key and look-ahead multiplier (how many batch_sizes to look ahead when sorting examples for batches).

Parameters
  • look_ahead_multiplier (int) – Multiplier of batch_size which determines the size of the look-ahead bucket. If look_ahead_multiplier == 1, then the BucketIterator behaves like a normal Iterator. If look_ahead_multiplier >= (num_examples / batch_size), then the BucketIterator behaves like a normal iterator that sorts the whole dataset. Default is 100.

  • bucket_sort_key (callable) – The callable object used to sort examples in the bucket. If bucket_sort_key=None, then the sort_key must not be None, otherwise a ValueError is raised. Default is None.

Raises

ValueError – If sort_key and bucket_sort_key are both None.
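The padding saved by bucketing can be quantified with a small sketch: sort a bucket of variable-length sequences by length before slicing it into batches, and compare total padding against the unsorted order (the sequence lengths here are arbitrary illustration data, not from podium):

```python
def padding_cost(lengths, batch_size):
    """Total number of padding elements when each batch is padded
    to the length of its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total

lengths = [3, 17, 5, 12, 4, 16, 6, 11]   # arbitrary sequence lengths

unsorted_cost = padding_cost(lengths, batch_size=2)
sorted_cost = padding_cost(sorted(lengths), batch_size=2)
print(unsorted_cost, sorted_cost)  # 38 4
```

Sorting the whole bucket before batching groups similar lengths together, which is exactly the trade-off look_ahead_multiplier controls: a larger bucket means better grouping but less shuffling variety between epochs.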

SingleBatchIterator¶

class podium.datasets.SingleBatchIterator(dataset=None, shuffle=True, add_padding=True)[source]¶

Iterator that creates one batch per epoch containing all examples in the dataset.

Creates an Iterator that creates one batch per epoch containing all examples in the dataset.

Parameters
  • dataset (DatasetBase) – The dataset to iterate over.

  • shuffle (bool) – Flag denoting whether examples should be shuffled before each epoch. Default is True.

  • add_padding (bool) – Flag denoting whether to add padding to batches yielded by the iterator. If set to False, numericalized Fields will be returned as Python lists of matrix_class instances.
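The add_padding distinction can be sketched with plain lists (an illustration of the concept, not podium's internal padding logic):

```python
# Numericalized rows of a Field, with variable lengths.
rows = [[1, 2, 3], [4], [5, 6]]

# add_padding=True: pad every row to the longest length so the Field
# can be returned as one rectangular matrix (padding value 0 assumed).
width = max(len(r) for r in rows)
padded = [r + [0] * (width - len(r)) for r in rows]
print(padded)   # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]

# add_padding=False: the ragged rows are kept as a plain list.
print(rows)     # [[1, 2, 3], [4], [5, 6]]
```

Skipping padding is useful when downstream code (e.g. a custom collate step) wants to handle variable lengths itself.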

set_dataset(dataset)[source]¶

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (DatasetBase) – Dataset to iterate over.

HierarchicalIterator¶

class podium.datasets.HierarchicalIterator(dataset=None, batch_size=32, sort_key=None, shuffle=False, seed=1, matrix_class=<built-in function array>, internal_random_state=None, context_max_length=None, context_max_depth=None)[source]¶

Iterator used to create batches for Hierarchical Datasets.

Creates batches as lists of matrices. In the returned batch, every attribute corresponds to a field in the dataset. For every field in the dataset, the batch contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of each matrix contain numericalized representations of the examples that make up that example's context, with the representation of the example itself in the last row of its own context matrix.

Creates an iterator for the given dataset and batch size.

Parameters
  • dataset (DatasetBase) – The dataset to iterate over.

  • batch_size (int) – Batch size for batched iteration. If the dataset size is not a multiple of batch_size the last returned batch will be smaller (len(dataset) % batch_size).

  • sort_key (callable) – A callable used to sort instances within a batch. If None, batch instances won’t be sorted. Default is None.

  • shuffle (bool) – Flag denoting whether examples should be shuffled prior to each epoch. Default is False.

  • seed (int) – The initial random seed. Only used if shuffle=True. Raises ValueError if shuffle=True, internal_random_state=None and seed=None. Default is 1.

  • matrix_class (callable) –

    The constructor for the return batch datatype. Defaults to np.array. When working with deep learning frameworks such as TensorFlow and PyTorch, setting this argument allows customization of the batch datatype.

  • internal_random_state (tuple) –

    The random state that the iterator will be initialized with. Obtained by calling getstate() on a random.Random instance, and exposed through the Iterator.get_internal_random_state method.

    For most use-cases, setting the random seed will suffice. This argument is useful when we want to stop iteration at a certain batch of the dataset and later continue exactly where we left off.

    If None, the Iterator will create its own random state from the given seed. Only relevant if shuffle=True. A ValueError is raised if shuffle=True, internal_random_state=None and seed=None. Default is None.

  • context_max_depth (int) – The maximum depth of the context retrieved for an example in the batch. While generating the context, the iterator will take ‘context_max_depth’ levels above the example and the root node of the last level, e.g. if 0 is passed, the context generated for an example will contain all examples in the level of the example in the batch and the root example of that level. If None, this depth limit will be ignored.

  • context_max_length (int) – The maximum length of the context retrieved for an example in the batch. The number of rows in the generated context matrix will be (if needed) truncated to context_max_length - 1. If None, this length limit will be ignored.

Raises

ValueError – If shuffle is True and both seed and internal_random_state are None.
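The context structure described above can be sketched for a flat parent-pointer representation of a hierarchical dataset. The tree layout, field values, and the simple ancestor-chain context below are illustrative assumptions; podium's actual context retrieval also includes siblings at each level and truncates to context_max_length - 1 rows:

```python
# Hypothetical hierarchical dataset: each example stores its parent's
# name and a numericalized vector.
examples = {
    "root":  {"parent": None,    "vec": [0]},
    "reply": {"parent": "root",  "vec": [1]},
    "leaf":  {"parent": "reply", "vec": [2]},
}

def context_matrix(name, context_max_length=None):
    """Rows of the context for `name`: ancestors first, the example
    itself in the last row, optionally truncated to the last rows."""
    chain = []
    while name is not None:
        chain.append(examples[name]["vec"])
        name = examples[name]["parent"]
    chain.reverse()                    # root first, example itself last
    if context_max_length is not None:
        chain = chain[-context_max_length:]
    return chain

print(context_matrix("leaf"))      # [[0], [1], [2]]
print(context_matrix("leaf", 2))   # [[1], [2]]
```

The key invariant matches the description above: the example's own representation occupies the last row of its context matrix, and length/depth limits trim rows from the top (the most distant context) first.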

set_dataset(dataset)[source]¶

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters

dataset (DatasetBase) – Dataset to iterate over.