Iterators¶

Iterator¶

class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, internal_random_state=None)[source]¶

An iterator that batches data from a dataset after numericalization.

Creates an iterator for the given dataset and batch size.

Parameters

dataset (DatasetBase) – The dataset whose examples the iterator will iterate over.
batch_size (int) – The size of the batches that the iterator will return. If the number of examples in the dataset is not a multiple of batch_size, the last returned batch will be smaller and contain dataset_len mod batch_size examples.
sort_key (callable) – A callable object used to sort the instances in a batch. If None, the batch instances won’t be sorted. Default is None.
shuffle (bool) – A flag denoting whether the examples should be shuffled before each iteration. If sort_key is not None, setting this flag to True may have little effect, since the dataset is always sorted after being shuffled; the only difference shuffling can then make is in the relative order of elements with the same value of sort_key. Default is True.
seed (int) – The seed that the iterator’s internal random state will be initialized with. Useful when we want repeatable random shuffling. Only relevant if shuffle is True. If shuffle is True and internal_random_state is None, then this must not be None, otherwise a ValueError is raised. Default is 1.
matrix_class (callable) – The constructor for the return batch datatype. Defaults to np.array. When working with deep learning frameworks such as tensorflow and torch, setting the argument accordingly will immediately provide batches in the appropriate framework. Not delegated to keyword arguments so users can pass a function which also immediately casts the batch data to the GPU.
internal_random_state (tuple) – The random state that the iterator will be initialized with. A state of this form can be obtained by calling getstate() on a Random instance, and the iterator's own state is exposed through the Iterator.get_internal_random_state method. In most cases setting the random seed suffices; this argument is useful when we want to stop iteration and later continue where we left off. If None, the iterator creates its own random state from the given seed, which can later be retrieved and stored. Only relevant if shuffle is True. If shuffle is True and seed is None, then this must not be None, otherwise a ValueError is raised. Default is None.

Raises

ValueError – If shuffle is True and both seed and internal_random_state are None.
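
The interaction between shuffle and sort_key described above can be illustrated with a minimal standard-library sketch. It does not use podium; the helper shuffle_then_sort is hypothetical and assumes a stable sort is applied after the shuffle, as the parameter description states.

```python
import random


def shuffle_then_sort(examples, sort_key, seed):
    """Hypothetical helper: shuffle with a seeded RNG, then sort by sort_key."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    # sorted() is stable, so examples with equal keys keep the relative
    # order that the shuffle gave them; everything else is fixed by the key.
    return sorted(shuffled, key=sort_key)


examples = ["bb", "a", "cc", "ddd", "ee"]
first = shuffle_then_sort(examples, len, seed=1)
second = shuffle_then_sort(examples, len, seed=2)

# The sequence of key values is identical regardless of the seed;
# only the order among "bb", "cc" and "ee" can differ.
lengths_first = [len(x) for x in first]
lengths_second = [len(x) for x in second]
```

Whatever the seed, the lengths come out in ascending order, so shuffling only changes which of the equal-length strings appears first.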

__iter__()[source]¶

Returns an iterator object that knows how to iterate over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch objects have attributes that correspond to the names of input fields and target fields (respectively) of the dataset. The values of those attributes are numpy matrices, whose rows are the numericalized values of that field in the examples that are in the batch. Rows of sequential fields (that are of variable length) are all padded to a common length. The common length is either the fixed_length attribute of the field or, if that is not given, the maximum length of all examples in the batch.

Returns: Iterator that iterates over batches of examples in the dataset.
Return type: iter
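
The padding rule described above can be sketched in plain Python. This is an illustration rather than podium's implementation: the helper pad_batch, the pad value 0, and the truncation of rows longer than fixed_length are all assumptions made for the example.

```python
def pad_batch(rows, fixed_length=None, pad_value=0):
    """Hypothetical helper: pad variable-length rows to one common length."""
    # Use fixed_length when given, otherwise the longest row in the batch.
    if fixed_length is not None:
        target = fixed_length
    else:
        target = max(len(row) for row in rows)

    padded = []
    for row in rows:
        # Assumption: rows longer than the target are truncated.
        row = list(row)[:target]
        padded.append(row + [pad_value] * (target - len(row)))
    return padded


batch = pad_batch([[4, 7, 2], [5], [9, 3]])
fixed = pad_batch([[4, 7, 2], [5], [9, 3]], fixed_length=2)
```

Here batch has rows of length 3 (the longest example), while fixed has rows of length 2 because fixed_length takes precedence.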

__len__()[source]¶

Returns the number of batches this iterator provides in one epoch.

Returns: Number of batches provided in one epoch.
Return type: int
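
As a rough check of the arithmetic, the number of batches per epoch and the size of the last batch follow from the dataset length and batch_size. The sketch below assumes the count is the ceiling of dataset_len / batch_size, which is consistent with the batch_size description; batch_sizes is a hypothetical helper, not part of podium.

```python
import math


def batch_sizes(dataset_len, batch_size):
    """Hypothetical helper: number of batches and the size of each one."""
    n_batches = math.ceil(dataset_len / batch_size)
    sizes = [batch_size] * (dataset_len // batch_size)
    remainder = dataset_len % batch_size
    if remainder:
        # The last batch holds the leftover examples.
        sizes.append(remainder)
    return n_batches, sizes


n_batches, sizes = batch_sizes(100, 32)
```

With 100 examples and a batch size of 32 this gives four batches, the last of which holds the remaining 4 examples.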

property batch_size¶: The batch size of the iterator.

property epoch¶: The current epoch of the Iterator.

get_internal_random_state()[source]¶

Returns the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can store the random state obtained with this method and later initialize another iterator with the same random state and continue iterating.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Returns: The internal random state of the iterator.
Return type: tuple
Raises: RuntimeError – If shuffle is False.

property iterations¶: Number of batches returned for the current epoch.

property matrix_class¶: The class constructor of the batch matrix.

reset()[source]¶: Reset the epoch and iteration counter of the Iterator.

set_dataset(dataset)[source]¶

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters: dataset (DatasetBase) – Dataset to iterate over.

set_internal_random_state(state)[source]¶

Sets the internal random state of the iterator.

Useful when we want to stop iteration and later continue where we left off. We can take the random state previously obtained from another iterator to initialize this iterator with the same state and continue iterating where the previous iterator stopped.

Only to be called if shuffle is True, otherwise a RuntimeError is raised.

Raises: RuntimeError – If shuffle is False.
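
The stop-and-resume pattern behind get_internal_random_state and set_internal_random_state can be sketched with Python's random.Random alone. This assumes the iterator's state behaves like the tuple returned by Random.getstate(), as the internal_random_state description suggests; no podium objects are used here.

```python
import random

data = list(range(10))
rng = random.Random(1)

# First "epoch": shuffle, then save the state as it stands afterwards.
epoch1 = data[:]
rng.shuffle(epoch1)
saved_state = rng.getstate()

# Second "epoch" on the original generator.
epoch2 = data[:]
rng.shuffle(epoch2)

# A fresh generator restored from the saved state reproduces epoch 2.
resumed = random.Random()
resumed.setstate(saved_state)
epoch2_again = data[:]
resumed.shuffle(epoch2_again)
```

Because the restored generator starts from exactly the saved state, the second shuffle is reproduced, which is what allows iteration to continue where it left off.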

BucketIterator¶

class podium.datasets.BucketIterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, internal_random_state=None, look_ahead_multiplier=100, bucket_sort_key=None)[source]¶

Creates a bucket iterator that uses a look-ahead heuristic to try and batch examples in a way that minimizes the amount of necessary padding.

It creates a bucket of size look_ahead_multiplier x batch_size and sorts that bucket before splitting it into batches, so that examples of similar length end up together and less padding is necessary.

Creates a BucketIterator with the given bucket sort key and look-ahead multiplier (how many batch_sizes to look ahead when sorting examples for batches).

Parameters

look_ahead_multiplier (int) – Number that denotes how many times the look-ahead bucket is larger than the batch_size. If look_ahead_multiplier == 1, then BucketIterator behaves like a normal iterator except with sorting within the batches. If look_ahead_multiplier >= (num_examples / batch_size), then BucketIterator behaves like a normal iterator that sorts the whole dataset. Default is 100.
bucket_sort_key (callable) – The callable object used to sort the examples in the bucket that is to be batched. If bucket_sort_key is None, then sort_key must not be None, otherwise a ValueError is raised. Default is None.

Raises

ValueError – If sort_key and bucket_sort_key are both None.
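
The look-ahead heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not podium's implementation: bucket_batches is a hypothetical helper, and shuffling and the per-batch sort_key are left out.

```python
def bucket_batches(examples, batch_size, look_ahead_multiplier, bucket_sort_key):
    """Hypothetical helper: sort look-ahead buckets, then cut them into batches."""
    bucket_size = batch_size * look_ahead_multiplier
    for start in range(0, len(examples), bucket_size):
        # Sort only the examples inside the current look-ahead bucket.
        bucket = sorted(examples[start:start + bucket_size], key=bucket_sort_key)
        # Neighbouring examples now have similar keys, so each batch
        # needs less padding.
        for i in range(0, len(bucket), batch_size):
            yield bucket[i:i + batch_size]


examples = ["aaaa", "b", "ccc", "dd", "eeeee", "f", "gg", "hhh"]
batches = list(
    bucket_batches(examples, batch_size=2, look_ahead_multiplier=2, bucket_sort_key=len)
)
```

With a bucket of four examples, the short strings are grouped together and the long ones are grouped together, so the length difference inside each batch is smaller than it would be if the examples were batched in their original order.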

SingleBatchIterator¶

class podium.datasets.SingleBatchIterator(dataset=None, shuffle=True)[source]¶

Iterator that creates one batch per epoch containing all examples in the dataset.

Creates an Iterator that creates one batch per epoch containing all examples in the dataset.

Parameters

dataset (DatasetBase) – The dataset whose examples the iterator will iterate over.
shuffle (bool) – A flag denoting whether the examples should be shuffled before each iteration. If sort_key is not None, setting this flag to True may have little effect, since the dataset is always sorted after being shuffled; the only difference shuffling can then make is in the relative order of elements with the same value of sort_key. Default is True.

set_dataset(dataset)[source]¶

Sets the dataset for this Iterator to iterate over. Resets the epoch count.

Parameters: dataset (DatasetBase) – Dataset to iterate over.