Iterators¶
Iterator¶
class podium.datasets.Iterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, disable_batch_matrix=False, internal_random_state=None)[source]¶

An iterator that batches data from a dataset after numericalization.

Creates an iterator for the given dataset and batch size.
- Parameters
  - dataset (DatasetBase) – The dataset to iterate over.
  - batch_size (int) – Batch size for batched iteration. If the dataset size is not a multiple of batch_size, the last returned batch will be smaller (len(dataset) % batch_size).
  - sort_key (callable) – A callable used to sort instances within a batch. If None, batch instances won’t be sorted. Default is None.
  - shuffle (bool) – Flag denoting whether examples should be shuffled prior to each epoch. Default is True.
  - seed (int) – The initial random seed. Only used if shuffle=True. Raises ValueError if shuffle=True, internal_random_state=None and seed=None. Default is 1.
  - matrix_class (callable) – The constructor for the returned batch datatype. Defaults to np.array. When working with deep learning frameworks such as TensorFlow and PyTorch, setting this argument allows customization of the batch datatype.
  - internal_random_state (tuple) – The random state that the iterator will be initialized with. Obtained by calling .getstate on an instance of the Random object, exposed through the Iterator.get_internal_random_state method. For most use cases, setting the random seed will suffice. This argument is useful when we want to stop iteration at a certain batch of the dataset and later continue exactly where we left off. If None, the Iterator will create its own random state from the given seed. Only relevant if shuffle=True. A ValueError is raised if shuffle=True, internal_random_state=None and seed=None. Default is None.
- Raises
  ValueError – If shuffle=True and both seed and internal_random_state are None.
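The batching behaviour described above — a fixed batch_size, a smaller final batch, and seed-driven shuffling — can be sketched in plain Python. This is an illustrative sketch, not Podium’s implementation; the function name is hypothetical:

```python
import random

import numpy as np


def iterate_batches(dataset, batch_size=32, shuffle=True, seed=1):
    """Yield numericalized batches; the last batch may be smaller."""
    indices = list(range(len(dataset)))
    if shuffle:
        # A Random seeded up front stands in for the iterator's
        # internal random state.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield np.array([dataset[i] for i in chunk])


# 10 instances with batch_size=4 -> batches of sizes 4, 4 and 2,
# since len(dataset) % batch_size == 2.
data = list(range(10))
sizes = [len(b) for b in iterate_batches(data, batch_size=4)]
print(sizes)  # [4, 4, 2]
```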
- __iter__()[source]¶

  Returns an iterator over the given dataset. The iterator yields tuples in the form (input_batch, target_batch). The input_batch and target_batch are dict subclasses which unpack to values instead of keys:

  >>> batch = Batch({
  ...     'a': np.array([0]),
  ...     'b': np.array([1])
  ... })
  >>> a, b = batch
  >>> a
  array([0])
  >>> b
  array([1])

  Batch keys correspond to dataset Field names. Batch values are by default numpy ndarrays, although the data type can be changed through the matrix_class argument. Rows correspond to dataset instances, while each element is a numericalized value of the input.

  - Returns
    Iterator over batches of examples in the dataset.
  - Return type
    iter
- __len__()[source]¶

  Returns the number of batches this iterator provides in one epoch.

  - Returns
    Number of batches provided in one epoch.
  - Return type
    int
- get_internal_random_state()[source]¶

  Returns the internal random state of the iterator.

  Useful if we want to stop iteration at a certain batch, and later continue exactly at that batch. Only used if shuffle=True.

  - Returns
    The internal random state of the iterator.
  - Return type
    tuple
  - Raises
    RuntimeError – If shuffle=False.
- set_dataset(dataset)[source]¶

  Sets the dataset for this Iterator to iterate over. Resets the epoch count.

  - Parameters
    dataset (DatasetBase) – Dataset to iterate over.
- set_internal_random_state(state)[source]¶

  Sets the internal random state of the iterator.

  Useful if we want to stop iteration at a certain batch, and later continue exactly at that batch. Only used if shuffle=True.

  - Raises
    RuntimeError – If shuffle=False.
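The stop-and-resume pattern these two methods enable mirrors Python’s own random-state API. A minimal sketch with the stdlib (no Podium required) showing why saving and restoring the state continues the stream exactly where it left off:

```python
import random

rng = random.Random(1)
first_epoch = [rng.random() for _ in range(3)]

# "Pause": capture the generator's state mid-stream, as
# get_internal_random_state exposes for the Iterator.
state = rng.getstate()
next_vals = [rng.random() for _ in range(3)]

# "Resume": a fresh generator restored from the saved state
# (the role of set_internal_random_state) continues identically.
restored = random.Random()
restored.setstate(state)
resumed_vals = [restored.random() for _ in range(3)]

print(resumed_vals == next_vals)  # True
```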
- property batch_size¶

  The batch size of the iterator.

- property epoch¶

  The current epoch of the Iterator.

- property iterations¶

  The number of batches returned so far in the current epoch.

- property matrix_class¶

  The class constructor of the batch matrix.
BucketIterator¶
class podium.datasets.BucketIterator(dataset=None, batch_size=32, sort_key=None, shuffle=True, seed=1, matrix_class=<built-in function array>, internal_random_state=None, look_ahead_multiplier=100, bucket_sort_key=None)[source]¶

A bucket iterator which uses a look-ahead heuristic to batch examples in a way that minimizes the amount of necessary padding.

Uses a bucket of size N x batch_size and sorts instances within the bucket before splitting it into batches, minimizing necessary padding.

Creates a BucketIterator with the given bucket sort key and look-ahead multiplier (how many batch_sizes to look ahead when sorting examples for batches).

- Parameters
  - look_ahead_multiplier (int) – Multiplier of batch_size which determines the size of the look-ahead bucket. If look_ahead_multiplier == 1, the BucketIterator behaves like a normal Iterator. If look_ahead_multiplier >= (num_examples / batch_size), the BucketIterator behaves like a normal Iterator that sorts the whole dataset. Default is 100.
  - bucket_sort_key (callable) – The callable object used to sort examples in the bucket. If bucket_sort_key=None, then sort_key must not be None, otherwise a ValueError is raised. Default is None.
- Raises
  ValueError – If sort_key and bucket_sort_key are both None.
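The look-ahead heuristic can be sketched in a few lines: collect batch_size * look_ahead_multiplier examples, sort them, then split into batches so similar lengths land together. This is an illustrative sketch of the idea, not Podium’s code, and the function names are hypothetical:

```python
def bucket_batches(examples, batch_size, look_ahead_multiplier, sort_key):
    """Sort within look-ahead buckets, then split each bucket into batches."""
    bucket_size = batch_size * look_ahead_multiplier
    for i in range(0, len(examples), bucket_size):
        bucket = sorted(examples[i:i + bucket_size], key=sort_key)
        for j in range(0, len(bucket), batch_size):
            yield bucket[j:j + batch_size]


def padding_cost(batches):
    # Cells of padding needed to make each batch rectangular.
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)


# Sequences of wildly varying length: sorting inside the bucket keeps
# similar lengths together, so each batch needs far less padding.
seqs = [[0] * n for n in (9, 1, 8, 2, 7, 3, 6, 4)]
plain = [seqs[i:i + 2] for i in range(0, len(seqs), 2)]
bucketed = list(bucket_batches(seqs, batch_size=2,
                               look_ahead_multiplier=4, sort_key=len))
print(padding_cost(plain), padding_cost(bucketed))  # 20 4
```

With look_ahead_multiplier=4 and batch_size=2, the single bucket covers all eight examples, so this run behaves like sorting the whole dataset — the >= (num_examples / batch_size) case described above.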
SingleBatchIterator¶
class podium.datasets.SingleBatchIterator(dataset=None, shuffle=True, add_padding=True)[source]¶

Iterator that creates one batch per epoch containing all examples in the dataset.

- Parameters
  - dataset (DatasetBase) – The dataset to iterate over.
  - shuffle (bool) – Flag denoting whether examples should be shuffled before each epoch. Default is True.
  - add_padding (bool) – Flag denoting whether to add padding to batches yielded by the iterator. If set to False, numericalized Fields will be returned as python lists of matrix_class instances.
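The add_padding flag amounts to the difference between one rectangular matrix and a ragged list of per-example arrays. A minimal numpy sketch of that distinction (illustrative, not Podium’s implementation):

```python
import numpy as np

rows = [[1, 2, 3], [4], [5, 6]]  # numericalized, variable-length examples

# add_padding=True: pad every row to the longest one, yielding one matrix.
width = max(len(r) for r in rows)
padded = np.array([r + [0] * (width - len(r)) for r in rows])
print(padded.shape)  # (3, 3)

# add_padding=False: keep a plain python list of per-example arrays.
ragged = [np.array(r) for r in rows]
print([a.shape for a in ragged])  # [(3,), (1,), (2,)]
```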
HierarchicalIterator¶
class podium.datasets.HierarchicalIterator(dataset=None, batch_size=32, sort_key=None, shuffle=False, seed=1, matrix_class=<built-in function array>, internal_random_state=None, context_max_length=None, context_max_depth=None)[source]¶

Iterator used to create batches for hierarchical datasets.
Creates batches as lists of matrices. In the returned batch, every attribute corresponds to a field in the dataset. For every field in the dataset, the batch contains a list of matrices, where every matrix represents the context of an example in the batch. The rows of a matrix contain numericalized representations of the examples that make up the context of an example in the batch with the representation of the example itself being in the last row of its own context matrix.
Creates an iterator for the given dataset and batch size.
- Parameters
  - dataset (DatasetBase) – The dataset to iterate over.
  - batch_size (int) – Batch size for batched iteration. If the dataset size is not a multiple of batch_size, the last returned batch will be smaller (len(dataset) % batch_size).
  - sort_key (callable) – A callable used to sort instances within a batch. If None, batch instances won’t be sorted. Default is None.
  - shuffle (bool) – Flag denoting whether examples should be shuffled prior to each epoch. Default is False.
  - seed (int) – The initial random seed. Only used if shuffle=True. Raises ValueError if shuffle=True, internal_random_state=None and seed=None. Default is 1.
  - matrix_class (callable) – The constructor for the returned batch datatype. Defaults to np.array. When working with deep learning frameworks such as TensorFlow and PyTorch, setting this argument allows customization of the batch datatype.
  - internal_random_state (tuple) – The random state that the iterator will be initialized with. Obtained by calling .getstate on an instance of the Random object, exposed through the Iterator.get_internal_random_state method. For most use cases, setting the random seed will suffice. This argument is useful when we want to stop iteration at a certain batch of the dataset and later continue exactly where we left off. If None, the Iterator will create its own random state from the given seed. Only relevant if shuffle=True. A ValueError is raised if shuffle=True, internal_random_state=None and seed=None. Default is None.
  - context_max_depth (int) – The maximum depth of the context retrieved for an example in the batch. While generating the context, the iterator will take ‘context_max_depth’ levels above the example and the root node of the last level, e.g. if 0 is passed, the context generated for an example will contain all examples in the level of the example in the batch and the root example of that level. If None, this depth limit will be ignored.
  - context_max_length (int) – The maximum length of the context retrieved for an example in the batch. The number of rows in the generated context matrix will be (if needed) truncated to context_max_length - 1. If None, this length limit will be ignored.
- Raises
ValueError – If shuffle is True and both seed and internal_random_state are None.
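The effect of context_max_length can be sketched as a simple truncation: keep at most context_max_length - 1 of the most recent context rows, with the example itself always in the last row of its context matrix. The helper below is hypothetical and only illustrates the row-count behaviour described above, not the library’s context-building logic:

```python
import numpy as np


def build_context(context_rows, example_row, context_max_length=None):
    """Stack context rows with the example itself as the last row,
    truncating the context to context_max_length - 1 rows if set."""
    if context_max_length is not None:
        # Keep only the most recent rows of context.
        context_rows = context_rows[-(context_max_length - 1):]
    return np.array(context_rows + [example_row])


context = [[1, 1], [2, 2], [3, 3], [4, 4]]  # numericalized context examples
matrix = build_context(context, [9, 9], context_max_length=3)
print(matrix.shape)         # (3, 2): 2 context rows kept, plus the example
print(matrix[-1].tolist())  # [9, 9]: the example occupies the last row
```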