Dataset classes¶
Dataset¶
-
class
podium.
Dataset
(examples, fields, sort_key=None)[source]¶ A general purpose container for datasets.
A dataset is a shallow wrapper for a list of Example instances which contain the dataset data as well as the corresponding Field instances, which (pre)process the columns of each example.
Creates a dataset with the given examples and their fields.
- Parameters
examples (list) – A list of examples.
fields (list) – A list of fields that the examples have been created with.
sort_key (callable) – A key to use for sorting dataset examples, used for batching together examples with similar lengths to minimize padding.
-
__getattr__
(field)¶ Returns an Iterator iterating over values of the field with the given name for every example in the dataset.
- Parameters
field_name (str) – The name of the field whose values are to be returned.
- Returns
An iterable over values of the referenced Field for every dataset Example
- Return type
iterable
- Raises
AttributeError – If there is no Field with the given name.
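For illustration, a minimal sketch of attribute-style access, assuming this dataset was built with a Field named "text" (the field name is an assumption):
for text_value in dataset.text:  # iterate over the "text" value of every example
    print(text_value)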
-
__getitem__
(i)[source]¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, with the exception of the example number and ordering. See the wiki for detailed examples.
Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.
Example:
example = dataset[1]         # Indexing by a single integer returns a single example
new_dataset = dataset[1:10]  # Multi-indexing returns a new dataset containing
                             # the indexed examples.
- Parameters
i (int or slice or iterable) – Index used to index examples.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
__iter__
()[source]¶ Iterates over all examples in the dataset in order.
- Yields
example – Yields examples in the dataset.
-
__len__
()[source]¶ Returns the number of examples in the dataset.
- Returns
The number of examples in the dataset.
- Return type
int
-
as_dict
(include_raw=False)¶ Converts the entire dataset to a dictionary, where the field names map to lists of processed Examples.
- Parameters
include_raw (bool) – A flag denoting whether raw data should be included in the output dictionary, which can be used to debug your preprocessing pipeline. Defaults to False.
- Returns
The entire dataset as a python dict. Field names are keys, values are lists of (raw, tokenized) data if include_raw is set to True or tokenized data otherwise.
- Return type
dataset_as_dict
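A minimal usage sketch; the field names "text" and "label" are assumptions for illustration:
data = dataset.as_dict()                         # {"text": [...], "label": [...]}
debug_data = dataset.as_dict(include_raw=True)   # values are lists of (raw, tokenized) pairs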
-
batch
(add_padding=False)¶ Creates an input and target batch containing the whole dataset. The format of the batch is the same as the batches returned by the Iterator class.
- Parameters
add_padding (bool) – A flag indicating whether the dataset should be padded when returned as a single batch. Please note that setting this argument to True can consume a large amount of memory since the dataset will be expanded to [num_instances, max_size] as every instance in the dataset needs to be padded to the size of the longest one. Defaults to False
- Returns
Two objects containing the input and target batches over the whole dataset.
- Return type
input_batch, target_batch
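A short sketch of batching the whole dataset; the attribute names on the returned batch objects depend on your Field names and are assumptions here:
input_batch, target_batch = dataset.batch(add_padding=True)
# e.g. input_batch.text and target_batch.label, if Fields with those names exist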
-
field
(name)¶ Returns the Field in a dataset with a given name, if it exists.
- Returns
The referenced Field or None if no Field with the given name exists.
- Return type
Field or None
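For example (assuming a Field named "text" exists in this dataset):
text_field = dataset.field("text")   # the Field named "text", or None if it doesn't exist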
-
filter
(predicate, inplace=False)[source]¶ Filters the examples in this dataset with the given predicate.
- Parameters
predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept (i.e. not filtered out), and False otherwise.
inplace (bool, default False) – If True, filter the examples in place and return None.
-
filtered
(predicate)[source]¶ Filters examples with given predicate and returns a new DatasetBase instance containing those examples.
- Parameters
predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept (i.e. not filtered out), and False otherwise.
- Returns
A new DatasetBase instance containing only the Examples for which predicate returned True.
- Return type
DatasetBase
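An illustrative predicate; the field name "text", the dict-style access and the (raw, tokenized) layout of the value are assumptions:
def keep_long(example):
    raw, tokenized = example["text"]      # assumed field name and value layout
    return len(tokenized) >= 5            # keep examples with at least 5 tokens

long_only = dataset.filtered(keep_long)   # new dataset, the original is unchanged
dataset.filter(keep_long, inplace=True)   # or filter this dataset in place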
-
finalize_fields
(*datasets)¶ Builds vocabularies of all the non-eager fields in the dataset from the Dataset objects given as *datasets, and then finalizes all the fields.
- Parameters
*datasets – A variable number of DatasetBase objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
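A typical call sequence, sketched with assumed dataset names (building vocabularies from the train split only is a common convention, not a requirement):
train_set.finalize_fields()                      # build vocabularies from train_set itself
dataset.finalize_fields(train_set, valid_set)    # or build them from other datasets explicitly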
-
static
from_dataset
(dataset)[source]¶ Creates a Dataset instance from a podium.datasets.DatasetBase instance.
- Parameters
dataset (DatasetBase) – DatasetBase instance to be used to create the Dataset.
- Returns
Dataset instance created from the passed DatasetBase instance.
- Return type
Dataset
-
classmethod
from_pandas
(df, fields, index_field=None)[source]¶ Creates a Dataset instance from a pandas Dataframe.
- Parameters
df (pandas.Dataframe) – Pandas dataframe from which data will be taken.
fields (Union[Dict[str, Field], List[Field]]) –
A mapping from dataframe columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.
A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).
If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the dataframe).
If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored.
index_field (Optional[Field]) – Field which will be used to process the index column of the Dataframe. If None, the index column will be ignored.
- Returns
Dataset containing data from the Dataframe
- Return type
Dataset
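A minimal sketch of loading from a dataframe; the column names, Field configuration and use of Vocab are illustrative assumptions:
import pandas as pd
from podium import Dataset, Field, LabelField, Vocab

df = pd.DataFrame({"text": ["a short sentence", "another one"], "label": ["pos", "neg"]})
text = Field("text", numericalizer=Vocab())
label = LabelField("label")
dataset = Dataset.from_pandas(df, fields={"text": text, "label": label})
dataset.finalize_fields()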
-
get
(i, deep_copy=False)[source]¶ Returns an example or a new dataset containing the indexed examples.
If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, with the exception of the example number and ordering. See the wiki for detailed examples.
Example:
# Indexing by a single integer returns a single example
example = dataset.get(1)
# Same as the first example, but returns a deep copy of the example
example_copy = dataset.get(1, deep_copy=True)
# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)
new_dataset_copy = dataset.get(s, deep_copy=True)
- Parameters
i (int or slice or iterable) – Index used to index examples.
deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.
- Returns
If i is an int, a single example will be returned. If i is a slice or iterable, a dataset containing only the indexed examples will be returned.
- Return type
single example or Dataset
-
numericalize_examples
()[source]¶ Generates and caches numericalized data for every example in the dataset.
Call this method before using the dataset to precompute and cache the numericalized values, avoiding lazy numericalization during iteration. The main benefit is that the speed of the first epoch will then be consistent with that of each subsequent one.
-
shuffle_examples
(random_state=None)[source]¶ Shuffles the examples in this dataset in-place.
- Parameters
random_state (int) – The random seed used for shuffling.
-
shuffled
()¶ Creates a new DatasetBase instance containing all Examples, but in shuffled order.
- Returns
A new DatasetBase instance containing all Examples, but in shuffled order.
- Return type
DatasetBase
-
sorted
(key, reverse=False)¶ Creates a new DatasetBase instance in which all Examples are sorted according to the value returned by key.
- Parameters
key (callable) – specifies a function of one argument that is used to extract a comparison key from each Example.
reverse (bool) – If set to True, then the list elements are sorted as if each comparison were reversed.
- Returns
A new DatasetBase instance with sorted Examples.
- Return type
DatasetBase
-
split
(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)[source]¶ Creates train-(validation)-test splits from this dataset.
The splits are new Dataset objects, each containing a part of this one’s examples.
- Parameters
split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).
stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.
strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.
random_state (int) – The random seed used for shuffling.
- Returns
Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.
- Return type
tuple[Dataset]
- Raises
ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but would result in at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
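An illustrative sketch of splitting; the stratification field name "label" is an assumption:
train_set, test_set = dataset.split(split_ratio=0.8, random_state=42)
train_set, valid_set, test_set = dataset.split(
    split_ratio=[0.7, 0.1, 0.2],
    stratified=True,
    strata_field_name="label",   # assumed name of a label Field
)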
-
to_pandas
(include_raw=False)¶ Creates a pandas dataframe containing all data from this Dataset. By default, only processed data is kept in the dataframe.
If include_raw is True, raw data will also be stored under the column name {field name}_raw, e.g. for a field called ‘Text’, the raw data column name would be Text_raw.
When creating pandas dataframes from large DiskBackedDatasets, take care to avoid exhausting memory, as the whole dataset is loaded into memory.
- Parameters
include_raw (bool) – Whether to include raw data in the dataframe.
- Returns
Pandas dataframe containing all examples from this Dataset.
- Return type
DataFrame
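For example (assuming a Field named "Text", as in the description above):
df = dataset.to_pandas(include_raw=True)
print(df.columns)   # e.g. includes "Text" and "Text_raw"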
-
property
examples
¶ List containing all Examples.
-
property
field_dict
¶ Dictionary mapping the Dataset’s field names to the respective Fields.
-
property
fields
¶ List containing all fields of this dataset.
TabularDataset¶
-
class
podium.datasets.
TabularDataset
(path, fields, format='csv', line2example=None, skip_header=False, csv_reader_params={}, **kwargs)[source]¶ A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.
Creates a TabularDataset from a file containing the data rows and an object containing all the fields that we are interested in.
- Parameters
path (str) – Path to the data file.
fields ((list | dict)) –
A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.
A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).
If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.
If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).
format (str) – The format of the data file. Has to be either “CSV”, “TSV”, “JSON” (case-insensitive). Ignored if line2example is not None. Defaults to “CSV”.
line2example (callable) – The function mapping from a file line to Fields. In case your dataset is not in one of the standardized formats, you can provide a function which performs a custom split for each input line.
skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.
delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.
csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
- Raises
ValueError – If the given format is not one of “CSV”, “TSV” or “JSON” and line2example is not set. If fields is given as a dict and skip_header is True. If format is “JSON” and skip_header is True.
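A minimal construction sketch; the file path, column names and Field settings are assumptions for illustration:
from podium import Field, LabelField, Vocab
from podium.datasets import TabularDataset

text = Field("text", numericalizer=Vocab())
label = LabelField("label")
# The dict keys refer to column names in the CSV header;
# the Fields determine the example attribute names ("text", "label")
dataset = TabularDataset(
    "data/reviews.csv",          # assumed path
    format="csv",
    fields={"review": text, "sentiment": label},
)
dataset.finalize_fields()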
DiskBackedDataset¶
-
class
podium.datasets.arrow.
DiskBackedDataset
(table, fields, cache_path, mmapped_file, data_types=None)[source]¶ Podium dataset implementation which uses PyArrow as its data storage backend.
Examples are stored in a file which is then memory mapped for fast random access.
Creates a new DiskBackedDataset instance. Users should use static constructor functions like ‘from_dataset’ to construct new DiskBackedDataset instances.
- Parameters
table (pyarrow.Table) – Table object that contains example data.
fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the examples in ‘table’.
cache_path (Optional[str]) – Path to the directory where the cache file is saved.
mmapped_file (pyarrow.MemoryMappedFile) – Open MemoryMappedFile descriptor of the cache file.
data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.
-
__getitem__
(item)[source]¶ Returns an example or a new DiskBackedDataset containing the indexed examples. If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned.
Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.
Usage example:
example = dataset[1]                    # Indexing by a single integer returns a single example
new_dataset = dataset[1:10:2]           # Multi-indexing returns a new dataset containing
                                        # the indexed examples.
new_dataset_2 = dataset[(1, 5, 6, 9)]   # Returns a new dataset containing the
                                        # indexed Examples.
- Parameters
item (int or slice or iterable) – Index used to index examples.
- Returns
If item is an int, a single example will be returned. If item is a slice, list or tuple, a copy of this dataset containing only the indexed examples will be returned.
- Return type
Example or Dataset
-
__iter__
()[source]¶ Iterates over Examples in this dataset.
- Returns
Iterator over all the examples in this dataset.
- Return type
Iterator[Example]
-
__len__
()[source]¶ Returns the number of Examples in this Dataset.
- Returns
The number of Examples in this Dataset.
- Return type
int
-
close
()[source]¶ Closes resources held by the DiskBackedDataset.
Only closes the cache file handle. The cache will not be deleted from disk. For cache deletion, use delete_cache.
-
delete_cache
()[source]¶ Deletes the cache directory and all cache files belonging to this dataset.
After this call is executed, any DiskBackedDataset created by slicing/indexing this dataset and any view over this dataset will not be usable any more. Any dataset created from this dataset should be dumped to a new directory before calling this method.
-
dump_cache
(cache_path=None)[source]¶ Saves this dataset at cache_path. Dumped datasets can be loaded with the DiskBackedDataset.load_cache static method. All fields contained in this dataset must be serializable using pickle.
- Parameters
cache_path (Optional[str]) –
Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.
If None, a temporary directory will be created.
- Returns
The chosen cache directory path. Useful when cache_path is None and a temporary directory is created.
- Return type
str
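A round-trip sketch (disk_dataset is assumed to be an existing DiskBackedDataset instance):
from podium.datasets.arrow import DiskBackedDataset

cache_dir = disk_dataset.dump_cache(cache_path=None)   # dumps into a new temporary directory
# ... later, or in another process:
loaded = DiskBackedDataset.load_cache(cache_dir)
loaded.delete_cache()                                   # remove the cache directory when done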
-
static
from_dataset
(dataset, cache_path=None, data_types=None)[source]¶ Creates a DiskBackedDataset instance from a podium.datasets.DatasetBase instance.
- Parameters
dataset (DatasetBase) – DatasetBase instance to be used to create the DiskBackedDataset.
cache_path (Optional[str]) –
Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.
If None, a temporary directory will be created.
data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.
- Returns
DiskBackedDataset instance created from the passed DatasetBase instance.
- Return type
DiskBackedDataset
-
static
from_examples
(fields, examples, cache_path=None, data_types=None, chunk_size=1024)[source]¶ Creates a DiskBackedDataset from the provided Examples.
- Parameters
fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the Examples.
examples (Iterable[Example]) – Iterable of examples.
cache_path (Optional[str]) –
Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.
If None, a temporary directory will be created.
data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.
chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.
- Returns
DiskBackedDataset instance created from the passed Examples.
- Return type
DiskBackedDataset
-
classmethod
from_pandas
(df, fields, index_field=None, cache_path=None, data_types=None, chunk_size=1024)[source]¶ Creates a DiskBackedDataset instance from a pandas Dataframe.
- Parameters
df (pandas.Dataframe) – Pandas dataframe from which data will be taken.
fields (Union[Dict[str, Field], List[Field]]) –
A mapping from dataframe columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.
A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).
If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the dataframe).
If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored.
index_field (Optional[Field]) – Field which will be used to process the index column of the Dataframe. If None, the index column will be ignored.
cache_path (Optional[str]) –
Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.
If None, a temporary directory will be created.
data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.
chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.
- Returns
Dataset containing data from the Dataframe
- Return type
DiskBackedDataset
-
static
from_tabular_file
(path, format, fields, cache_path=None, data_types=None, chunk_size=10000, skip_header=False, delimiter=None, csv_reader_params=None)[source]¶ Loads a tabular file format (csv, tsv, json) as a DiskBackedDataset.
- Parameters
path (str) – Path to the data file.
format (str) – The format of the data file. Has to be either “CSV”, “TSV”, or “JSON” (case-insensitive).
fields (Union[Dict[str, Field], List[Field]]) –
A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.
A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).
If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.
If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).
cache_path (Optional[str]) –
Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.
If None, a temporary directory will be created.
data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.
chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.
skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.
delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.
csv_reader_params (Dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
- Returns
DiskBackedDataset instance containing the examples from the tabular file.
- Return type
DiskBackedDataset
-
static
load_cache
(cache_path)[source]¶ Loads a cached DiskBackedDataset contained in the cache_path directory. Fields will be loaded into memory, but the Example data will be memory mapped, avoiding unnecessary memory usage.
- Parameters
cache_path (str) – Path to the cache directory from which to load the dataset, i.e. the directory previously used by (or returned from) dump_cache.
- Returns
the DiskBackedDataset loaded from the passed cache directory.
- Return type
DiskBackedDataset
HFDatasetConverter¶
-
class
podium.datasets.hf.
HFDatasetConverter
(hf_dataset, fields=None)[source]¶ Class for converting rows from the HuggingFace Datasets to podium.Examples.
HFDatasetConverter constructor.
- Parameters
hf_dataset (datasets.Dataset) – HuggingFace Dataset.
fields (dict(str, podium.Field)) – Dictionary that maps a column name of the dataset to a podium.Field. If None is passed, the default feature conversion rules will be used to build a dictionary from the dataset features.
- Raises
TypeError – If dataset is not an instance of datasets.Dataset.
-
as_dataset
()[source]¶ Convert the original HuggingFace dataset to a podium.Dataset.
- Returns
podium.Dataset instance.
- Return type
Dataset
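A minimal conversion sketch, assuming the HuggingFace datasets library is installed and using its IMDB dataset for illustration:
import datasets
from podium.datasets.hf import HFDatasetConverter

hf_imdb = datasets.load_dataset("imdb", split="train")
converter = HFDatasetConverter(hf_imdb)    # uses the default feature conversion rules
podium_imdb = converter.as_dataset()       # a podium.Dataset
podium_imdb.finalize_fields()              # build vocabularies for the converted Fields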
-
static
from_dataset_dict
(dataset_dict, fields=None, cast_to_podium=False)[source]¶ Copies the keys of the given dictionary and converts the corresponding HuggingFace Datasets to HFDatasetConverter instances.
- Parameters
dataset_dict (dict(str, datasets.Dataset)) – Dictionary that maps dataset names to HuggingFace Datasets.
cast_to_podium (bool) – Determines whether to immediately convert the HuggingFace dataset to Podium dataset (if True), or shallowly wrap the HuggingFace dataset in the HFDatasetConverter class. The HFDatasetConverter class currently doesn’t support full Podium functionality and will not work with other components in the library.
- Returns
Dictionary that maps dataset names to HFDatasetConverter instances.
- Return type
dict(str, Union[HFDatasetConverter, podium.Dataset])
- Raises
TypeError – If the given argument is not a dictionary.
CoNLLUDataset¶
-
class
podium.datasets.
CoNLLUDataset
(file_path, fields=None)[source]¶ A CoNLL-U dataset class.
This class uses all default CoNLL-U fields.
Dataset constructor.
- Parameters
file_path (str) – Path to the file containing the dataset.
fields (Dict[str, Field]) – Dictionary that maps a CoNLL-U field name to the corresponding field. If None is passed, the default set of fields will be used.
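For example (the file path is an assumption):
from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("data/en_ewt-ud-train.conllu")   # uses the default CoNLL-U fields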
Built-in datasets (EN)¶
Stanford Sentiment Treebank¶
-
class
podium.datasets.impl.
SST
(file_path, fields, fine_grained=False, subtrees=False)[source]¶ The Stanford Sentiment Treebank dataset.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the SST dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
Dataset constructor. Users should use the static method get_dataset_splits rather than using the constructor directly.
- Parameters
dir_path (str) – path to the directory containing datasets
fields (dict(str, Field)) – dictionary that maps field name to the field
fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. If this is False, please set your Fields not to be eager.
subtrees (bool) – If True, also returns the subtrees of each input instance as separate instances. This causes the dataset to become much larger.
-
static
get_dataset_splits
(fields=None, fine_grained=False, subtrees=False)[source]¶ Loads and creates the dataset splits for the SST dataset.
- Parameters
fields (dict(str, Field), optional) – Dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.
fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. If this is False, please set your Fields not to be eager.
subtrees (bool) – If True, also returns the subtrees of each input instance as separate instances. This causes the dataset to become much larger.
- Returns
(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset
- Return type
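A minimal usage sketch with the default fields:
from podium.datasets.impl import SST

sst_train, sst_valid, sst_test = SST.get_dataset_splits()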
-
Internet Movie DataBase¶
-
class
podium.datasets.impl.
IMDB
(dir_path, fields)[source]¶ A simple IMDB dataset class that contains only the supervised data and uses the raw, unprocessed data files.
-
NAME
¶ dataset name
- Type
str
-
URL
¶ url to the imdb dataset
- Type
str
-
DATASET_DIR
¶ name of the folder in the dataset containing train and test directories
- Type
str
-
ARCHIVE_TYPE
¶ string that defines archive type, used for unpacking dataset
- Type
str
-
TRAIN_DIR
¶ name of the training directory
- Type
str
-
TEST_DIR
¶ name of the directory containing test examples
- Type
str
-
POSITIVE_LABEL_DIR
¶ name of the subdirectory containing examples with positive sentiment
- Type
str
-
NEGATIVE_LABEL_DIR
¶ name of the subdirectory containing examples with negative sentiment
- Type
str
-
TEXT_FIELD_NAME
¶ name of the field containing comment text
- Type
str
-
LABEL_FIELD_NAME
¶ name of the field containing label value
- Type
str
-
POSITIVE_LABEL
¶ positive sentiment label
- Type
int
-
NEGATIVE_LABEL
¶ negative sentiment label
- Type
int
Dataset constructor. Users should use the static method get_dataset_splits rather than calling the constructor directly.
- Parameters
dir_path (str) – path to the directory containing datasets
fields (dict(str, Field)) – dictionary that maps field name to the field
-
static
get_dataset_splits
(fields=None)[source]¶ Creates the train and test datasets for the IMDB dataset.
- Parameters
fields (dict(str, Field), optional) – Dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.
- Returns
(train_dataset, test_dataset) – tuple containing train dataset and test dataset
- Return type
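A minimal usage sketch with the default fields:
from podium.datasets.impl import IMDB

imdb_train, imdb_test = IMDB.get_dataset_splits()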
-
Stanford Natural Language Inference¶
-
class
podium.datasets.impl.
SNLI
(file_path, fields)[source]¶ An SNLI dataset class. Unlike SNLISimple, this class includes all the fields of the SNLI dataset by default.
-
NAME
¶ Name of the Dataset.
- Type
str
-
URL
¶ URL to the SNLI dataset.
- Type
str
-
DATASET_DIR
¶ Name of the directory in which the dataset files are stored.
- Type
str
-
ARCHIVE_TYPE
¶ Archive type, i.e. compression method used for archiving the downloaded dataset file.
- Type
str
-
TRAIN_FILE_NAME
¶ Name of the file in which the train dataset is stored.
- Type
str
-
TEST_FILE_NAME
¶ Name of the file in which the test dataset is stored.
- Type
str
-
DEV_FILE_NAME
¶ Name of the file in which the dev (validation) dataset is stored.
- Type
str
-
ANNOTATOR_LABELS_FIELD_NAME
¶ Name of the field containing annotator labels
- Type
str
-
CAPTION_ID_FIELD_NAME
¶ Name of the field containing caption ID
- Type
str
-
GOLD_LABEL_FIELD_NAME
¶ Name of the field containing gold label
- Type
str
-
PAIR_ID_FIELD_NAME
¶ Name of the field containing pair ID
- Type
str
-
SENTENCE1_FIELD_NAME
¶ Name of the field containing sentence1
- Type
str
-
SENTENCE1_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 parse
- Type
str
-
SENTENCE1_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence1 binary parse
- Type
str
-
SENTENCE2_FIELD_NAME
¶ Name of the field containing sentence2
- Type
str
-
SENTENCE2_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 parse
- Type
str
-
SENTENCE2_BINARY_PARSE_FIELD_NAME
¶ Name of the field containing sentence2 binary parse
- Type
str
Dataset constructor. This method should not be used directly; use get_train_test_dev_dataset instead.
- Parameters
file_path (str) – Path to the .jsonl file containing the dataset.
fields (dict(str, Field)) – A dictionary that maps field names to Field objects.
-
static
get_default_fields
()[source]¶ Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse.
- Returns
fields – Dictionary mapping field names to respective Fields.
- Return type
dict(str, Field)
Notes
This dataset includes both parses (the standard and the binary parse) for every sentence.
-
static
get_train_test_dev_dataset
(fields=None)[source]¶ Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.
- Parameters
fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.
- Returns
(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.
- Return type
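A minimal usage sketch with the default fields:
from podium.datasets.impl import SNLI

snli_train, snli_test, snli_dev = SNLI.get_train_test_dev_dataset()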
-
Cornell Movie Dialogs¶
-
class
podium.datasets.impl.
CornellMovieDialogs
(data, fields=None)[source]¶ Cornell Movie Dialogs dataset which contains sentences and replies from movies.
Dataset constructor.
- Parameters
data (CornellMovieDialogsNamedTuple) – cornell movie dialogs data
fields (dict(str : Field)) – dictionary that maps field name to the field
- Raises
ValueError – If given data is None.