Dataset classes

Dataset

class podium.Dataset(examples, fields, sort_key=None)[source]

A general purpose container for datasets.

A dataset is a shallow wrapper for a list of Example instances which contain the dataset data as well as the corresponding Field instances, which (pre)process the columns of each example.

Creates a dataset with the given examples and their fields.

Parameters
  • examples (list) – A list of examples.

  • fields (list) – A list of fields that the examples have been created with.

  • sort_key (callable) – A key to use for sorting dataset examples, used for batching together examples with similar lengths to minimize padding.

__getattr__(field)

Returns an Iterator iterating over values of the field with the given name for every example in the dataset.

Parameters

field_name (str) – The name of the field whose values are to be returned.

Returns

An iterable over values of the referenced Field for every dataset Example

Return type

iterable

Raises

AttributeError – If there is no Field with the given name.
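Example (a minimal sketch; the field name text is hypothetical, and each yielded value is the data stored for that field on the corresponding Example, typically a (raw, tokenized) pair):

for value in dataset.text:
    print(value)  # the data stored for the 'text' field of each example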

__getitem__(i)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples selected by the index will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Example:

example = dataset[1] # Indexing by single integer returns a single example

new_dataset = dataset[1:10] # Multi-indexing returns a new dataset containing
                            # the indexed examples.
Parameters

i (int or slice or iterable) – Index used to index examples.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

__iter__()[source]

Iterates over all examples in the dataset in order.

Yields

example – Yields examples in the dataset.

__len__()[source]

Returns the number of examples in the dataset.

Returns

The number of examples in the dataset.

Return type

int

as_dict(include_raw=False)

Converts the entire dataset to a dictionary, where each field name maps to a list of that field’s processed data for every Example.

Parameters

include_raw (bool) – A flag denoting whether raw data should be included in the output dictionary, which can be used to debug your preprocessing pipeline. Defaults to False.

Returns

The entire dataset as a python dict. Field names are keys, values are lists of (raw, tokenized) data if include_raw is set to True or tokenized data otherwise.

Return type

dict
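A minimal usage sketch (the field names text and label are hypothetical):

data = dataset.as_dict()                           # {'text': [...], 'label': [...]} with tokenized data
data_with_raw = dataset.as_dict(include_raw=True)  # values become lists of (raw, tokenized) pairs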

batch(add_padding=False)

Creates an input and target batch containing the whole dataset. The format of the batches is the same as that of the batches returned by the Iterator class.

Parameters

add_padding (bool) – A flag indicating whether the dataset should be padded when returned as a single batch. Please note that setting this argument to True can consume a large amount of memory since the dataset will be expanded to [num_instances, max_size] as every instance in the dataset needs to be padded to the size of the longest one. Defaults to False

Returns

Two objects containing the input and target batches over the whole dataset.

Return type

input_batch, target_batch
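A usage sketch; the attributes on the returned batches are assumed to be named after the dataset’s fields (text and label here are hypothetical names):

input_batch, target_batch = dataset.batch(add_padding=True)
print(input_batch.text.shape)  # e.g. (num_instances, max_length) for a padded 'text' input field
print(target_batch.label)      # numericalized values of the 'label' target field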

field(name)

Returns the Field in a dataset with a given name, if it exists.

Returns

The referenced Field or None if no Field with the given name exists.

Return type

Field or None

filter(predicate, inplace=False)[source]

Filters the examples in this dataset using the given predicate.

Parameters
  • predicate (callable) – A callable that accepts an example and returns True if the example should be kept, and False if it should be filtered out.

  • inplace (bool, default False) – If True, the operation is performed in place and None is returned.

filtered(predicate)[source]

Filters examples with given predicate and returns a new DatasetBase instance containing those examples.

Parameters

predicate (callable) – A callable that accepts an example and returns True if the example should be kept, and False if it should be filtered out.

Returns

A new DatasetBase instance containing only the Examples for which predicate returned True.

Return type

DatasetBase
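A sketch of both methods. The predicate below assumes a label field whose raw value is stored as the first element of the (raw, tokenized) pair, and a hypothetical "positive" label value:

keep_positive = lambda example: example["label"][0] == "positive"

dataset.filter(keep_positive, inplace=True)  # filter this dataset in place
positives = dataset.filtered(keep_positive)  # or build a new dataset, leaving this one untouched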

finalize_fields(*datasets)

Builds vocabularies for all non-eager fields in the dataset from the Dataset objects given as *datasets, and then finalizes all the fields.

Parameters

*datasets – A variable number of DatasetBase objects from which to build the vocabularies for non-eager fields. If none provided, the vocabularies are built from this Dataset (self).
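A typical usage sketch: build the vocabularies on the training split, then reuse the finalized fields for the other splits. Here train and valid are assumed to be existing Dataset instances that share the same Field objects:

train.finalize_fields()              # vocabularies are built from train (self) only
# or, explicitly list the datasets to build the vocabularies from:
train.finalize_fields(train, valid)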

static from_dataset(dataset)[source]

Creates a Dataset instance from a podium.datasets.DatasetBase instance.

Parameters

dataset (DatasetBase) – DatasetBase instance to be used to create the Dataset.

Returns

Dataset instance created from the passed DatasetBase instance.

Return type

Dataset

classmethod from_pandas(df, fields, index_field=None)[source]

Creates a Dataset instance from a pandas Dataframe.

Parameters
  • df (pandas.Dataframe) – Pandas dataframe from which data will be taken.

  • fields (Union[Dict[str, Field], List[Field]]) –

    A mapping from dataframe columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the dataframe).

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored.

  • index_field (Optional[Field]) – Field which will be used to process the index column of the Dataframe. If None, the index column will be ignored.

Returns

Dataset containing data from the Dataframe

Return type

Dataset
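A minimal sketch (the column names, label values and Field configuration are assumptions):

import pandas as pd

from podium import Dataset, Field, LabelField, Vocab

df = pd.DataFrame({"text": ["a good movie", "a bad movie"], "label": ["positive", "negative"]})
fields = {"text": Field("text", numericalizer=Vocab()), "label": LabelField("label")}
dataset = Dataset.from_pandas(df, fields)
dataset.finalize_fields()  # build the vocabulary before numericalizing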

get(i, deep_copy=False)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples selected by the index will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See the wiki for detailed examples.

Example:

# Indexing by a single integer returns a single example
example = dataset.get(1)

# Same as the first example, but returns a deep_copy of the example
example_copy = dataset.get(1, deep_copy=True)

# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)

new_dataset_copy = dataset.get(s, deep_copy=True)
Parameters
  • i (int or slice or iterable) – Index used to index examples.

  • deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

numericalize_examples()[source]

Generates and caches numericalized data for every example in the dataset.

Call this method before using the dataset to precompute and cache numericalized values, avoiding lazy numericalization during iteration. After calling this method, the speed of the first epoch will be consistent with that of subsequent epochs.

shuffle_examples(random_state=None)[source]

Shuffles the examples in this dataset in-place.

Parameters

random_state (int) – The random seed used for shuffling.

shuffled()

Creates a new DatasetBase instance containing all Examples, but in shuffled order.

Returns

A new DatasetBase instance containing all Examples, but in shuffled order.

Return type

DatasetBase

sorted(key, reverse=False)

Creates a new DatasetBase instance in which all Examples are sorted according to the value returned by key.

Parameters
  • key (callable) – specifies a function of one argument that is used to extract a comparison key from each Example.

  • reverse (bool) – If set to True, then the list elements are sorted as if each comparison were reversed.

Returns

A new DatasetBase instance with sorted Examples.

Return type

DatasetBase
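A sketch of shuffled and sorted, assuming a text field whose tokenized data is stored as the second element of the (raw, tokenized) pair:

shuffled_ds = dataset.shuffled()  # new dataset, same examples in random order
sorted_ds = dataset.sorted(key=lambda example: len(example["text"][1]))  # sort by token count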

split(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)[source]

Creates train-(validation)-test splits from this dataset.

The splits are new Dataset objects, each containing a part of this one’s examples.

Parameters
  • split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in either of the splits being empty (having zero elements). Default is 0.7 (for the train set).

  • stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.

  • strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.

  • random_state (int) – The random seed used for shuffling.

Returns

Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.

Return type

tuple[Dataset]

Raises

ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but wrong in the sense that it would result in at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
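A usage sketch (the label field used for stratification is an assumption):

train, test = dataset.split(split_ratio=0.8)                              # 80/20 train-test split
train, valid, test = dataset.split(split_ratio=[0.7, 0.1, 0.2])           # train-valid-test split
train, test = dataset.split(stratified=True, strata_field_name="label")   # stratified 70/30 split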

to_pandas(include_raw=False)

Creates a pandas dataframe containing all data from this Dataset. By default, only processed data is kept in the dataframe.

If include_raw is True, raw data will also be stored under the column name {field name}_raw, e.g. for a field called ‘Text’, the raw data column name would be Text_raw.

When making pandas dataframes from big DiskBackedDatasets, care should be taken to avoid excessive memory usage, as the whole dataset is loaded into memory.

Parameters

include_raw (bool) – Whether to include raw data in the dataframe.

Returns

Pandas dataframe containing all examples from this Dataset.

Return type

DataFrame
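A small sketch (the field name Text follows the example above):

df = dataset.to_pandas(include_raw=True)
print(df.columns)  # includes 'Text' and 'Text_raw' columns for a field named 'Text'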

property examples

List containing all Examples.

property field_dict

Dictionary mapping the Dataset’s field names to the respective Fields.

property fields

List containing all fields of this dataset.

TabularDataset

class podium.datasets.TabularDataset(path, fields, format='csv', line2example=None, skip_header=False, csv_reader_params={}, **kwargs)[source]

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.

Creates a TabularDataset from a file containing the data rows and an object containing all the fields that we are interested in.

Parameters
  • path (str) – Path to the data file.

  • fields ((list | dict)) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, “JSON” (case-insensitive). Ignored if line2example is not None. Defaults to “CSV”.

  • line2example (callable) – The function mapping from a file line to Fields. In case your dataset is not in one of the standardized formats, you can provide a function which performs a custom split for each input line.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Raises

ValueError – If the format given is not one of “CSV”, “TSV” or “JSON” and line2example is not set. If fields given as a dict and skip_header is True. If format is “JSON” and skip_header is True.
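A minimal construction sketch (the file name, column names and Field configuration are assumptions):

from podium import Field, LabelField, Vocab
from podium.datasets import TabularDataset

fields = {
    "text": Field("text", numericalizer=Vocab()),  # the 'text' column, numericalized with a Vocab
    "label": LabelField("label"),                  # the 'label' column as the target
}
dataset = TabularDataset("my_dataset.csv", fields=fields, format="csv")
dataset.finalize_fields()  # build the vocabularies before numericalizing or batching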

DiskBackedDataset

class podium.datasets.arrow.DiskBackedDataset(table, fields, cache_path, mmapped_file, data_types=None)[source]

Podium dataset implementation which uses PyArrow as its data storage backend.

Examples are stored in a file which is then memory mapped for fast random access.

Creates a new DiskBackedDataset instance. Users should use static constructor functions like ‘from_dataset’ to construct new DiskBackedDataset instances.

Parameters
  • table (pyarrow.Table) – Table object that contains example data.

  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the examples in ‘table’.

  • cache_path (Optional[str]) – Path to the directory where the cache file is saved.

  • mmapped_file (pyarrow.MemoryMappedFile) – Open MemoryMappedFile descriptor of the cache file.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

__getitem__(item)[source]

Returns an example or a new DiskBackedDataset containing the indexed examples. If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples selected by the index will be collected and a new dataset containing only those examples will be returned.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]                    # Indexing by a single integer returns a single example

new_dataset = dataset[1:10:2]           # Multi-indexing returns a new dataset containing
                                        # the indexed examples.

new_dataset_2 = dataset[(1, 5, 6, 9)]   # Indexing with an iterable also returns a new dataset
                                        # containing the indexed examples.

Parameters

item (int or slice or iterable) – Index used to index examples.

Returns

If item is an int, a single example will be returned. If item is a slice, list or tuple, a copy of this dataset containing only the indexed examples will be returned.

Return type

Example or Dataset

__iter__()[source]

Iterates over Examples in this dataset.

Returns

Iterator over all the examples in this dataset.

Return type

Iterator[Example]

__len__()[source]

Returns the number of Examples in this Dataset.

Returns

The number of Examples in this Dataset.

Return type

int

close()[source]

Closes resources held by the DiskBackedDataset.

Only closes the cache file handle. The cache will not be deleted from disk. For cache deletion, use delete_cache.

delete_cache()[source]

Deletes the cache directory and all cache files belonging to this dataset.

After this call is executed, any DiskBackedDataset created by slicing/indexing this dataset and any view over this dataset will not be usable any more. Any dataset created from this dataset should be dumped to a new directory before calling this method.

dump_cache(cache_path=None)[source]

Saves this dataset at cache_path. Dumped datasets can be loaded with the DiskBackedDataset.load_cache static method. All fields contained in this dataset must be serializable using pickle.

Parameters

cache_path (Optional[str]) –

Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

If None, a temporary directory will be created.

Returns

The chosen cache directory path. Useful when cache_path is None and a temporary directory is created.

Return type

str

static from_dataset(dataset, cache_path=None, data_types=None)[source]

Creates a DiskBackedDataset instance from a podium.datasets.DatasetBase instance.

Parameters
  • dataset (DatasetBase) – DatasetBase instance to be used to create the DiskBackedDataset.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

Returns

DiskBackedDataset instance created from the passed DatasetBase instance.

Return type

DiskBackedDataset
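A conversion sketch (dataset is assumed to be an existing in-memory DatasetBase instance):

from podium.datasets.arrow import DiskBackedDataset

disk_dataset = DiskBackedDataset.from_dataset(dataset)  # cache stored in a temporary directory
# or keep the cache in a directory of your choosing:
disk_dataset = DiskBackedDataset.from_dataset(dataset, cache_path="./dataset_cache")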

static from_examples(fields, examples, cache_path=None, data_types=None, chunk_size=1024)[source]

Creates a DiskBackedDataset from the provided Examples.

Parameters
  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the Examples.

  • examples (Iterable[Example]) – Iterable of examples.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

Returns

DiskBackedDataset instance created from the passed Examples.

Return type

DiskBackedDataset

classmethod from_pandas(df, fields, index_field=None, cache_path=None, data_types=None, chunk_size=1024)[source]

Creates a DiskBackedDataset instance from a pandas Dataframe.

Parameters
  • df (pandas.Dataframe) – Pandas dataframe from which data will be taken.

  • fields (Union[Dict[str, Field], List[Field]]) –

    A mapping from dataframe columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the dataframe).

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored.

  • index_field (Optional[Field]) – Field which will be used to process the index column of the Dataframe. If None, the index column will be ignored.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

Returns

Dataset containing data from the Dataframe

Return type

Dataset

static from_tabular_file(path, format, fields, cache_path=None, data_types=None, chunk_size=10000, skip_header=False, delimiter=None, csv_reader_params=None)[source]

Loads a tabular file format (csv, tsv, json) as a DiskBackedDataset.

Parameters
  • path (str) – Path to the data file.

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, or “JSON” (case-insensitive).

  • fields (Union[Dict[str, Field], List[Field]]) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (Dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Returns

DiskBackedDataset instance containing the examples from the tabular file.

Return type

DiskBackedDataset

static load_cache(cache_path)[source]

Loads a cached DiskBackedDataset contained in the cache_path directory. Fields will be loaded into memory but the Example data will be memory mapped avoiding unnecessary memory usage.

Parameters

cache_path (Optional[str]) –

Path to the directory containing a previously dumped DiskBackedDataset cache.

Returns

the DiskBackedDataset loaded from the passed cache directory.

Return type

DiskBackedDataset
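A dump/load round-trip sketch (the cache directory path is an assumption):

cache_dir = disk_dataset.dump_cache(cache_path="./dataset_cache")  # persist the dataset to disk
restored = DiskBackedDataset.load_cache(cache_dir)                 # later: memory-map it back
restored.delete_cache()                                            # remove the cache when no longer needed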

HFDatasetConverter

class podium.datasets.hf.HFDatasetConverter(hf_dataset, fields=None)[source]

Class for converting rows from the HuggingFace Datasets to podium.Examples.

HFDatasetConverter constructor.

Parameters
  • hf_dataset (datasets.Dataset) – HuggingFace Dataset.

  • fields (dict(str, podium.Field)) – Dictionary that maps a column name of the dataset to a podium.Field. If None, the default feature conversion rules will be used to build a dictionary from the dataset features.

Raises

TypeError – If dataset is not an instance of datasets.Dataset.

__iter__()[source]

Iterate through the dataset and convert the examples.

as_dataset()[source]

Convert the original HuggingFace dataset to a podium.Dataset.

Returns

podium.Dataset instance.

Return type

podium.Dataset
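A conversion sketch using the HuggingFace datasets library (the imdb dataset name is only an example):

import datasets

from podium.datasets.hf import HFDatasetConverter

hf_dataset = datasets.load_dataset("imdb", split="train")
converter = HFDatasetConverter(hf_dataset)   # fields=None, so default conversion rules are used
podium_dataset = converter.as_dataset()
podium_dataset.finalize_fields()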

static from_dataset_dict(dataset_dict, fields=None, cast_to_podium=False)[source]

Copies the keys of the given dictionary and converts the corresponding HuggingFace Datasets to HFDatasetConverter instances (or to podium.Dataset instances if cast_to_podium is True).

Parameters
  • dataset_dict (dict(str, datasets.Dataset)) – Dictionary that maps dataset names to HuggingFace Datasets.

  • cast_to_podium (bool) – Determines whether to immediately convert the HuggingFace dataset to Podium dataset (if True), or shallowly wrap the HuggingFace dataset in the HFDatasetConverter class. The HFDatasetConverter class currently doesn’t support full Podium functionality and will not work with other components in the library.

Returns

Dictionary that maps dataset names to HFDatasetConverter instances.

Return type

dict(str, Union[HFDatasetConverter, podium.Dataset])

Raises

TypeError – If the given argument is not a dictionary.

CoNLLUDataset

class podium.datasets.CoNLLUDataset(file_path, fields=None)[source]

A CoNLL-U dataset class.

This class uses all default CoNLL-U fields.

Dataset constructor.

Parameters
  • file_path (str) – Path to the file containing the dataset.

  • fields (Dict[str, Field]) – Dictionary that maps the CoNLL-U field name to the field. If None, the default set of fields will be used.

static get_default_fields()[source]

Method returns a dict of default CoNLL-U fields.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

Dict[str, Field]
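A minimal loading sketch (the file path is an assumption):

from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("path/to/data.conllu")  # uses the default CoNLL-U fields
dataset.finalize_fields()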

Built-in datasets (EN)

Stanford Sentiment Treebank

class podium.datasets.impl.SST(file_path, fields, fine_grained=False, subtrees=False)[source]

The Stanford sentiment treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. User should use static method get_dataset_splits rather than using the constructor directly.

Parameters
  • file_path (str) – path to the file containing the dataset

  • fields (dict(str, Field)) – dictionary that maps field name to the field

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. In that case, please set your Fields not to be eager.

  • subtrees (bool) – If True, the subtrees of each input instance are also returned as separate instances. This causes the dataset to become much larger.

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)[source]

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – Dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. In that case, please set your Fields not to be eager.

  • subtrees (bool) – If True, the subtrees of each input instance are also returned as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
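A loading sketch using the default fields:

from podium.datasets.impl import SST

sst_train, sst_valid, sst_test = SST.get_dataset_splits()
print(len(sst_train), len(sst_valid), len(sst_test))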

static get_default_fields()[source]

Method returns default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Internet Movie DataBase

class podium.datasets.impl.IMDB(dir_path, fields)[source]

A simple IMDB dataset containing only the supervised (labeled) data, using the raw, unprocessed reviews.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. User should use static method get_dataset_splits rather than using the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

static get_dataset_splits(fields=None)[source]

Method creates train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – Dictionary mapping field names to fields. If not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
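A loading sketch using the default fields:

from podium.datasets.impl import IMDB

imdb_train, imdb_test = IMDB.get_dataset_splits()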

static get_default_fields()[source]

Method returns default Imdb fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Stanford Natural Language Inference

class podium.datasets.impl.SNLI(file_path, fields)[source]

A SNLI dataset class. Unlike SNLISimple, this class includes all the fields included in the SNLI dataset by default.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

ANNOTATOR_LABELS_FIELD_NAME

Name of the field containing annotator labels

Type

str

CAPTION_ID_FIELD_NAME

Name of the field containing caption ID

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

PAIR_ID_FIELD_NAME

Name of the field containing pair ID

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE1_PARSE_FIELD_NAME

Name of the field containing sentence1 parse

Type

str

SENTENCE1_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence1 binary parse

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

SENTENCE2_PARSE_FIELD_NAME

Name of the field containing sentence2 parse

Type

str

SENTENCE2_BINARY_PARSE_FIELD_NAME

Name of the field containing sentence2 binary parse

Type

str

Dataset constructor. This method should not be used directly, get_train_test_dev_dataset should be used instead.

Parameters
  • file_path (str) – Path to the .jsonl file containing the dataset.

  • fields (dict(str, Field)) – A dictionary that maps field names to Field objects.

static get_default_fields()[source]

Method returns all SNLI fields in the following order: annotator_labels, captionID, gold_label, pairID, sentence1, sentence1_parse, sentence1_binary_parse, sentence2, sentence2_parse, sentence2_binary_parse.

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

Notes

This dataset includes both parses (the regular and the binary parse) for every sentence; if not all fields are needed, consider using SNLISimple instead.

static get_train_test_dev_dataset(fields=None)[source]

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
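A loading sketch using the default fields (the snli_1.0 data is downloaded automatically if it is not present in the working directory):

from podium.datasets.impl import SNLI

snli_train, snli_test, snli_dev = SNLI.get_train_test_dev_dataset()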

Cornell Movie Dialogs

class podium.datasets.impl.CornellMovieDialogs(data, fields=None)[source]

Cornell Movie Dialogs dataset which contains sentences and replies from movies.

Dataset constructor.

Parameters
  • data (CornellMovieDialogsNamedTuple) – Cornell Movie Dialogs data

  • fields (dict(str, Field)) – dictionary that maps field names to fields

Raises

ValueError – If given data is None.

static get_default_fields()[source]

Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)