Dataset classes

Dataset

class podium.datasets.Dataset(examples, fields, sort_key=None)[source]

A general-purpose container for datasets. A dataset is a shallow wrapper around a list of Example instances, which store the instance data, and the corresponding Field objects, which process the columns of each example.

examples

A list containing the instances of the dataset as Example classes.

Type

list

fields

A list of Field objects defining preprocessing for data fields of the dataset.

Type

list

Creates a dataset with the given examples and their fields.

Parameters
  • examples (list) – A list of examples.

  • fields (list) – A list of fields that the examples have been created with.

  • sort_key (callable) – A key to use for sorting dataset examples, used for batching together examples with similar lengths to minimize padding.
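
Usage example (a minimal sketch of constructing a Dataset from in-memory data; the Field, LabelField, Vocab and ExampleFactory usage follows the Podium quickstart rather than this section, and the column names and raw rows are made up for illustration):

from podium import Field, LabelField, Vocab
from podium.datasets import Dataset, ExampleFactory

text = Field("text", numericalizer=Vocab())   # tokenized text column
label = LabelField("label")                   # target column
fields = [text, label]

# Build Example instances from raw rows (positional mapping to the fields list)
factory = ExampleFactory(fields)
raw_rows = [("A very good movie.", "positive"), ("Boring plot.", "negative")]
examples = [factory.from_list(row) for row in raw_rows]

dataset = Dataset(examples, fields)
dataset.finalize_fields()   # build the vocabularies over the dataset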

__getitem__(i)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See wiki for detailed examples.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]         # Indexing by a single integer returns a single example

new_dataset = dataset[1:10]  # Multi-indexing returns a new dataset containing
                             # the indexed examples

Parameters

i (int or slice or iterable) – Index used to index examples.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

__getstate__()[source]

Returns the dataset state. Used when pickling the dataset to a file.

Returns

state – dataset state dictionary

Return type

dict

__iter__()[source]

Iterates over all examples in the dataset in order.

Yields

example – Yields examples in the dataset.

__len__()[source]

Returns the number of examples in the dataset.

Returns

The number of examples in the dataset.

Return type

int

__setstate__(state)[source]

Sets the dataset state. Used when unpickling the dataset from a file.

Parameters

state (dict) – dataset state dictionary

filter(predicate, inplace=False)[source]

Filters examples using the given predicate.

Parameters
  • predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, or False if it should be filtered out.

  • inplace (bool, default False) – If True, perform the filtering in place and return None.

filtered(predicate)[source]

Filters examples using the given predicate and returns a new DatasetBase instance containing those examples.

Parameters

predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, or False if it should be filtered out.

Returns

A new DatasetBase instance containing only the Examples for which predicate returned True.

Return type

DatasetBase
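
For example, a sketch of keeping only short examples (the field name "text" and the (raw, tokenized) structure of the stored values are assumptions made for illustration, not part of this reference):

# Keep only examples whose tokenized "text" has at most 100 tokens
def is_short(example):
    raw, tokenized = example["text"]
    return len(tokenized) <= 100

short_dataset = dataset.filtered(is_short)   # returns a new dataset
dataset.filter(is_short, inplace=True)       # or filter this dataset in place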

static from_dataset(dataset)[source]

Creates a Dataset instance from a podium.datasets.DatasetBase instance.

Parameters

dataset (DatasetBase) – DatasetBase instance to be used to create the Dataset.

Returns

Dataset instance created from the passed DatasetBase instance.

Return type

Dataset

get(i, deep_copy=False)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See wiki for detailed examples.

Example:

# Indexing by a single integer returns a single example
example = dataset.get(1)

# Same as the first example, but returns a deep_copy of the example
example_copy = dataset.get(1, deep_copy=True)

# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)

new_dataset_copy = dataset.get(s, deep_copy=True)
Parameters
  • i (int or slice or iterable) – Index used to index examples.

  • deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

numericalize_examples()[source]

Generates and caches numericalized data for every example in the dataset.

Call before using the dataset to avoid lazy numericalization during iteration.

shuffle_examples(random_state=None)[source]

Shuffles the examples in this dataset.

Parameters

random_state (int) – The random seed used for shuffling.

split(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)[source]

Creates train-(validation)-test splits from this dataset.

The splits are new Dataset objects, each containing a part of this one’s examples.

Parameters
  • split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in any of the splits being empty (having zero elements). Default is 0.7 (for the train set).

  • stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.

  • strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.

  • random_state (int) – The random seed used for shuffling.

Returns

Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.

Return type

tuple[Dataset]

Raises

ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but wrong in the sense that it would result with at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
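
Usage example (a sketch; dataset is assumed to be an already constructed Dataset, and "label" is a hypothetical target field name):

# 80/20 train-test split
train, test = dataset.split(split_ratio=0.8, random_state=42)

# 70/10/20 train-validation-test split, stratified over the "label" field
train, valid, test = dataset.split(
    split_ratio=(0.7, 0.1, 0.2),
    stratified=True,
    strata_field_name="label",
    random_state=42,
)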

TabularDataset

class podium.datasets.TabularDataset(path, fields, format='csv', line2example=None, skip_header=False, csv_reader_params={}, **kwargs)[source]

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.

Creates a TabularDataset from a file containing the data rows and an object containing all the fields that we are interested in.

Parameters
  • path (str) – Path to the data file.

  • fields ((list | dict)) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, “JSON” (case-insensitive). Ignored if line2example is not None. Defaults to “CSV”.

  • line2example (callable) – The function mapping from a file line to Fields. In case your dataset is not in one of the standardized formats, you can provide a function which performs a custom split for each input line.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Raises

ValueError – If the format given is not one of “CSV”, “TSV” or “JSON” and line2example is not set. If fields is given as a dict and skip_header is True. If format is “JSON” and skip_header is True.
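
Usage example (a sketch; the file path and the column names "text" and "label" are assumptions, and the Field/Vocab setup follows the Podium quickstart):

from podium import Field, LabelField, Vocab
from podium.datasets import TabularDataset

text = Field("text", numericalizer=Vocab())
label = LabelField("label")

# With a dict, keys refer to column names, so the TSV file must have a header row
dataset = TabularDataset(
    "data/reviews.tsv",
    format="tsv",
    fields={"text": text, "label": label},
)
dataset.finalize_fields()   # build the vocabularies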

DiskBackedDataset

class podium.datasets.arrow.DiskBackedDataset(table, fields, cache_path, mmapped_file, data_types=None)[source]

Podium dataset implementation which uses PyArrow as its data storage backend.

Examples are stored in a file which is then memory mapped for fast random access.

Creates a new DiskBackedDataset instance. Users should use static constructor functions like ‘from_dataset’ to construct new DiskBackedDataset instances.

Parameters
  • table (pyarrow.Table) – Table object that contains example data.

  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the examples in ‘table’.

  • cache_path (Optional[str]) – Path to the directory where the cache file is saved.

  • mmapped_file (pyarrow.MemoryMappedFile) – Open MemoryMappedFile descriptor of the cache file.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

__getitem__(item)[source]

Returns an example or a new DiskBackedDataset containing the indexed examples. If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]                   # Indexing by a single integer returns a single example

new_dataset = dataset[1:10:2]          # Multi-indexing returns a new dataset containing
                                       # the indexed examples

new_dataset_2 = dataset[(1, 5, 6, 9)]  # Returns a new dataset containing the
                                       # indexed Examples

Parameters

item (int or slice or iterable) – Index used to index examples.

Returns

If item is an int, a single example will be returned. If item is a slice, list or tuple, a copy of this dataset containing only the indexed examples will be returned.

Return type

Example or Dataset

__iter__()[source]

Iterates over Examples in this dataset.

Returns

Iterator over all the examples in this dataset.

Return type

Iterator[Example]

__len__()[source]

Returns the number of Examples in this Dataset.

Returns

The number of Examples in this Dataset.

Return type

int

close()[source]

Closes resources held by the DiskBackedDataset.

Only closes the cache file handle. The cache will not be deleted from disk. For cache deletion, use delete_cache.

delete_cache()[source]

Deletes the cache directory and all cache files belonging to this dataset.

After this call is executed, any DiskBackedDataset created by slicing/indexing this dataset and any view over this dataset will not be usable any more. Any dataset created from this dataset should be dumped to a new directory before calling this method.

dump_cache(cache_path=None)[source]

Saves this dataset at cache_path. Dumped datasets can be loaded with the DiskBackedDataset.load_cache static method. All fields contained in this dataset must be serializable using pickle.

Parameters

cache_path (Optional[str]) –

Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

If None, a temporary directory will be created.

Returns

The chosen cache directory path. Useful when cache_path is None and a temporary directory is created.

Return type

str

static from_dataset(dataset, cache_path=None, data_types=None)[source]

Creates a DiskBackedDataset instance from a podium.datasets.DatasetBase instance.

Parameters
  • dataset (DatasetBase) – DatasetBase instance to be used to create the DiskBackedDataset.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

Returns

DiskBackedDataset instance created from the passed DatasetBase instance.

Return type

DiskBackedDataset
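
For example, a sketch of converting an in-memory dataset to a disk-backed one and persisting its cache (dataset is assumed to be an existing DatasetBase instance):

from podium.datasets.arrow import DiskBackedDataset

disk_dataset = DiskBackedDataset.from_dataset(dataset)
cache_dir = disk_dataset.dump_cache()   # persisted to a temporary directory
disk_dataset.close()

reloaded = DiskBackedDataset.load_cache(cache_dir)
print(len(reloaded))
reloaded.delete_cache()                 # remove the cache from disk when done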

static from_examples(fields, examples, cache_path=None, data_types=None, chunk_size=1024)[source]

Creates a DiskBackedDataset from the provided Examples.

Parameters
  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the Examples.

  • examples (Iterable[Example]) – Iterable of examples.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

Returns

DiskBackedDataset instance created from the passed Examples.

Return type

DiskBackedDataset

static from_tabular_file(path, format, fields, cache_path=None, data_types=None, chunk_size=10000, skip_header=False, delimiter=None, csv_reader_params=None)[source]

Loads a tabular file format (csv, tsv, json) as a DiskBackedDataset.

Parameters
  • path (str) – Path to the data file.

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, or “JSON” (case-insensitive).

  • fields (Union[Dict[str, Field], List[Field]]) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (Dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Returns

DiskBackedDataset instance containing the examples from the tabular file.

Return type

DiskBackedDataset
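
Usage example (a sketch; the file path and column names are assumptions):

from podium import Field, LabelField, Vocab
from podium.datasets.arrow import DiskBackedDataset

text = Field("text", numericalizer=Vocab())
label = LabelField("label")

dataset = DiskBackedDataset.from_tabular_file(
    path="data/reviews.csv",
    format="csv",
    fields={"text": text, "label": label},
)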

static load_cache(cache_path)[source]

Loads a cached DiskBackedDataset contained in the cache_path directory. Fields will be loaded into memory, but the Example data will be memory-mapped, avoiding unnecessary memory usage.

Parameters

cache_path (Optional[str]) –

Path to the directory containing the cache files of a previously saved DiskBackedDataset, e.g. the directory returned by dump_cache.

Returns

the DiskBackedDataset loaded from the passed cache directory.

Return type

DiskBackedDataset

HuggingFaceDatasetConverter

class podium.datasets.hf.HFDatasetConverter(hf_dataset, fields=None)[source]

Class for converting rows from the HuggingFace Datasets to podium.Examples.

HFDatasetConverter constructor.

Parameters
  • hf_dataset (datasets.Dataset) – HuggingFace Dataset.

  • fields (dict(str, podium.Field)) – Dictionary that maps a column name of the dataset to a podium.Field. If None, the default feature conversion rules will be used to build a dictionary from the dataset features.

Raises

TypeError – If dataset is not an instance of datasets.Dataset.

__iter__()[source]

Iterate through the dataset and convert the examples.

as_dataset()[source]

Convert the original HuggingFace dataset to a podium.Dataset.

Returns

podium.Dataset instance.

Return type

podium.Dataset
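
Usage example (a sketch; assumes the HuggingFace datasets package is installed and uses its public "imdb" dataset purely as an illustration):

import datasets
from podium.datasets.hf import HFDatasetConverter

hf_imdb = datasets.load_dataset("imdb", split="train")
converter = HFDatasetConverter(hf_imdb)   # default feature-to-Field conversion
podium_imdb = converter.as_dataset()
podium_imdb.finalize_fields()             # build vocabularies before numericalization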

static from_dataset_dict(dataset_dict, fields=None, cast_to_podium=False)[source]

Copies the keys of the given dictionary and converts the corresponding HuggingFace Datasets to HFDatasetConverter instances.

Parameters
  • dataset_dict (dict(str, datasets.Dataset)) – Dictionary that maps dataset names to HuggingFace Datasets.

  • cast_to_podium (bool) – Determines whether to immediately convert the HuggingFace dataset to Podium dataset (if True), or shallowly wrap the HuggingFace dataset in the HFDatasetConverter class. The HFDatasetConverter class currently doesn’t support full Podium functionality and will not work with other components in the library.

Returns

Dictionary that maps dataset names to HFDatasetConverter instances.

Return type

dict(str, Union[HFDatasetConverter, podium.Dataset])

Raises

TypeError – If the given argument is not a dictionary.
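
For example, a sketch converting every split of a HuggingFace DatasetDict at once (the "imdb" dataset name is only an illustration):

import datasets
from podium.datasets.hf import HFDatasetConverter

dataset_dict = datasets.load_dataset("imdb")   # maps split names to datasets.Dataset
splits = HFDatasetConverter.from_dataset_dict(dataset_dict, cast_to_podium=True)
train, test = splits["train"], splits["test"]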

CoNLLUDataset

class podium.datasets.CoNLLUDataset(file_path, fields=None)[source]

A CoNLL-U dataset class.

This class uses all default CoNLL-U fields.

Dataset constructor.

Parameters
  • file_path (str) – Path to the file containing the dataset.

  • fields (Dict[str, Field]) – Dictionary that maps the CoNLL-U field name to the field. If None, the default set of fields will be used.
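
Usage example (a sketch; the file path is an assumption):

from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("data/train.conllu")   # uses the default CoNLL-U fields
dataset.finalize_fields()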

static get_default_fields()[source]

Method returns a dict of default CoNLL-U fields.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

Dict[str, Field]

Built-in datasets (EN)

Stanford Sentiment Treebank

class podium.datasets.impl.SST(file_path, fields, fine_grained=False, subtrees=False)[source]

The Stanford sentiment treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. User should use static method get_dataset_splits rather than using the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. When using the binary version, set your Fields not to be eager.

  • subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)[source]

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. When using the binary version, set your Fields not to be eager.

  • subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
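
Usage example (a sketch):

from podium.datasets.impl import SST

sst_train, sst_valid, sst_test = SST.get_dataset_splits()
print(len(sst_train), sst_train[0])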

static get_default_fields()[source]

Method returns the default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Internet Movie DataBase

class podium.datasets.impl.IMDB(dir_path, fields)[source]

Simple IMDB dataset containing only the supervised data, using the unprocessed reviews.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. Users should use the static method get_dataset_splits rather than the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

static get_dataset_splits(fields=None)[source]

Method creates the train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
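
Usage example (a sketch):

from podium.datasets.impl import IMDB

imdb_train, imdb_test = IMDB.get_dataset_splits()
print(len(imdb_train), imdb_train[0])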

static get_default_fields()[source]

Method returns the default IMDB fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Stanford Natural Language Inference

class podium.datasets.impl.SNLISimple(file_path, fields)[source]

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

Dataset constructor. This method should not be used directly; use get_train_test_dev_dataset instead.

Parameters
  • file_path (str) – Path to the .jsonl file containing the dataset.

  • fields (dict(str, Field)) – A dictionary that maps field names to Field objects.

static get_default_fields()[source]

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2.

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

static get_train_test_dev_dataset(fields=None)[source]

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
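
Usage example (a sketch):

from podium.datasets.impl import SNLISimple

snli_train, snli_test, snli_dev = SNLISimple.get_train_test_dev_dataset()
print(len(snli_train), snli_train[0])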

Eurovoc Dataset

class podium.datasets.impl.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)[source]

EuroVoc dataset class that contains labeled documents and the label hierarchy.

Dataset constructor.

Parameters
  • eurovoc_labels (dict(int : Label)) – dictionary mapping eurovoc label_ids to labels

  • crovoc_labels (dict(int : Label)) – dictionary mapping crovoc label_ids to labels

  • documents (list(Document)) – list of all documents in dataset

  • mappings (dict(int : list(int))) – dictionary that maps documents_ids to list of their label_ids

  • fields (dict(str : Field)) – dictionary that maps field name to the field

get_all_ancestors(label_id)[source]

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()[source]

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int : Label)

static get_default_fields()[source]

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)[source]

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()[source]

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int : Label)

is_ancestor(label_id, example)[source]

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if the label is an ancestor of any of the example’s labels, False otherwise

Return type

boolean

Cornell Movie Dialogs Dataset

class podium.datasets.impl.CornellMovieDialogsConversationalDataset(data, fields=None)[source]

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

Dataset constructor.

Parameters
  • data (CornellMovieDialogsNamedTuple) – cornell movie dialogs data

  • fields (dict(str : Field)) – dictionary that maps field name to the field

Raises

ValueError – If given data is None.

static get_default_fields()[source]

Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Built-in datasets (HR)

Pauza Reviews Dataset

class podium.datasets.impl.PauzaHRDataset(dir_path, fields)[source]

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

Dataset constructor. Users should use the static method get_train_test_dataset rather than the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

static get_default_fields()[source]

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)[source]

Method creates the train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
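
Usage example (a sketch):

from podium.datasets.impl import PauzaHRDataset

pauza_train, pauza_test = PauzaHRDataset.get_train_test_dataset()
print(len(pauza_train), pauza_train[0])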

Named Entity Recognition dataset

class podium.datasets.impl.CroatianNERDataset(tokenized_documents, fields)[source]

Croatian NER dataset.

A single example in the dataset represents a single sentence in the input data.

Dataset constructor. Users should use the static method get_dataset rather than invoking the constructor directly.

Parameters
  • tokenized_documents (list(list(str, str))) – List of tokenized documents. Each document is represented as a list of tuples (token, label). The sentences in a document are delimited by the tuple (None, None).

  • fields (list(Field)) – Dictionary that maps field name to the field

classmethod get_dataset(tokenizer='split', tag_schema='IOB', fields=None, **kwargs)[source]

Method downloads (if necessary) and loads the dataset.

Parameters
  • tokenizer (str | callable) – Word-level tokenizer used to tokenize the input text

  • tag_schema (str) –

    Tag schema used for constructing the token labels

    supported tag schemas:

    ’IOB’: the label of the beginning token of the entity is prefixed with ‘B-‘, the remaining tokens that belong to the same entity are prefixed with ‘I-‘. The tokens that don’t belong to any named entity are labeled ‘O’

  • fields (dict(str, Field)) – dictionary mapping field names to fields. If set to None, the default fields are used.

  • **kwargs

    SCPLargeResource.SCP_USER_KEY:

    User on the host machine. Not required if the user on the local machine matches the user on the host machine.

    SCPLargeResource.SCP_PRIVATE_KEY:

    Path to the ssh private key eligible to access the host machine. Not required on Unix if the private key is in the default location.

    SCPLargeResource.SCP_PASS_KEY:

    Password for the ssh private key (optional). Can be omitted if the private key is not encrypted.

Returns

The loaded dataset.

Return type

CroatianNERDataset
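
Usage example (a sketch; depending on where the data is hosted, SCP credentials may need to be supplied through the keyword arguments listed above):

from podium.datasets.impl import CroatianNERDataset

ner_dataset = CroatianNERDataset.get_dataset(tokenizer="split", tag_schema="IOB")
print(len(ner_dataset))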

static get_default_fields()[source]

Method returns default Croatian NER dataset fields.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Catacx Datasets

class podium.datasets.impl.CatacxDataset(dir_path, fields=None)[source]

Catacx dataset.

Dataset constructor. Should be given the path to the .json file containing the Catacx dataset.

Parameters
  • dir_path (str) – path to the file containing the dataset.

  • fields (dict(str, Field)) – Dictionary that maps field name to the field. If None, the default set of fields will be used.

static get_dataset(fields=None)[source]

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – Dictionary that maps field name to the field. If None, the default set of fields will be used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()[source]

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)