Dataset classes

Dataset

class podium.datasets.Dataset(examples, fields, sort_key=None)[source]

A general-purpose container for datasets. A dataset is a shallow wrapper around a list of Example instances, which store the instance data, and the corresponding Field objects, which process the columns of each example.

examples

A list containing the instances of the dataset as Example classes.

Type

list

fields

A list of Field objects defining preprocessing for data fields of the dataset.

Type

list

Creates a dataset with the given examples and their fields.

Parameters
  • examples (list) – A list of examples.

  • fields (list) – A list of fields that the examples have been created with.

  • sort_key (callable) – A key to use for sorting dataset examples, used for batching together examples with similar lengths to minimize padding.
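
Usage example (a minimal sketch of constructing a Dataset from in-memory data; the Field, LabelField, Vocab and ExampleFactory usage follows the Podium quickstart rather than this section, and the column names and raw rows are made up for illustration):

from podium import Field, LabelField, Vocab
from podium.datasets import Dataset, ExampleFactory

text = Field("text", numericalizer=Vocab())   # tokenized text column
label = LabelField("label")                   # target column
fields = [text, label]

# Build Example instances from raw rows (positional mapping to the fields list)
factory = ExampleFactory(fields)
raw_rows = [("A very good movie.", "positive"), ("Boring plot.", "negative")]
examples = [factory.from_list(row) for row in raw_rows]

dataset = Dataset(examples, fields)
dataset.finalize_fields()   # build the vocabularies over the dataset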

__getitem__(i)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See wiki for detailed examples.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]         # Indexing by a single integer returns a single example

new_dataset = dataset[1:10]  # Multi-indexing returns a new dataset containing
                             # the indexed examples

Parameters

i (int or slice or iterable) – Index used to index examples.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a copy of this dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

__getstate__()[source]

Returns the dataset state. Used when pickling the dataset to a file.

Returns

state – dataset state dictionary

Return type

dict

__iter__()[source]

Iterates over all examples in the dataset in order.

Yields

example – Yields examples in the dataset.

__len__()[source]

Returns the number of examples in the dataset.

Returns

The number of examples in the dataset.

Return type

int

__setstate__(state)[source]

Sets the dataset state. Used when unpickling the dataset from a file.

Parameters

state (dict) – dataset state dictionary

filter(predicate, inplace=False)[source]

Filters examples using the given predicate.

Parameters
  • predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, or False if it should be filtered out.

  • inplace (bool, default False) – If True, perform the filtering in place and return None.

filtered(predicate)[source]

Filters examples using the given predicate and returns a new DatasetBase instance containing those examples.

Parameters

predicate (callable) – A callable that accepts an example as input and returns True if the example should be kept, or False if it should be filtered out.

Returns

A new DatasetBase instance containing only the Examples for which predicate returned True.

Return type

DatasetBase
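
For example, a sketch of keeping only short examples (the field name "text" and the (raw, tokenized) structure of the stored values are assumptions made for illustration, not part of this reference):

# Keep only examples whose tokenized "text" has at most 100 tokens
def is_short(example):
    raw, tokenized = example["text"]
    return len(tokenized) <= 100

short_dataset = dataset.filtered(is_short)   # returns a new dataset
dataset.filter(is_short, inplace=True)       # or filter this dataset in place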

static from_dataset(dataset)[source]

Creates a Dataset instance from a podium.datasets.DatasetBase instance.

Parameters

dataset (DatasetBase) – DatasetBase instance to be used to create the Dataset.

Returns

Dataset instance created from the passed DatasetBase instance.

Return type

Dataset

get(i, deep_copy=False)[source]

Returns an example or a new dataset containing the indexed examples.

If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned. The new dataset will contain copies of the old dataset’s fields and will be identical to the original dataset, except for the number and ordering of examples. See wiki for detailed examples.

Example:

# Indexing by a single integer returns a single example
example = dataset.get(1)

# Same as the first example, but returns a deep_copy of the example
example_copy = dataset.get(1, deep_copy=True)

# Multi-indexing returns a new dataset containing the indexed examples
s = slice(1, 10)
new_dataset = dataset.get(s)

new_dataset_copy = dataset.get(s, deep_copy=True)
Parameters
  • i (int or slice or iterable) – Index used to index examples.

  • deep_copy (bool) – If true, the returned dataset will contain deep-copies of this dataset’s examples and fields. If false, existing examples and fields will be reused.

Returns

If i is an int, a single example will be returned. If i is a slice or iterable, a dataset containing only the indexed examples will be returned.

Return type

single example or Dataset

numericalize_examples()[source]

Generates and caches numericalized data for every example in the dataset.

Call before using the dataset to avoid lazy numericalization during iteration.

shuffle_examples(random_state=None)[source]

Shuffles the examples in this dataset.

Parameters

random_state (int) – The random seed used for shuffling.

split(split_ratio=0.7, stratified=False, strata_field_name=None, random_state=None, shuffle=True)[source]

Creates train-(validation)-test splits from this dataset.

The splits are new Dataset objects, each containing a part of this one’s examples.

Parameters
  • split_ratio ((float | list[float] | tuple[float])) – If type is float, a number in the interval (0.0, 1.0) denoting the amount of data to be used for the train split (the rest is used for test). If type is list or tuple, it should be of length 2 (or 3) and the numbers should denote the relative sizes of train, (valid) and test splits respectively. If the relative size for valid is missing (length is 2), only the train-test split is returned (valid is taken to be 0.0). Also, the relative sizes don’t have to sum up to 1.0 (they are normalized automatically). The ratio must not be so unbalanced that it would result in any of the splits being empty (having zero elements). Default is 0.7 (for the train set).

  • stratified (bool) – Whether the split should be stratified. A stratified split means that for each concrete value of the strata field, the given train-val-test ratio is preserved. Usually used on fields representing labels / classes, so that every class is present in each of our splits with the same percentage as in the entire dataset. Default is False.

  • strata_field_name (str) – Name of the field that is to be used to do the stratified split. Only relevant when ‘stratified’ is true. If the name of the strata field is not provided (the default behaviour), the stratified split will be done over the first field that is a target (its ‘is_target’ attribute is True). Note that the values of the strata field have to be hashable. Default is None.

  • random_state (int) – The random seed used for shuffling.

Returns

Datasets for train, (validation) and test splits in that order, depending on the split ratios that were provided.

Return type

tuple[Dataset]

Raises

ValueError – If the given split ratio is not in one of the valid forms. If the given split ratio is in a valid form, but wrong in the sense that it would result with at least one empty split. If stratified is True and the field with the given strata_field_name doesn’t exist.
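
Usage example (a sketch; dataset is assumed to be an already constructed Dataset, and "label" is a hypothetical target field name):

# 80/20 train-test split
train, test = dataset.split(split_ratio=0.8, random_state=42)

# 70/10/20 train-validation-test split, stratified over the "label" field
train, valid, test = dataset.split(
    split_ratio=(0.7, 0.1, 0.2),
    stratified=True,
    strata_field_name="label",
    random_state=42,
)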

TabularDataset

class podium.datasets.TabularDataset(path, fields, format='csv', line2example=None, skip_header=False, csv_reader_params={}, **kwargs)[source]

A dataset type for data stored in a single CSV, TSV or JSON file, where each row of the file is a single example.

Creates a TabularDataset from a file containing the data rows and an object containing all the fields that we are interested in.

Parameters
  • path (str) – Path to the data file.

  • fields ((list | dict)) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, “JSON” (case-insensitive). Ignored if line2example is not None. Defaults to “CSV”.

  • line2example (callable) – The function mapping from a file line to Fields. In case your dataset is not in one of the standardized formats, you can provide a function which performs a custom split for each input line.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Raises

ValueError – If the format given is not one of “CSV”, “TSV” or “JSON” and line2example is not set. If fields is given as a dict and skip_header is True. If format is “JSON” and skip_header is True.
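
Usage example (a sketch; the file path and the column names "text" and "label" are assumptions, and the Field/Vocab setup follows the Podium quickstart):

from podium import Field, LabelField, Vocab
from podium.datasets import TabularDataset

text = Field("text", numericalizer=Vocab())
label = LabelField("label")

# With a dict, keys refer to column names, so the TSV file must have a header row
dataset = TabularDataset(
    "data/reviews.tsv",
    format="tsv",
    fields={"text": text, "label": label},
)
dataset.finalize_fields()   # build the vocabularies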

DiskBackedDataset

class podium.datasets.arrow.DiskBackedDataset(table, fields, cache_path, mmapped_file, data_types=None)[source]

Podium dataset implementation which uses PyArrow as its data storage backend.

Examples are stored in a file which is then memory mapped for fast random access.

Creates a new DiskBackedDataset instance. Users should use static constructor functions like ‘from_dataset’ to construct new DiskBackedDataset instances.

Parameters
  • table (pyarrow.Table) – Table object that contains example data.

  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the examples in ‘table’.

  • cache_path (Optional[str]) – Path to the directory where the cache file is saved.

  • mmapped_file (pyarrow.MemoryMappedFile) – Open MemoryMappedFile descriptor of the cache file.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

__getitem__(item)[source]

Returns an example or a new DiskBackedDataset containing the indexed examples. If indexed with an int, only the example at that position will be returned. If indexed with a slice or iterable, all examples indexed by the object will be collected and a new dataset containing only those examples will be returned.

Examples in the returned Dataset are the same ones present in the original dataset. If a complete deep-copy of the dataset, or its slice, is needed please refer to the get method.

Usage example:

example = dataset[1]                   # Indexing by a single integer returns a single example

new_dataset = dataset[1:10:2]          # Multi-indexing returns a new dataset containing
                                       # the indexed examples

new_dataset_2 = dataset[(1, 5, 6, 9)]  # Returns a new dataset containing the
                                       # indexed Examples

Parameters

item (int or slice or iterable) – Index used to index examples.

Returns

If item is an int, a single example will be returned. If item is a slice, list or tuple, a copy of this dataset containing only the indexed examples will be returned.

Return type

Example or Dataset

__iter__()[source]

Iterates over Examples in this dataset.

Returns

Iterator over all the examples in this dataset.

Return type

Iterator[Example]

__len__()[source]

Returns the number of Examples in this Dataset.

Returns

The number of Examples in this Dataset.

Return type

int

close()[source]

Closes resources held by the DiskBackedDataset.

Only closes the cache file handle. The cache will not be deleted from disk. For cache deletion, use delete_cache.

delete_cache()[source]

Deletes the cache directory and all cache files belonging to this dataset.

After this call is executed, any DiskBackedDataset created by slicing/indexing this dataset and any view over this dataset will not be usable any more. Any dataset created from this dataset should be dumped to a new directory before calling this method.

dump_cache(cache_path=None)[source]

Saves this dataset at cache_path. Dumped datasets can be loaded with the DiskBackedDataset.load_cache static method. All fields contained in this dataset must be serializable using pickle.

Parameters

cache_path (Optional[str]) –

Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

If None, a temporary directory will be created.

Returns

The chosen cache directory path. Useful when cache_path is None and a temporary directory is created.

Return type

str

static from_dataset(dataset, cache_path=None, data_types=None)[source]

Creates a DiskBackedDataset instance from a podium.datasets.DatasetBase instance.

Parameters
  • dataset (DatasetBase) – DatasetBase instance to be used to create the DiskBackedDataset.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

Returns

DiskBackedDataset instance created from the passed DatasetBase instance.

Return type

DiskBackedDataset
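
For example, a sketch of converting an in-memory dataset to a disk-backed one and persisting its cache (dataset is assumed to be an existing DatasetBase instance):

from podium.datasets.arrow import DiskBackedDataset

disk_dataset = DiskBackedDataset.from_dataset(dataset)
cache_dir = disk_dataset.dump_cache()   # persisted to a temporary directory
disk_dataset.close()

reloaded = DiskBackedDataset.load_cache(cache_dir)
print(len(reloaded))
reloaded.delete_cache()                 # remove the cache from disk when done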

static from_examples(fields, examples, cache_path=None, data_types=None, chunk_size=1024)[source]

Creates a DiskBackedDataset from the provided Examples.

Parameters
  • fields (Union[Dict[str, Field], List[Field]]) – Dict or List of Fields used to create the Examples.

  • examples (Iterable[Example]) – Iterable of examples.

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

Returns

DiskBackedDataset instance created from the passed Examples.

Return type

DiskBackedDataset

static from_tabular_file(path, format, fields, cache_path=None, data_types=None, chunk_size=10000, skip_header=False, delimiter=None, csv_reader_params=None)[source]

Loads a tabular file format (csv, tsv, json) as a DiskBackedDataset.

Parameters
  • path (str) – Path to the data file.

  • format (str) – The format of the data file. Has to be either “CSV”, “TSV”, or “JSON” (case-insensitive).

  • fields (Union[Dict[str, Field], List[Field]]) –

    A mapping from data columns to example fields. This allows the user to rename columns from the data file, to create multiple fields from the same column and also to select only a subset of columns to load.

    A value stored in the list/dict can be either a Field (1-to-1 mapping), a tuple of Fields (1-to-n mapping) or None (ignore column).

    If type is list, then it should map from the column index to the corresponding field/s (i.e. the fields in the list should be in the same order as the columns in the file). Also, the format must be CSV or TSV.

    If type is dict, then it should be a map from the column name to the corresponding field/s. Column names not present in the dict’s keys are ignored. If the format is CSV/TSV, then the data file must have a header (column names need to be known).

  • cache_path (Optional[str]) –

    Path to the directory where the cache file will be saved. The whole directory will be used as the cache and will be deleted when delete_cache is called. It is recommended to create a new directory to use exclusively as the cache, or to leave this as None.

    If None, a temporary directory will be created.

  • data_types (Dict[str, Tuple[pyarrow.DataType, pyarrow.DataType]]) – Dictionary mapping field names to pyarrow data types. This is required when a field can have missing data and the data type can’t be inferred. The data type tuple has two values, corresponding to the raw and tokenized data types in an example. None can be used as a wildcard data type and will be overridden by an inferred data type if possible.

  • chunk_size (int) – Maximum number of examples to be loaded before dumping to the on-disk cache file. Use lower number if memory usage is an issue while loading.

  • skip_header (bool) – Whether to skip the first line of the input file. If format is CSV/TSV and ‘fields’ is a dict, then skip_header must be False and the data file must have a header. Default is False.

  • delimiter (str) – Delimiter used to separate columns in a row. If set to None, the default delimiter for the given format will be used.

  • csv_reader_params (Dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

Returns

DiskBackedDataset instance containing the examples from the tabular file.

Return type

DiskBackedDataset
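
Usage example (a sketch; the file path and column names are assumptions):

from podium import Field, LabelField, Vocab
from podium.datasets.arrow import DiskBackedDataset

text = Field("text", numericalizer=Vocab())
label = LabelField("label")

dataset = DiskBackedDataset.from_tabular_file(
    path="data/reviews.csv",
    format="csv",
    fields={"text": text, "label": label},
)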

static load_cache(cache_path)[source]

Loads a cached DiskBackedDataset contained in the cache_path directory. Fields will be loaded into memory, but the Example data will be memory-mapped, avoiding unnecessary memory usage.

Parameters

cache_path (Optional[str]) –

Path to the directory containing the cache files of a previously saved DiskBackedDataset, e.g. the directory returned by dump_cache.

Returns

the DiskBackedDataset loaded from the passed cache directory.

Return type

DiskBackedDataset

HuggingFaceDatasetConverter

class podium.datasets.hf.HFDatasetConverter(hf_dataset, fields=None)[source]

Class for converting rows from the HuggingFace Datasets to podium.Examples.

HFDatasetConverter constructor.

Parameters
  • hf_dataset (datasets.Dataset) – HuggingFace Dataset.

  • fields (dict(str, podium.Field)) – Dictionary that maps a column name of the dataset to a podium.Field. If None, the default feature conversion rules will be used to build a dictionary from the dataset features.

Raises

TypeError – If dataset is not an instance of datasets.Dataset.

__iter__()[source]

Iterate through the dataset and convert the examples.

as_dataset()[source]

Convert the original HuggingFace dataset to a podium.Dataset.

Returns

podium.Dataset instance.

Return type

podium.Dataset
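
Usage example (a sketch; assumes the HuggingFace datasets package is installed and uses its public "imdb" dataset purely as an illustration):

import datasets
from podium.datasets.hf import HFDatasetConverter

hf_imdb = datasets.load_dataset("imdb", split="train")
converter = HFDatasetConverter(hf_imdb)   # default feature-to-Field conversion
podium_imdb = converter.as_dataset()
podium_imdb.finalize_fields()             # build vocabularies before numericalization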

static from_dataset_dict(dataset_dict, fields=None, cast_to_podium=False)[source]

Copies the keys of the given dictionary and converts the corresponding HuggingFace Datasets to HFDatasetConverter instances.

Parameters
  • dataset_dict (dict(str, datasets.Dataset)) – Dictionary that maps dataset names to HuggingFace Datasets.

  • cast_to_podium (bool) – Determines whether to immediately convert the HuggingFace dataset to Podium dataset (if True), or shallowly wrap the HuggingFace dataset in the HFDatasetConverter class. The HFDatasetConverter class currently doesn’t support full Podium functionality and will not work with other components in the library.

Returns

Dictionary that maps dataset names to HFDatasetConverter instances.

Return type

dict(str, Union[HFDatasetConverter, podium.Dataset])

Raises

TypeError – If the given argument is not a dictionary.
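
For example, a sketch converting every split of a HuggingFace DatasetDict at once (the "imdb" dataset name is only an illustration):

import datasets
from podium.datasets.hf import HFDatasetConverter

dataset_dict = datasets.load_dataset("imdb")   # maps split names to datasets.Dataset
splits = HFDatasetConverter.from_dataset_dict(dataset_dict, cast_to_podium=True)
train, test = splits["train"], splits["test"]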

CoNLLUDataset

class podium.datasets.CoNLLUDataset(file_path, fields=None)[source]

A CoNLL-U dataset class.

This class uses all default CoNLL-U fields.

Dataset constructor.

Parameters
  • file_path (str) – Path to the file containing the dataset.

  • fields (Dict[str, Field]) – Dictionary that maps the CoNLL-U field name to the field. If None, the default set of fields will be used.
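
Usage example (a sketch; the file path is an assumption):

from podium.datasets import CoNLLUDataset

dataset = CoNLLUDataset("data/train.conllu")   # uses the default CoNLL-U fields
dataset.finalize_fields()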

static get_default_fields()[source]

Method returns a dict of default CoNLL-U fields.

Returns

fields – Dict containing all default CoNLL-U fields.

Return type

Dict[str, Field]

Built-in datasets (EN)

Stanford Sentiment Treebank

class podium.datasets.impl.SST(file_path, fields, fine_grained=False, subtrees=False)[source]

The Stanford sentiment treebank dataset.

NAME

dataset name

Type

str

URL

url to the SST dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. User should use static method get_dataset_splits rather than using the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. When using the binary version, set your Fields not to be eager.

  • subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

static get_dataset_splits(fields=None, fine_grained=False, subtrees=False)[source]

Method loads and creates dataset splits for the SST dataset.

Parameters
  • fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

  • fine_grained (bool) – If False, returns the binary (positive/negative) SST dataset and filters out neutral examples. When using the binary version, set your Fields not to be eager.

  • subtrees (bool) – also return the subtrees of each input instance as separate instances. This causes the dataset to become much larger.

Returns

(train_dataset, valid_dataset, test_dataset) – tuple containing train, valid and test dataset

Return type

(Dataset, Dataset, Dataset)
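
Usage example (a sketch):

from podium.datasets.impl import SST

sst_train, sst_valid, sst_test = SST.get_dataset_splits()
print(len(sst_train), sst_train[0])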

static get_default_fields()[source]

Method returns the default SST fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Internet Movie DataBase

class podium.datasets.impl.IMDB(dir_path, fields)[source]

Simple IMDB dataset containing only the supervised data, using the unprocessed reviews.

NAME

dataset name

Type

str

URL

url to the imdb dataset

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

POSITIVE_LABEL_DIR

name of the subdirectory containing examples with positive sentiment

Type

str

NEGATIVE_LABEL_DIR

name of the subdirectory containing examples with negative sentiment

Type

str

TEXT_FIELD_NAME

name of the field containing comment text

Type

str

LABEL_FIELD_NAME

name of the field containing label value

Type

str

POSITIVE_LABEL

positive sentiment label

Type

int

NEGATIVE_LABEL

negative sentiment label

Type

int

Dataset constructor. Users should use the static method get_dataset_splits rather than the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

static get_dataset_splits(fields=None)[source]

Method creates the train and test datasets for the IMDB dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`. Users should use the default field names defined in the class attributes.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
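
Usage example (a sketch):

from podium.datasets.impl import IMDB

imdb_train, imdb_test = IMDB.get_dataset_splits()
print(len(imdb_train), imdb_train[0])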

static get_default_fields()[source]

Method returns the default IMDB fields: text and label.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Stanford Natural Language Inference

class podium.datasets.impl.SNLISimple(file_path, fields)[source]

A Simple SNLI Dataset class. This class only uses three fields by default: gold_label, sentence1, sentence2.

NAME

Name of the Dataset.

Type

str

URL

URL to the SNLI dataset.

Type

str

DATASET_DIR

Name of the directory in which the dataset files are stored.

Type

str

ARCHIVE_TYPE

Archive type, i.e. compression method used for archiving the downloaded dataset file.

Type

str

TRAIN_FILE_NAME

Name of the file in which the train dataset is stored.

Type

str

TEST_FILE_NAME

Name of the file in which the test dataset is stored.

Type

str

DEV_FILE_NAME

Name of the file in which the dev (validation) dataset is stored.

Type

str

GOLD_LABEL_FIELD_NAME

Name of the field containing gold label

Type

str

SENTENCE1_FIELD_NAME

Name of the field containing sentence1

Type

str

SENTENCE2_FIELD_NAME

Name of the field containing sentence2

Type

str

Dataset constructor. This method should not be used directly; use get_train_test_dev_dataset instead.

Parameters
  • file_path (str) – Path to the .jsonl file containing the dataset.

  • fields (dict(str, Field)) – A dictionary that maps field names to Field objects.

static get_default_fields()[source]

Method returns the three main SNLI fields in the following order: gold_label, sentence1, sentence2.

Returns

fields – Dictionary mapping field names to respective Fields.

Return type

dict(str, Field)

static get_train_test_dev_dataset(fields=None)[source]

Method creates train, test and dev (validation) Datasets for the SNLI dataset. If the snli_1.0 directory is not present in the current/working directory, it will be downloaded automatically.

Parameters

fields (dict(str, Field), optional) – A dictionary that maps field names to Field objects. If not supplied, `get_default_fields` is used.

Returns

(train_dataset, test_dataset, dev_dataset) – A tuple containing train, test and dev Datasets respectively.

Return type

(Dataset, Dataset, Dataset)
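
Usage example (a sketch):

from podium.datasets.impl import SNLISimple

snli_train, snli_test, snli_dev = SNLISimple.get_train_test_dev_dataset()
print(len(snli_train), snli_train[0])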

Eurovoc Dataset

class podium.datasets.impl.EuroVocDataset(eurovoc_labels, crovoc_labels, documents, mappings, fields=None)[source]

EuroVoc dataset class that contains labeled documents and the label hierarchy.

Dataset constructor.

Parameters
  • eurovoc_labels (dict(int : Label)) – dictionary mapping eurovoc label_ids to labels

  • crovoc_labels (dict(int : Label)) – dictionary mapping crovoc label_ids to labels

  • documents (list(Document)) – list of all documents in dataset

  • mappings (dict(int : list(int))) – dictionary that maps documents_ids to list of their label_ids

  • fields (dict(str : Field)) – dictionary that maps field name to the field

get_all_ancestors(label_id)[source]

Returns ids of all ancestors of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all ancestors of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_crovoc_label_hierarchy()[source]

Returns CroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int : Label)

static get_default_fields()[source]

Method returns default EuroVoc fields: title, text, eurovoc and crovoc labels.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

get_direct_parents(label_id)[source]

Returns ids of direct parents of the label with the given label id.

Parameters

label_id (int) – id of the label

Returns

list of label_ids of all direct parents of the given label or None if the label is not present in the dataset label hierarchies

Return type

list(int)

get_eurovoc_label_hierarchy()[source]

Returns the EuroVoc label hierarchy.

Returns

dictionary that maps label id to label

Return type

dict(int : Label)

is_ancestor(label_id, example)[source]

Checks if the given label_id is an ancestor of any labels of the example.

Parameters
  • label_id (int) – id of the label

  • example (Example) – example from dataset

Returns

True if the label is an ancestor of any of the example’s labels, False otherwise

Return type

boolean

Cornell Movie Dialogs Dataset

class podium.datasets.impl.CornellMovieDialogsConversationalDataset(data, fields=None)[source]

Cornell Movie Dialogs Conversational dataset which contains sentences and replies from movies.

Dataset constructor.

Parameters
  • data (CornellMovieDialogsNamedTuple) – cornell movie dialogs data

  • fields (dict(str : Field)) – dictionary that maps field name to the field

Raises

ValueError – If given data is None.

static get_default_fields()[source]

Method returns default Cornell Movie Dialogs fields: sentence and reply. Both fields share the same vocabulary.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Built-in datasets (HR)

Pauza Reviews Dataset

class podium.datasets.impl.PauzaHRDataset(dir_path, fields)[source]

Simple PauzaHR dataset class which uses original reviews.

URL

url to the PauzaHR dataset

Type

str

NAME

dataset name

Type

str

DATASET_DIR

name of the folder in the dataset containing train and test directories

Type

str

ARCHIVE_TYPE

string that defines archive type, used for unpacking dataset

Type

str

TRAIN_DIR

name of the training directory

Type

str

TEST_DIR

name of the directory containing test examples

Type

str

Dataset constructor. Users should use the static method get_train_test_dataset rather than the constructor directly.

Parameters
  • dir_path (str) – path to the directory containing datasets

  • fields (dict(str, Field)) – dictionary that maps field name to the field

static get_default_fields()[source]

Method returns default PauzaHR fields: rating, source and text.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

static get_train_test_dataset(fields=None)[source]

Method creates the train and test datasets for the PauzaHR dataset.

Parameters

fields (dict(str, Field), optional) – dictionary mapping field name to field; if not given, the method will use `get_default_fields`.

Returns

(train_dataset, test_dataset) – tuple containing train dataset and test dataset

Return type

(Dataset, Dataset)
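
Usage example (a sketch):

from podium.datasets.impl import PauzaHRDataset

pauza_train, pauza_test = PauzaHRDataset.get_train_test_dataset()
print(len(pauza_train), pauza_train[0])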

Named Entity Recognition dataset

class podium.datasets.impl.CroatianNERDataset(tokenized_documents, fields)[source]

Croatian NER dataset.

A single example in the dataset represents a single sentence in the input data.

Dataset constructor. Users should use the static method get_dataset rather than invoking the constructor directly.

Parameters
  • tokenized_documents (list(list(str, str))) – List of tokenized documents. Each document is represented as a list of tuples (token, label). The sentences in a document are delimited by the tuple (None, None).

  • fields (list(Field)) – Dictionary that maps field name to the field

classmethod get_dataset(tokenizer='split', tag_schema='IOB', fields=None, **kwargs)[source]

Method downloads (if necessary) and loads the dataset.

Parameters
  • tokenizer (str | callable) – Word-level tokenizer used to tokenize the input text

  • tag_schema (str) –

    Tag schema used for constructing the token labels

    supported tag schemas:

    ’IOB’: the label of the beginning token of the entity is prefixed with ‘B-‘, the remaining tokens that belong to the same entity are prefixed with ‘I-‘. The tokens that don’t belong to any named entity are labeled ‘O’

  • fields (dict(str, Field)) – dictionary mapping field names to fields. If set to None, the default fields are used.

  • **kwargs

    SCPLargeResource.SCP_USER_KEY:

    User on the host machine. Not required if the user on the local machine matches the user on the host machine.

    SCPLargeResource.SCP_PRIVATE_KEY:

    Path to the ssh private key eligible to access the host machine. Not required on Unix if the private key is in the default location.

    SCPLargeResource.SCP_PASS_KEY:

    Password for the ssh private key (optional). Can be omitted if the private key is not encrypted.

Returns

The loaded dataset.

Return type

CroatianNERDataset
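
Usage example (a sketch; depending on where the data is hosted, SCP credentials may need to be supplied through the keyword arguments listed above):

from podium.datasets.impl import CroatianNERDataset

ner_dataset = CroatianNERDataset.get_dataset(tokenizer="split", tag_schema="IOB")
print(len(ner_dataset))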

static get_default_fields()[source]

Method returns default Croatian NER dataset fields.

Returns

fields – Dictionary mapping field name to field.

Return type

dict(str, Field)

Catacx Datasets

class podium.datasets.impl.CatacxDataset(dir_path, fields=None)[source]

Catacx dataset.

Dataset constructor. Should be given the path to the .json file containing the Catacx dataset.

Parameters
  • dir_path (str) – path to the file containing the dataset.

  • fields (dict(str, Field)) – Dictionary that maps field name to the field. If None, the default set of fields will be used.

static get_dataset(fields=None)[source]

Downloads (if necessary) and loads the dataset. Not supported yet. Raises NotImplementedError if called.

Parameters

fields (dict(str, Field)) – Dictionary that maps field name to the field. If None, the default set of fields will be used.

Returns

The loaded dataset.

Return type

CatacxDataset

static get_default_fields()[source]

Method returns a dict of default Catacx fields.

Returns

fields – dict containing all default Catacx fields

Return type

dict(str, Field)