Data vectorizers

Tf-Idf vectorizer

class podium.vectorizers.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)

Converts data from one field of the examples into a matrix of tf-idf features.

This class is equivalent to the scikit-learn TfidfVectorizer (see https://scikit-learn.org) and depends on the TfidfTransformer defined in the scikit-learn library.

The constructor initializes the tf-idf vectorizer. All parameters except vocab are passed to TfidfTransformer; see the scikit-learn documentation for further details on them.

Parameters
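  • vocab (Vocab, optional) – vocabulary instance, given either as field.vocab or as a vocab from another source. If None, it is initialized from the field during fit.

  • norm – normalization applied to each output row ('l1', 'l2' or None); see the scikit-learn TfidfTransformer documentation

  • use_idf – whether to apply inverse-document-frequency reweighting; see the scikit-learn TfidfTransformer documentation

  • smooth_idf – whether to add one to the document frequencies so that zero divisions are avoided; see the scikit-learn TfidfTransformer documentation

  • sublinear_tf – whether to replace tf with 1 + log(tf); see the scikit-learn TfidfTransformer documentation

  • specials (list(str), optional) – list of tokens for which tf-idf is not calculated. If None, the vocab specials are used.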
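To make the effect of norm, use_idf, smooth_idf and sublinear_tf concrete, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for TfidfTransformer. It is not podium's or scikit-learn's implementation; the function name tfidf_weights and its arguments (per-term counts, document frequencies, number of documents) are assumptions made for illustration.

```python
import math

def tfidf_weights(counts, doc_freq, n_docs, norm="l2",
                  use_idf=True, smooth_idf=True, sublinear_tf=False):
    """Weight the term counts of one document (illustrative sketch only)."""
    weights = []
    for tf, df in zip(counts, doc_freq):
        tf = float(tf)
        if sublinear_tf and tf > 0:
            tf = 1.0 + math.log(tf)  # dampen large raw counts
        if use_idf:
            if smooth_idf:
                # behaves as if one extra document contained every term
                idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            else:
                # requires df > 0, otherwise this divides by zero
                idf = math.log(n_docs / df) + 1.0
            tf *= idf
        weights.append(tf)
    if norm == "l2":
        length = math.sqrt(sum(w * w for w in weights))
    elif norm == "l1":
        length = sum(abs(w) for w in weights)
    else:
        length = 0.0  # norm=None: leave the row unnormalized
    return [w / length for w in weights] if length else weights

# One document with three vocabulary terms, in a corpus of four documents.
row = tfidf_weights(counts=[3, 0, 1], doc_freq=[2, 1, 4], n_docs=4)
```

With the default norm='l2' the returned row has unit Euclidean length, so documents of different lengths become comparable.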

fit(dataset, field)

Learns idf from the data in the given field of the dataset.

Parameters
  • dataset (Dataset) – dataset instance containing the data on which to build the idf matrix

  • field (Field) – the field in the dataset to use for tf-idf

Returns

self

Return type

TfIdfVectorizer

Raises

ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.

transform(examples, **kwargs)

Transforms examples into an example-term matrix. Uses the vocabulary given in the constructor.

Parameters

examples (iterable) – an iterable that yields arrays of numericalized tokens

Returns

X – Tf-idf weighted document-term matrix

Return type

sparse matrix, [n_samples, n_features]

Raises
  • ValueError – If examples is None.

  • RuntimeError – If vectorizer is not fitted yet.
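The input and output shapes of transform can be illustrated with a small pure-Python sketch that turns numericalized examples into a document-term count matrix, the step that precedes tf-idf weighting. This is not podium's implementation: the helper count_matrix, the dense list-of-lists result (the real method returns a sparse matrix) and the choice to leave special tokens as all-zero columns are assumptions for illustration.

```python
def count_matrix(examples, vocab_size, special_indices=()):
    """Build a dense [n_samples, n_features] count matrix (sketch only)."""
    if examples is None:
        raise ValueError("examples must not be None")
    skip = set(special_indices)  # indices of tokens excluded from counting
    rows = []
    for token_ids in examples:
        row = [0] * vocab_size
        for idx in token_ids:
            if idx not in skip:
                row[idx] += 1
        rows.append(row)
    return rows

# Two examples over a vocabulary of four entries; indices 0 and 1 stand in
# for special tokens (hypothetical values).
matrix = count_matrix([[2, 3, 3, 1], [0, 2, 2]], vocab_size=4,
                      special_indices=[0, 1])
```

Each row corresponds to one example and each column to one vocabulary index, which matches the [n_samples, n_features] shape described above.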

GloVe vectorizer

class podium.vectorizers.GloVe(name='glove-wikipedia', dim=300, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None)

Concrete vector storage for the GloVe vectors described at https://nlp.stanford.edu/projects/glove/. The class contains a Large resource so that the vectors can be downloaded automatically on first use.

NAME_URL_MAPPING

dictionary that maps a GloVe instance name to its download URL

Type

dict(str, str)

NAME_DIM_MAPPING

dictionary that maps a GloVe instance name to the vector dimensions available for that instance

Type

dict(str, set(str))

_NAME_FILE_MAPPING

dictionary that maps a GloVe instance name to the filenames available in the vectors folder

Type

dict(str, str)

_ARCHIVE_TYPE

type of archive in which the vectors are stored while downloading

Type

str

_BINARY

defines whether the vectors are stored in binary format. GloVe vectors are stored in binary format.

Type

bool

GloVe constructor that initializes vector storage and downloads vectors if necessary.

Parameters
  • name (str) – name of the GloVe vectors instance; the available names are the keys of the NAME_URL_MAPPING dictionary

  • dim (int) – vector dimension; the available dimensions are listed in the NAME_DIM_MAPPING dictionary

  • default_vector_function (callable, optional) – function that produces the vector returned when the vectorizer has no representation for a given token. If None and the token doesn’t exist, an error is raised when the vector is requested.

  • cache_path (str) – path for caching vectors. Caching is useful when not all vectors are loaded from the file, either because the number of vectors is limited (see max_vectors) or because only the vectors for a vocabulary are loaded.

  • max_vectors (int) – maximum number of vectors to load into memory

Raises

ValueError – If given name is not in NAME_URL_MAPPING keys or if the given vectors dimension is not available. Supported dimensions are available in NAME_DIM_MAPPING dictionary.
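The argument check described above could look roughly like the following sketch. The mapping contents, the URL and the helper name check_glove_args are hypothetical placeholders rather than podium's actual values, and dimensions are stored as ints here for simplicity.

```python
# Hypothetical stand-ins for the class attributes; the real contents differ.
NAME_URL_MAPPING = {"glove-example": "https://example.invalid/glove-example.zip"}
NAME_DIM_MAPPING = {"glove-example": {50, 100}}

def check_glove_args(name, dim):
    """Validate name and dim against the two mappings (sketch only)."""
    if name not in NAME_URL_MAPPING:
        raise ValueError(
            f"Unknown name {name!r}; expected one of {sorted(NAME_URL_MAPPING)}"
        )
    if dim not in NAME_DIM_MAPPING[name]:
        raise ValueError(
            f"Dimension {dim} is not available for {name!r}; "
            f"expected one of {sorted(NAME_DIM_MAPPING[name])}"
        )
    return name, dim

checked = check_glove_args("glove-example", 100)
```

Validating both arguments before any download starts means a typo in the name or dimension fails fast instead of after fetching a large archive.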

Nlpl vectorizer

class podium.vectorizers.NlplVectorizer(default_vector_function=<function zeros_default_vector>, cache_path=None, max_vectors=None)

Vectorizer base class constructor.

Parameters
  • path (str) – path to stored vectors

  • default_vector_function (callable, optional) – function that produces the vector returned when the vectorizer has no representation for a given token. If None and the token doesn’t exist, an error is raised when the vector is requested.

  • cache_path (str, optional) – path to cached vectors. Caching should be used when a vocab is used for loading vectors or when the number of vectors to load is limited.

  • max_vectors (int, optional) – maximum number of vectors to load in memory
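How max_vectors and default_vector_function interact can be sketched in plain Python as follows. The helpers load_vectors, lookup and zeros are hypothetical and do not reproduce podium's code; raising KeyError for a missing token without a default function is also an assumption, since the text above only says that an error is raised.

```python
from itertools import islice

def load_vectors(pairs, max_vectors=None):
    """Keep at most max_vectors (token, vector) pairs; None keeps them all."""
    return dict(islice(pairs, max_vectors))

def lookup(vectors, token, dim, default_vector_function=None):
    """Return the stored vector, or fall back to the default function."""
    if token in vectors:
        return vectors[token]
    if default_vector_function is None:
        raise KeyError(f"no vector stored for token {token!r}")
    return default_vector_function(token, dim)

def zeros(token, dim):
    """Local stand-in for a default vector function."""
    return [0.0] * dim

# Only the first two vectors are loaded, so "emu" needs the fallback.
stored = load_vectors(
    [("cat", [0.1, 0.2]), ("dog", [0.3, 0.4]), ("emu", [0.5, 0.6])],
    max_vectors=2,
)
fallback = lookup(stored, "emu", dim=2, default_vector_function=zeros)
```

Limiting the number of loaded vectors trades coverage for memory, which is why a default vector function matters more when max_vectors is set.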

Default vector initializers

podium.vectorizers.zeros_default_vector(token, dim)

Creates a default vector for the given token in the form of an array of zeros. The dimension of the returned array is equal to the given dim.

Parameters
  • token (str) – string token from vocabulary

  • dim (int) – vector dimension

Returns

vector – zeros vector with given dimension

Return type

array-like

Raises

If dim is None.

podium.vectorizers.random_normal_default_vector(token, dim)

Draws a random vector from a standard normal distribution. The dimension of the returned array is equal to the given dim.

Parameters
  • token (str) – string token from vocabulary

  • dim (int) – vector dimension

Returns

vector – vector of the given dimension sampled from a standard normal distribution

Return type

array-like
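Both initializers share the (token, dim) signature, so a custom one can be swapped in wherever default_vector_function is accepted. The sketch below mirrors their described behaviour using only the standard library. The names zeros_vector and random_normal_vector, the optional rng argument, the use of plain lists instead of arrays and the ValueError type are assumptions made for illustration.

```python
import random

def zeros_vector(token, dim):
    """Return a list of dim zeros; the token itself is ignored."""
    if dim is None:
        raise ValueError("dim must not be None")
    return [0.0] * dim

def random_normal_vector(token, dim, rng=random):
    """Return dim samples drawn from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

# A seeded generator makes the random default reproducible between runs.
seeded = random_normal_vector("unseen-token", 4, rng=random.Random(0))
```

A zeros default keeps unknown tokens neutral, while a random default gives each unknown token a distinct vector. Which of the two suits a model better depends on how the vectors are used downstream.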