Data vectorizers
Tf-Idf vectorizer
-
class podium.vectorizers.TfIdfVectorizer(vocab=None, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False, specials=None)
Converts data from one field of the examples into a matrix of tf-idf features.
It is equivalent to the scikit-learn TfidfVectorizer available at https://scikit-learn.org and depends on the TfidfTransformer class defined in the scikit-learn library.
The constructor initializes the tf-idf vectorizer. All parameters except vocab are passed to TfidfTransformer; see the scikit-learn documentation for details on them.
- Parameters
vocab (Vocab, optional) – vocabulary instance, given either as field.vocab or as a vocab from another source. If None, it is initialized from the field during fit.
norm – see the scikit-learn TfidfTransformer documentation
use_idf – see the scikit-learn TfidfTransformer documentation
smooth_idf – see the scikit-learn TfidfTransformer documentation
sublinear_tf – see the scikit-learn TfidfTransformer documentation
specials (list(str), optional) – list of tokens for which tf-idf is not calculated. If None, the vocab specials are used.
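The effect of the parameters forwarded to TfidfTransformer can be sketched in plain Python. This is a minimal illustration, not podium's or scikit-learn's implementation. It assumes the formulas given in the scikit-learn documentation: smoothed idf = ln((1 + n) / (1 + df)) + 1, unsmoothed idf = ln(n / df) + 1, and sublinear tf = 1 + ln(tf). The function name tfidf_weights is hypothetical.

```python
import math


def tfidf_weights(counts, norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False):
    """Weight a dense count matrix (list of rows) following the documented formulas."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0

    # Document frequency: number of rows in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    if use_idf:
        if smooth_idf:
            idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
        else:
            # Assumption: every term occurs in at least one row (d > 0).
            idf = [math.log(n_docs / d) + 1 for d in df]
    else:
        idf = [1.0] * n_terms

    weighted = []
    for row in counts:
        # Sublinear scaling only applies to nonzero counts.
        tf = [(1 + math.log(c)) if (sublinear_tf and c > 0) else float(c) for c in row]
        vec = [t * w for t, w in zip(tf, idf)]
        if norm == "l2":
            length = math.sqrt(sum(v * v for v in vec))
        elif norm == "l1":
            length = sum(abs(v) for v in vec)
        else:
            length = 1.0
        weighted.append([v / length if length else 0.0 for v in vec])
    return weighted


# Two documents over a three-term vocabulary.
counts = [[2, 1, 0], [0, 1, 1]]
weights = tfidf_weights(counts)
```

With the defaults, each row of weights has unit Euclidean length, and a term that appears in every document receives the smallest idf.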
-
fit(dataset, field)
Learns idf from the data in the given field of the dataset.
- Parameters
- Returns
self
- Return type
- Raises
ValueError – If dataset or field is None, or if the name of the given field is not in the dataset.
-
transform(examples, **kwargs)
Transforms examples into an example-term matrix, using the vocabulary given in the constructor.
- Parameters
examples (iterable) – an iterable that yields arrays of numericalized tokens
- Returns
X – Tf-idf weighted document-term matrix
- Return type
sparse matrix, [n_samples, n_features]
- Raises
ValueError – If examples is None.
RuntimeError – If vectorizer is not fitted yet.
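The input expected by transform can be pictured with a small sketch. It assumes that each example is a sequence of integer vocabulary indices and that special tokens are simply not counted. The names count_matrix and special_indices are hypothetical, and a dense list of lists stands in for the sparse matrix.

```python
def count_matrix(examples, vocab_size, special_indices=()):
    """Build an [n_samples, n_features] count matrix from numericalized examples."""
    if examples is None:
        raise ValueError("examples must not be None")

    skip = set(special_indices)
    rows = []
    for tokens in examples:
        row = [0] * vocab_size
        for index in tokens:
            # Assumption: special tokens contribute nothing to the counts.
            if index not in skip:
                row[index] += 1
        rows.append(row)
    return rows


# Two examples over a five-token vocabulary; index 0 is treated as a special token.
matrix = count_matrix([[0, 2, 2, 4], [1, 0, 3]], vocab_size=5, special_indices=[0])
```

A matrix of this shape is what the tf-idf weighting is then applied to, one row per example and one column per vocabulary entry.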
GloVe vectorizer
-
class podium.vectorizers.GloVe(name='glove-wikipedia', dim=300, default_vector_function=<function random_normal_default_vector>, cache_path=None, max_vectors=None)
Concrete vector storage for the GloVe vectors described at https://nlp.stanford.edu/projects/glove/. The class contains a large resource so that the vectors can be downloaded automatically on first use.
-
NAME_URL_MAPPING
dictionary that maps a GloVe instance name to its download URL
- Type
dict(str, str)
-
NAME_DIM_MAPPING
dictionary that maps a GloVe instance name to the vector dimensions available for that instance
- Type
dict(str, set(str))
-
_NAME_FILE_MAPPING
dictionary that maps a GloVe instance name to the filenames available in the vectors folder
- Type
dict(str, str)
-
_ARCHIVE_TYPE
type of archive in which the vectors are stored while downloading
- Type
str
-
_BINARY
defines whether the vectors are stored in binary format. GloVe vectors are stored in binary format
- Type
bool
GloVe constructor that initializes vector storage and downloads vectors if necessary.
- Parameters
name (str) – name of the GloVe vectors instance; available names are listed in the NAME_URL_MAPPING dictionary
dim (int) – vector dimension; available dimensions are listed in the NAME_DIM_MAPPING dictionary
default_vector_function (callable, optional) – function that produces the vector returned when the vectorizer has no representation for a given token. If None and the token doesn't exist, an error is raised when the vector is requested
cache_path (str) – path for caching vectors; useful when not all vectors are loaded from the file, either because the number of vectors is limited (see max_vectors) or because only the vectors for a vocabulary are loaded
max_vectors (int) – maximum number of vectors to load into memory
- Raises
ValueError – If the given name is not among the NAME_URL_MAPPING keys, or if the given vector dimension is not available. Supported dimensions are listed in the NAME_DIM_MAPPING dictionary.
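The name and dimension check described above can be sketched as follows. The mapping contents and the function check_name_and_dim are hypothetical placeholders; the real contents of NAME_URL_MAPPING and NAME_DIM_MAPPING are defined by the library and are not reproduced here.

```python
# Hypothetical stand-ins for NAME_URL_MAPPING and NAME_DIM_MAPPING.
EXAMPLE_NAME_URL_MAPPING = {"example-glove": "https://example.invalid/vectors.zip"}
EXAMPLE_NAME_DIM_MAPPING = {"example-glove": {50, 100}}


def check_name_and_dim(name, dim):
    """Return the download URL, or raise ValueError for an unknown name or dimension."""
    if name not in EXAMPLE_NAME_URL_MAPPING:
        known = sorted(EXAMPLE_NAME_URL_MAPPING)
        raise ValueError(f"Unknown vectors name {name!r}; expected one of {known}")

    supported = EXAMPLE_NAME_DIM_MAPPING[name]
    if dim not in supported:
        raise ValueError(
            f"Unsupported dimension {dim} for {name!r}; expected one of {sorted(supported)}"
        )
    return EXAMPLE_NAME_URL_MAPPING[name]


url = check_name_and_dim("example-glove", 50)
```

Validating both values before any download starts means a mistyped name or dimension fails immediately rather than after a large transfer.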
Nlpl vectorizer
-
class podium.vectorizers.NlplVectorizer(default_vector_function=<function zeros_default_vector>, cache_path=None, max_vectors=None)
Vectorizer base class constructor.
- Parameters
path (str) – path to stored vectors
default_vector_function (callable, optional) – function that produces the vector returned when the vectorizer has no representation for a given token. If None and the token doesn't exist, an error is raised when the vector is requested
cache_path (str, optional) – path to cached vectors. Caching should be used when loading vectors for a vocab or when limiting the number of vectors to load
max_vectors (int, optional) – maximum number of vectors to load into memory
Default vector initializers
-
podium.vectorizers.zeros_default_vector(token, dim)
Creates a default vector for the given token in the form of an array of zeros. The dimension of the returned array equals the given dim.
- Parameters
token (str) – string token from vocabulary
dim (int) – vector dimension
- Returns
vector – zeros vector with given dimension
- Return type
array-like
- Raises
If dim is None.
-
podium.vectorizers.random_normal_default_vector(token, dim)
Draws a random vector from a standard normal distribution. The dimension of the returned array equals the given dim.
- Parameters
token (str) – string token from vocabulary
dim (int) – vector dimension
- Returns
vector – vector of the given dimension, sampled from a standard normal distribution
- Return type
array-like
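How a default vector function acts as a fallback for unknown tokens can be sketched with standard-library stand-ins. The functions zeros_vector, random_normal_vector and lookup are hypothetical and are not the podium implementations. The exception types used here (ValueError for a missing dim, KeyError for a missing token without a default) are assumptions, since the documentation above does not name them in every case.

```python
import random


def zeros_vector(token, dim):
    """Stand-in for a zeros default: a list of `dim` zeros, whatever the token."""
    if dim is None:
        # Assumption: the exception type is not stated in the documentation.
        raise ValueError("dim must not be None")
    return [0.0] * dim


def random_normal_vector(token, dim):
    """Stand-in for a random-normal default: `dim` draws from N(0, 1)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def lookup(vectors, token, dim, default_vector_function=None):
    """Return the stored vector, or fall back to the default vector function."""
    if token in vectors:
        return vectors[token]
    if default_vector_function is None:
        # Assumption: KeyError stands in for the unspecified error.
        raise KeyError(token)
    return default_vector_function(token, dim)


vectors = {"cat": [0.1, 0.2, 0.3]}
known = lookup(vectors, "cat", 3, zeros_vector)
unknown = lookup(vectors, "dog", 3, zeros_vector)
sampled = lookup(vectors, "dog", 3, random_normal_vector)
```

Any callable that accepts (token, dim) fits this contract, so a custom fallback can be supplied in the same way.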