Podium

Getting Started

  • Installation
    • Install via pip
    • Install via conda
    • Installing from source
  • Quickstart
    • Preprocessing data with Fields
    • Adding your own preprocessing with hooks
    • Mapping tokens to indices
    • Retrieving processed data
    • Minibatching data
  • Walkthrough
    • Loading datasets
      • Built-in datasets
      • Loading 🤗 datasets
      • Loading your custom dataset
    • The Vocabulary
      • Finalizing vocabularies
      • Customizing Vocabs
    • Customizing the preprocessing pipeline with Fields
      • LabelField
    • Iterating over datasets
    • Loading pretrained word vectors
    • Using TF-IDF or count vectorization

In Depth Overview

  • Podium data flow
  • How to interact with Fields
    • Lowercase as a pretokenization hook
    • Removing punctuation as a post-tokenization hook
    • Putting it all together
  • Special tokens
  • Custom numericalization functions
  • Fields with multiple outputs
    • The Multioutput Field
  • Dataset manipulation
    • Dataset splitting
    • Dataset concatenation
  • Bucketing instances when iterating
  • Saving and loading Podium components
  • Documentation chapters in progress
    • Handling datasets with missing data
    • Creating your own Dataset subclass

Examples

  • TF-IDF + scikit-learn SVM
    • Using ngram features
  • PyTorch RNN classifier
    • Loading a dataset from 🤗/datasets
    • Loading pretrained embeddings
    • Defining a simple neural model in PyTorch
    • Minibatching data in Podium

Preprocessing Tools

  • Hooks
    • Moses Normalizer
    • Regex Replace
    • Text Cleanup
    • NLTK Stemmer
    • spaCy Lemmatizer
    • Truecase
    • Stopword removal
    • Keyword extraction
  • Utilities
    • spaCy sentencizer
    • Hook conversion

Package Reference

  • Fields and Vocab
    • Field
    • MultioutputField
    • LabelField
    • MultilabelField
    • Vocab
    • Special tokens
      • The unknown token
      • The padding token
      • The beginning-of-sequence token
      • The end-of-sequence token
  • Built-in preprocessing tools
    • Hooks
      • Moses Normalizer
      • Regex Replace
      • Keyword Extractor
      • Text Cleanup
      • NLTK Stemmer
      • spaCy Lemmatizer
      • Truecasing
      • Stopwords removal
    • Tokenizers
    • Utilities
  • Dataset classes
    • Dataset
    • TabularDataset
    • DiskBackedDataset
    • HFDatasetConverter
    • CoNLLUDataset
    • Built-in datasets (EN)
      • Stanford Sentiment Treebank
      • Internet Movie Database
      • Stanford Natural Language Inference
      • Cornell Movie Dialogs
  • Iterators
    • Iterator
    • BucketIterator
    • SingleBatchIterator
    • HierarchicalIterator
  • Data vectorizers
    • TF-IDF vectorizer
    • GloVe vectorizer
    • Nlpl vectorizer
    • Default vector initializers

Modules Under Development

  • Abstractions for model training with Podium
    • Models
    • Pipeline
    • Model selection
    • Model validation
  • Concrete implementations of models

© Copyright 2020, TakeLab, FER, Zagreb
