TakeLab software and resources

This page contains a summary of freely available software and resources created by TakeLab at the University of Zagreb, Croatia.

Tools and libraries

Coral - A corpus aligner. Currently offline
CroNER - A Croatian Named Entity Recognizer.
DiaCRO - A diacritics restoration tool for Croatian. Currently offline
GPKEX - Genetically programmed keyphrase extraction.
libsentences - A library for sentence boundary detection.
MINERAL - A tool for extracting disease mentions from clinical text.
TakeLab STS - A semantic textual similarity system from the SemEval 2012 shared task.
TermeX - A terminology extraction tool.
TweetingJay - A tool for recognizing semantically identical tweets (in English).

Resources

agenda - Preprocessed news articles annotated with topics.
argpremises - A corpus of matched claims with implicit premises.
ComArg - A corpus of online user comments with arguments.
Cro6WSD - A small Croatian word sense disambiguation dataset.
Cro36WSD - A medium-sized multi-label Croatian word sense disambiguation dataset.
CroCoref - A corpus with manually annotated entity coreference.
CroLexSub - A preliminary lexical substitution dataset for Croatian. New
CroMWE - Human-annotated examples of MWEs and non-MWE n-grams in Croatian, extended with linguistic feature labels. New
CroMWEsc - A dataset annotated with semantic compositionality of Croatian multiword expressions.
Cropinion - A dataset for opinion mining from Croatian user reviews.
CroSentCmp - A comparison of different sentiment classification methods for Croatian. New
CroSyn - A synonym choice dataset for Croatian.
CroSemRel450 - A word semantic relatedness dataset for Croatian.
CroWSI - Graph-based induction of word senses in Croatian.
DerivBase.hr - A large-coverage derivational morphology resource for Croatian.
dm.hr - A distributional memory for Croatian.
Event-centered information retrieval evaluation collections - Two collections of queries and documents in English for event-centered information retrieval.
An event coreference dataset - A dataset with annotated event coreference (English).
A factual event anchor extraction dataset - A dataset of 750 English newswire texts annotated for factual event mentions and 105 manually annotated event graphs.
FAQ retrieval dataset - A dataset with queries and relevance judgements for FAQ retrieval in Croatian.
fHrWaC - A filtered version of the hrWaC corpus (Croatian).
HeidelTime.hr - Croatian resources for the HeidelTime tagger.
kex.hr - A keyphrase extraction evaluation dataset for Croatian.
NN13205 - An indexed Croatian legislative collection.
HOFM - A higher-order functional morphology framework.
MOLEX - A morphological lexicon for Croatian. Currently offline
Recognizing identical events dataset - A dataset with annotated identical and similar events (English).
Semantic analogies dataset - A dataset with semantic analogies.
Sentiment Lexica - Prior sentiment lexica for English and Croatian.
VerbCROcean - A repository of fine-grained semantic verb relations for Croatian.
WikiWarsHr - A temporally tagged corpus of historical narratives (Croatian).
wmic-lexsub - Representing Word Meaning in Context via Lexical Substitutes. New
PreTox - A dataset for preemptive toxic language detection in Wikipedia comments.

Last revision: 25 September 2019