TakeLab software and resources
This page contains a summary of freely available software and resources created by
TakeLab at the University of Zagreb, Croatia.
Tools and libraries
- Coral - A corpus aligner. Currently offline
- CroNER - A Croatian Named Entity Recognizer.
- DiaCRO - A diacritic restoration tool for Croatian. Currently offline
- GPKEX - Genetically programmed keyphrase extraction.
- libsentences - A library for sentence boundary detection.
- MINERAL - A tool for extracting disease mentions from clinical text.
- TakeLab STS - A semantic textual similarity system from the SemEval 2012 shared task.
- TermeX - A terminology extraction tool.
- TweetingJay - A tool for recognizing semantically identical tweets (in English).
Resources
- agenda - Preprocessed news articles annotated with topics.
- argpremises - A corpus of matched claims with implicit premises.
- ComArg - A corpus of online user comments with arguments.
- Cro6WSD - A small Croatian word sense disambiguation dataset.
- Cro36WSD - A medium-sized multi-label Croatian word sense disambiguation dataset.
- CroCoref - A corpus with manually annotated entity coreference.
- CroLexSub - A preliminary lexical substitution dataset for Croatian. New
- CroMWE - Human-annotated examples of MWEs and non-MWE n-grams in Croatian, extended with linguistic feature labels. New
- CroMWEsc - A dataset annotated with semantic compositionality of Croatian Multiword Expressions.
- Cropinion - A dataset for opinion mining from Croatian user reviews.
- CroSentCmp - A comparison of different sentiment classification methods for Croatian. New
- CroSyn - A synonym choice dataset for Croatian.
- CroSemRel450 - A word semantic relatedness dataset for Croatian.
- CroWSI - Graph-based induction of word senses in Croatian.
- DerivBase.hr - A large-coverage derivational morphology resource for Croatian.
- dm.hr - A distributional memory for Croatian.
- Event-centered information retrieval evaluation collections - Two collections of queries and documents in English for event-centered information retrieval.
- An event coreference dataset - A dataset with annotated event coreference (English).
- A factual event anchor extraction dataset - A dataset of 750 English newswire texts annotated for factual event mentions and 105 manually annotated event graphs.
- FAQ retrieval dataset - A dataset with queries and relevance judgements for FAQ retrieval in Croatian.
- fHrWaC - A filtered version of the hrWaC corpus (Croatian).
- HeidelTime.hr - Croatian resources for the HeidelTime tagger.
- kex.hr - A keyphrase extraction evaluation dataset for Croatian.
- NN13205 - An indexed Croatian legislative collection.
- HOFM - A higher-order functional morphology framework.
- MOLEX - A morphological lexicon for Croatian. Currently offline
- Recognizing identical events dataset - A dataset with annotated identical and similar events (English).
- Semantic analogies dataset - A dataset with semantic analogies.
- Sentiment Lexica - Prior sentiment lexica for English and Croatian.
- VerbCROcean - A repository of fine-grained semantic verb relations for Croatian.
- WikiWarsHr - A temporally tagged corpus of historical narratives (Croatian).
- wmic-lexsub - Representing Word Meaning in Context via Lexical Substitutes. New
- PreTox - A dataset for preemptive toxic language detection in Wikipedia comments.
Last revision: 25 September 2019