Coreference Resolution Dataset for Croatian

Version: 1.0
Release date: September 4, 2015

1 Description

Coreference resolution is the task of identifying different mentions of the same real-world entity in text. We make available a set of news stories from the Croatian newspaper "Vjesnik", manually annotated with coreference relations between entity mentions. The published dataset is the test set that we used to evaluate the constrained mention-pair model for coreference resolution in Croatian, described in the following paper:

Goran Glavaš, Jan Šnajder (2015). Resolving Entity Coreference in Croatian with a Constrained Mention-Pair Model. Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, at the Conference on Recent Advances in Natural Language Processing (RANLP 2015), Hissar.

Should you decide to use the dataset, please cite the paper. The BibTeX format is:

@InProceedings{glavavs2015resolving,
title={Resolving Entity Coreference in Croatian with a Constrained Mention-Pair Model},
author={Glava\v{s}, Goran and {\v S}najder, Jan},
booktitle={The 5th Workshop on Balto-Slavic Natural Language Processing},
year={2015},
pages={17--23}
}

2 Dataset

Download the coreference dataset from here: coref dataset.

The dataset contains 55 news articles with approximately 2600 manually annotated coreference relations. Only entity mentions participating in at least one coreference relation (i.e., having at least one other coreferent mention) are annotated in the dataset.
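
Because the annotations are pairwise relations, a common first step when working with such data is to group mentions into entity clusters (coreference chains). The following is a minimal sketch of that grouping using a union-find structure. It assumes the relations have already been read into pairs of mention identifiers; the on-disk file format is not described here, and the function name and the identifiers `m1` ... `m5` are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def cluster_mentions(relations):
    """Group mentions into clusters from pairwise coreference relations.

    `relations` is an iterable of (mention_a, mention_b) pairs. The pair
    representation is an assumption made for illustration only.
    """
    parent = {}

    def find(x):
        # Register unseen mentions as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the clusters of every coreferent pair.
    for a, b in relations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect mentions under their final root.
    clusters = defaultdict(set)
    for mention in parent:
        clusters[find(mention)].add(mention)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical mention identifiers, not taken from the dataset.
relations = [("m1", "m2"), ("m2", "m3"), ("m4", "m5")]
clusters = cluster_mentions(relations)
print(clusters)
```

Since only mentions with at least one coreferent mention are annotated, every cluster built this way contains at least two mentions; singleton entities do not appear.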

3 License

CroCoref: all files made available by TakeLab, FER, are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.