The Event Coreference Uncertainty Dataset

Version: 1.0
Release date: March 11, 2013

1 Description

The Event Coreference Uncertainty Dataset is a dataset of event mention pairs annotated with event coreference confidence scores.

The fraction of coreferential event mention pairs is negligible compared to the set of all possible event mention pairs, in both the within- and cross-document settings. To ensure a sufficient number of coreferring event mention pairs, mentions were drawn from a limited number of topically related groups of documents. For this purpose we used a set of newspaper documents obtained through the NewsBrief service of the European Media Monitor, which groups documents describing the same seminal event. We selected 70 seminal event groups and 2-3 documents from each group. From the chosen documents we extracted the event anchors (using an in-house anchor extraction tool performing at an F-score of 81%) and the main event arguments. The final dataset consists of 1006 event mention pairs (437 within-document pairs and 569 cross-document pairs).

The dataset accompanies the following paper:
Glavaš G., Šnajder J.: Exploring Coreference Uncertainty of Generically Extracted Event Mentions. In: Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2013), pp. 408-422. Springer, Samos, Greece (24-30 March 2013).

If you use the dataset, please cite the paper. The BibTeX format is:

@inproceedings{glavas2013exploring,
  title={Exploring Coreference Uncertainty of Generically Extracted Event Mentions},
  author={Glava{\v{s}}, Goran and {\v{S}}najder, Jan},
  booktitle={Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics {(CICLing 2013)}},
  year={2013},
  month={24-30 March},
  address={Samos, Greece},
  publisher={Springer},
  pages={408--422}
}

2 Dataset

Each mention pair is represented as an individual XML document with the following important elements:
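As an illustration of how such per-pair XML documents might be processed, the sketch below parses a single mention-pair document with Python's standard-library `xml.etree.ElementTree`. Note that the element and attribute names used here (`pair`, `mention`, `anchor`, `argument`, `confidence`) are assumptions for the sake of the example and may differ from the actual names in the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical example of a single mention-pair document. The actual
# element names in the dataset may differ; anchors, arguments, and an
# annotated coreference confidence score are assumed here.
SAMPLE = """
<pair id="wd-001" type="within-document">
  <mention id="m1">
    <anchor>exploded</anchor>
    <argument role="location">Baghdad</argument>
  </mention>
  <mention id="m2">
    <anchor>blast</anchor>
    <argument role="location">Baghdad</argument>
  </mention>
  <confidence>0.85</confidence>
</pair>
"""

def parse_pair(xml_text):
    """Parse one mention-pair XML document into a plain dict."""
    root = ET.fromstring(xml_text)
    mentions = []
    for m in root.findall("mention"):
        mentions.append({
            "id": m.get("id"),
            "anchor": m.findtext("anchor"),
            "arguments": {a.get("role"): a.text for a in m.findall("argument")},
        })
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "mentions": mentions,
        "confidence": float(root.findtext("confidence")),
    }

pair = parse_pair(SAMPLE)
print(pair["type"], pair["confidence"])  # within-document 0.85
```

For the real dataset, the same function could be applied to each file via `ET.parse(path).getroot()` instead of `ET.fromstring`.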

3 License

This dataset is available under a derivative of the BSD license that requires proper attribution. Essentially, you can use, modify, and sell this dataset or derived products in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, provided you give proper credit.