Datasets for Recognizing Identical Events using Graph Kernels
Version: 1.0
Release date: May 15, 2013
1 Description
Identifying news stories that discuss the same real-world event is important for news tracking and retrieval.
Most existing approaches rely on the traditional vector space model. Here we provide datasets for recognizing identical real-world events based on a structured, event-oriented document representation. Documents are structured as graphs of event mentions, and graph kernels are used to measure the similarity between document pairs.
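To illustrate the idea of comparing documents as graphs, here is a minimal sketch of a simple graph kernel. It is not the kernel implementation from the publication: it only counts the vertices and edges of the direct product graph, i.e. label-matched walks of length 0 and 1. The dictionary-based graph layout, the function names, the event labels, and the "before" relation are all assumptions made for illustration.

```python
from math import sqrt


def make_graph(vertices, edges):
    """Hypothetical toy graph: vertices map an id to an event label,
    edges are (source, target, relation) triples."""
    return {"vertices": dict(vertices), "edges": set(edges)}


def product_kernel(g1, g2):
    """Count vertices and edges of the direct product graph of g1 and g2."""
    v1, v2 = g1["vertices"], g2["vertices"]
    # Vertex pairs whose event labels are identical.
    pairs = {(a, b) for a in v1 for b in v2 if v1[a] == v2[b]}
    # Edge pairs whose relations agree and whose endpoints are both matched.
    edge_count = sum(
        1
        for (a1, a2, r1) in g1["edges"]
        for (b1, b2, r2) in g2["edges"]
        if r1 == r2 and (a1, b1) in pairs and (a2, b2) in pairs
    )
    return len(pairs) + edge_count


def normalized_kernel(g1, g2):
    """Scale the kernel value to [0, 1] so documents of different size compare fairly."""
    denom = sqrt(product_kernel(g1, g1) * product_kernel(g2, g2))
    return product_kernel(g1, g2) / denom if denom else 0.0


# Two made-up documents; labels and relations are illustrative only.
doc_a = make_graph({1: "attack", 2: "arrest"}, [(1, 2, "before")])
doc_b = make_graph(
    {1: "attack", 2: "arrest", 3: "trial"},
    [(1, 2, "before"), (2, 3, "before")],
)
similarity = normalized_kernel(doc_a, doc_b)
```

A document pair that shares more event mentions and relations between them receives a higher score; the publication describes the kernels actually used in the experiments.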
The two datasets correspond to two experiments described in the following publication:
Glavaš G., Šnajder J.: Recognizing Identical Events with Graph Kernels.
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013),
pp. 1-7, Sofia, Bulgaria (4-9 August 2013).
If you use the dataset, please cite the publication.
The BibTeX entry is:
@inproceedings{glavas2013recognizing,
  title = {Recognizing Identical Events with Graph Kernels},
  author = {Glava\v{s}, Goran and {\v{S}}najder, Jan},
  booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics {(ACL 2013)}},
  month = {4-9 August},
  year = {2013},
  address = {Sofia, Bulgaria},
  organization = {Association for Computational Linguistics},
  pages = {1--7}
}
2 Datasets
The first dataset consists of 10 clusters of news stories, each cluster representing a distinct real-world event. We refer to this dataset as the dataset for recognizing identical events.
The second dataset consists of 10 news stories, each of which we altered to obtain two meaning-preserving (event-preserving) and two meaning-changing (event-shifting) paraphrases. We refer to this dataset as the dataset for event similarity ranking.
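One way to use the event similarity ranking dataset is to check whether a similarity measure scores the event-preserving paraphrases of a story above its event-shifting ones. The sketch below assumes this pairwise check; it is not necessarily the evaluation measure used in the publication, and the function name and the scores are hypothetical placeholders.

```python
def pairwise_ranking_accuracy(preserving_scores, shifting_scores):
    """Fraction of (preserving, shifting) pairs in which the
    event-preserving paraphrase gets the strictly higher similarity."""
    pairs = [(p, s) for p in preserving_scores for s in shifting_scores]
    if not pairs:
        raise ValueError("need at least one score of each kind")
    correct = sum(1 for p, s in pairs if p > s)
    return correct / len(pairs)


# Arbitrary placeholder similarities between one original story and its
# four paraphrases; these are NOT results from the paper.
preserving = [0.9, 0.6]
shifting = [0.7, 0.2]
accuracy = pairwise_ranking_accuracy(preserving, shifting)
```

Averaging this value over the 10 stories would give a single score for a similarity measure under the stated assumption.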
3 License
This dataset is available under a derivative of the BSD license that requires proper attribution. In essence, you may use, modify, and sell the dataset or derived products in academic and non-academic, commercial and non-commercial, open-source and closed-source settings, provided that you give proper credit.