WikiWarsHr - Croatian Temporally Tagged Corpus of Historical Narratives

Version: 1.0
Release date: October 9, 2014

1 Description

WikiWarsHr is a corpus of historical narratives taken from the Croatian Wikipedia and temporally tagged with TIMEX3. The corpus consists of 22 articles, with 59,915 non-punctuation tokens and 1,440 tagged temporal expressions. WikiWarsHr is inspired by WikiWars, a similar resource for English.

For details, please check the following paper:

Skukan, L., Glavaš, G., Šnajder, J. (2014). HeidelTime.Hr: Extracting and Normalizing Temporal Expressions in Croatian. In Proceedings of the Ninth Language Technologies Conference, Ljubljana. Information Society, 99-103.

If you use this dataset for your own work, please cite the above paper. The BibTeX citation is:

@inproceedings{skukan2014heideltimehr,
  title={HeidelTime.Hr: Extracting and Normalizing Temporal Expressions in Croatian},
  author={Skukan, Luka and Glava\v{s}, Goran and {\v{S}}najder, Jan},
  booktitle={Proceedings of the Ninth Language Technologies Conference},
  pages={99--103},
  year={2014},
  organization={Information Society}
}

2 Dataset

The dataset is distributed as a single archive: TakeLab-WikiWarsHr.tar.gz.

The archive contains two directories, each holding 22 files. The in/ directory contains the untagged articles as .sgm files. The keyinline/ directory contains the tagged versions of the same articles.
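
As a quick sanity check after unpacking, the tagged files can be scanned for temporal expressions. The following is a minimal sketch, not part of the distribution. It assumes that the annotations in keyinline/ appear as inline <TIMEX3 ...>...</TIMEX3> elements in the usual TimeML style and that the files are UTF-8 encoded; the function names extract_timex3 and count_timex3_in_dir are hypothetical helpers introduced here for illustration.

```python
import re
from pathlib import Path

# Assumption: annotations are inline XML-style TIMEX3 elements (TimeML style).
TIMEX3_RE = re.compile(r"<TIMEX3\b([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
# Matches attribute pairs of the form name="value" inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def extract_timex3(text):
    """Return (attributes, surface_text) pairs, one per TIMEX3 element."""
    return [
        (dict(ATTR_RE.findall(attrs)), body)
        for attrs, body in TIMEX3_RE.findall(text)
    ]


def count_timex3_in_dir(directory, encoding="utf-8"):
    """Map each file name in a directory to its number of TIMEX3 elements.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    counts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            text = path.read_text(encoding=encoding)
            counts[path.name] = len(extract_timex3(text))
    return counts
```

Calling count_timex3_in_dir on the unpacked keyinline/ directory gives a per-file count, and summing the values gives a corpus total that can be compared with the figure in the Description section. A regular expression is used instead of an XML parser because the .sgm files may not be well-formed XML.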

3 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.