CroSemRel450 -- Word Semantic Relatedness Dataset for Croatian

Version: 1.0
Release date: December 16, 2013

1 Description

Determining the semantic relatedness of words is a standard task in distributional semantics. We provide a dataset for the semantic relatedness task for the Croatian language. The dataset is described in:

Janković, V., Šnajder, J., Dalbelo Bašić, B. (2011). Random Indexing Distributional Semantic Models for Croatian Language. Lecture Notes in Artificial Intelligence (Third Int. Workshop on Balto-Slavonic Natural Language Processing), 6836, 411–418.

If you use this dataset for your own work, please cite the above paper. The BibTeX citation is:

@inproceedings{jankovic2011random,
  title={Random indexing distributional semantic models for Croatian language},
  author={Jankovi{\'c}, Vedrana and {\v{S}}najder, Jan and Ba{\v{s}}i{\'c}, Bojana Dalbelo},
  booktitle={Text, Speech and Dialogue},
  pages={411--418},
  year={2011},
  organization={Springer}
}

2 Dataset

The dataset is available as the archive TakeLab-CroSemRel450.tar.gz. Two data files are provided:

CroSemRel450-12.txt

CroSemRel450-6.txt

Both files contain a list of 450 word pairs and the average similarity scores assigned by human annotators. The first file contains scores averaged over 12 annotators. The second file contains scores averaged over a subset of 6 annotators for which the observed agreement was higher. Consult the above-mentioned paper for details.
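
As an illustration of how the files might be used, here is a minimal Python sketch that reads word pairs and compares two score lists by rank correlation. The line layout (word1, word2, score, separated by whitespace), the UTF-8 encoding, and the sample rows are assumptions made for this example and are not confirmed by the dataset documentation; check the actual files before relying on it. Spearman correlation is used here only as a common choice for relatedness evaluation, not as the procedure from the paper. The helper names read_pairs, average_ranks, and spearman are hypothetical.

```python
import io


def read_pairs(lines):
    """Parse lines assumed to look like: word1 word2 score (whitespace-separated)."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines or lines that do not match the assumed layout
        word1, word2, score = fields
        pairs.append((word1, word2, float(score)))
    return pairs


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up rows standing in for a real file; with the real data one would use
# open("CroSemRel450-12.txt", encoding="utf-8") instead (encoding assumed).
sample = io.StringIO("pas mačka 3.5\nauto vozilo 4.8\nknjiga more 0.4\n")
pairs = read_pairs(sample)
human_scores = [score for _, _, score in pairs]

# Hypothetical model scores for the same three pairs, in the same order.
model_scores = [0.61, 0.83, 0.05]
correlation = spearman(human_scores, model_scores)
```

The same spearman function could be applied to the scores from CroSemRel450-12.txt and CroSemRel450-6.txt to see how closely the two annotator averages agree, provided both files list the pairs in the same order.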

3 License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.