CroMWEsc - Semantic Compositionality of Croatian Multiword Expressions Dataset

Version: 1.0
Release date: October 9, 2014

1 Description

Determining the semantic compositionality of multiword expressions (MWEs) is important for many natural language processing tasks. We provide a small dataset of Croatian MWEs with human-annotated semantic compositionality scores. The construction of the dataset is described in:

Jan Šnajder and Petra Almić (2015). Modeling Semantic Compositionality of Croatian Multiword Expressions. Informatica, 39 (3), 301-309. [paper]

If you use this dataset for your own work, please cite the above paper. The BibTeX citation is:

@article{snajder2015modeling,
  title={Modeling Semantic Compositionality of Croatian Multiword Expressions},
  author={{\v{S}}najder, Jan and Almi{\'c}, Petra},
  journal={Informatica},
  volume={39},
  number={3},
  pages={301--309},
  year={2015}
}

2 Dataset

The dataset is available as a single archive: TakeLab-CroMWEsc.tar.gz.
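
As a minimal sketch, the contents of the downloaded archive can be listed before unpacking it. The helper name list_archive_members is hypothetical and not part of the dataset; the sketch assumes only that the archive is a gzip-compressed tar file, as its extension suggests, and uses Python's standard tarfile module.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of all members of a gzip-compressed tar archive.

    `archive_path` is the local path of the downloaded archive. This helper
    is illustrative only; it is not distributed with the dataset.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Calling list_archive_members("TakeLab-CroMWEsc.tar.gz") in the directory that holds the download should return the names of the archive members, which can then be extracted with the same module or with a standard tar utility.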

The archive contains a single file listing 200 Croatian multiword expressions annotated with semantic compositionality scores. Twenty of the expressions (marked with "*") were annotated by 24 annotators; the remaining 180 were annotated by 6 annotators. In addition to the median, we provide the mode, mean, and standard deviation of the scores for each expression. Consult the above-mentioned paper for details.
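
To illustrate how the four per-expression statistics relate to the individual annotator judgments, the following sketch computes them for one expression with Python's standard statistics module. The function name summarize_scores and the example scores are invented for illustration, and the integer rating values are an assumption; the actual scale and file layout are described in the paper. The sketch also assumes the sample standard deviation, which may differ from the variant used in the dataset.

```python
import statistics


def summarize_scores(scores):
    """Compute median, mode, mean, and standard deviation of annotator scores.

    `scores` is a list with one numeric judgment per annotator (hypothetical
    input; the released file already contains these aggregated values).
    """
    return {
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation (n - 1 denominator); this choice is an
        # assumption, not something stated in the dataset description.
        "stdev": statistics.stdev(scores),
    }


# Invented judgments from six annotators for a single expression.
example_scores = [4, 5, 4, 3, 4, 5]
summary = summarize_scores(example_scores)
```

With only six judgments per expression, the median and mode are less sensitive to a single outlying annotator than the mean, which may be one reason to consult more than one of the provided statistics.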

3 License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.