Determining the semantic compositionality of multiword expressions (MWEs) is important for many tasks in natural language processing. We provide a small dataset of Croatian MWEs with human-annotated semantic compositionality scores. The construction of the dataset is described in:
Jan Šnajder and Petra Almić (2015). Modeling Semantic Compositionality of Croatian Multiword Expressions. Informatica, 39 (3), 301-309. [paper]
If you use this dataset for your own work, please cite the above paper. The BibTeX citation is:
@article{snajder2015modeling,
  title={Modeling Semantic Compositionality of Croatian Multiword Expressions},
  author={{\v{S}}najder, Jan and Almi{\'c}, Petra},
  journal={Informatica},
  volume={39},
  number={3},
  pages={301-309},
  year={2015}
}
The dataset is available here: TakeLab-CroMWEsc.tar.gz.
The archive contains a single file listing 200 Croatian multiword expressions annotated with semantic compositionality scores. Twenty of the expressions (marked with "*") were annotated by 24 annotators; the remaining 180 were annotated by 6 annotators. In addition to the median, we provide the mode, mean, and standard deviation of the scores for each expression. Consult the above-mentioned paper for details.
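To illustrate the per-expression statistics listed above, here is a minimal Python sketch that computes them for one expression from its annotator scores. The function name summarize and the six example scores are hypothetical, not taken from the dataset, and the use of the sample standard deviation is an assumption; the paper describes how the published values were actually computed.

```python
import statistics


def summarize(scores):
    """Return the median, mode, mean, and standard deviation of a list of scores."""
    return {
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "mean": statistics.mean(scores),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "stdev": statistics.stdev(scores),
    }


# Hypothetical scores from six annotators for a single expression.
example_scores = [2, 3, 3, 4, 3, 2]
summary = summarize(example_scores)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The median and mode are less sensitive to a single outlying annotator than the mean, which matters most for the expressions that have only 6 annotations.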