dm.hr - A Distributional Memory for Croatian

Version: 1.0
Release date: July 27, 2013

1 Description

Distributional memory (Baroni and Lenci, 2010) is a general framework for corpus-based semantics. It represents co-occurrence information as a third-order tensor (a three-dimensional array) of weighted word-link-word tuples, where each tuple carries a score that reflects the strength of the association. Matricization, that is, unfolding the tensor along one of its dimensions, converts it into matrices suited to different semantic tasks.
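To make matricization concrete, here is a minimal Python sketch that unfolds a handful of tuples into a sparse word by link-word matrix and compares two rows by cosine similarity. The tuples, link names, and scores are made up for illustration and are not taken from dm.hr; the function names are our own as well.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy tuples (word1, link, word2, score); NOT real dm.hr data.
TUPLES = [
    ("pas", "subj_of", "lajati", 12.0),
    ("pas", "obj_of", "hraniti", 5.0),
    ("vuk", "obj_of", "hraniti", 4.0),
    ("vuk", "subj_of", "zavijati", 9.0),
    ("auto", "obj_of", "voziti", 15.0),
]


def matricize_w1_by_lw2(tuples):
    """Unfold the tensor into a sparse word1 x (link, word2) matrix.

    Each row is a dict mapping a (link, word2) column to its score.
    """
    rows = defaultdict(dict)
    for w1, link, w2, score in tuples:
        rows[w1][(link, w2)] = score
    return dict(rows)


def cosine(u, v):
    """Cosine similarity of two sparse row vectors stored as dicts."""
    dot = sum(score * v[col] for col, score in u.items() if col in v)
    norm_u = sqrt(sum(score * score for score in u.values()))
    norm_v = sqrt(sum(score * score for score in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


matrix = matricize_w1_by_lw2(TUPLES)
sim_pas_vuk = cosine(matrix["pas"], matrix["vuk"])    # share one column
sim_pas_auto = cosine(matrix["pas"], matrix["auto"])  # share no column
```

In this toy matrix, "pas" and "vuk" share the column ("obj_of", "hraniti") and therefore get a non-zero similarity, while "pas" and "auto" share none. Other unfoldings, such as word-word pairs by links, can be built the same way by choosing a different row and column key.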

dm.hr is a distributional memory for Croatian, compiled from the dependency-parsed Croatian web corpus HrWaC. It covers about 2M lemmas. For details, see the following paper:

Jan Šnajder, Sebastian Padó, Željko Agić (2013). Building and Evaluating a Distributional Memory for Croatian. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia: Association for Computational Linguistics, 784-789.

If you use dm.hr, please cite the paper. The BibTeX entry is:

@InProceedings{snajder2013building,
  title={Building and Evaluating a Distributional Memory for Croatian},
  author={{\v S}najder, Jan and Pad{\'o}, Sebastian and Agi{\'c}, {\v Z}eljko},
  booktitle={Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  year={2013},
  pages={784--789}
}

2 Dataset

The dm.hr tensor is available for download as dm.hr.1.0.txt.gz (1.5 GB compressed, 5.1 GB uncompressed; MD5 checksum: aa9935b9071271e91a109fb5bf541ddb).
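Given the size of the archive, it is worth verifying the download against the MD5 checksum quoted above. The following Python sketch does so with the standard hashlib module, reading the file in chunks so that it never has to fit into memory. The function names and the chunk size are our own choices, and the file path in the usage note is an assumption about where you saved the archive.

```python
import hashlib

# MD5 checksum quoted in this README for dm.hr.1.0.txt.gz.
EXPECTED_MD5 = "aa9935b9071271e91a109fb5bf541ddb"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, is_intact("dm.hr.1.0.txt.gz") should return True for a complete download saved in the current directory; False suggests a truncated or corrupted file.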

The tensor contains 121,763,648 LMI-weighted tuples and 2,268,979 lemmas. It was constructed from fHrWaC, a filtered version of the HrWaC corpus comprising 51M sentences. For preprocessing, we used tagging, lemmatization, and parsing models available from here. For task-based evaluation, we used the synonym choice dataset for Croatian. See the paper for details on tensor construction and evaluation.
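For readers unfamiliar with LMI (local mutual information), the Python sketch below shows one common way to compute it for word-link-word tuples: the observed count multiplied by the logarithm of the ratio between the observed and the expected count, with negative values raised to zero, as in the general distributional memory framework. Estimating the expected count under independence of the three elements is an assumption of this sketch; the exact weighting used for dm.hr is described in the paper. The function name and the toy counts in the usage note are hypothetical.

```python
from collections import Counter
from math import log


def lmi_scores(counts):
    """Compute LMI for each (word1, link, word2) tuple.

    counts maps a tuple to its observed (positive) co-occurrence count.
    Assumption of this sketch: the expected count E is estimated under
    independence of word1, link, and word2, and LMI = O * log(O / E),
    with negative values raised to zero.
    """
    total = sum(counts.values())
    freq_w1, freq_link, freq_w2 = Counter(), Counter(), Counter()
    for (w1, link, w2), observed in counts.items():
        freq_w1[w1] += observed
        freq_link[link] += observed
        freq_w2[w2] += observed

    scores = {}
    for (w1, link, w2), observed in counts.items():
        expected = freq_w1[w1] * freq_link[link] * freq_w2[w2] / total ** 2
        scores[(w1, link, w2)] = max(0.0, observed * log(observed / expected))
    return scores
```

As a usage example, lmi_scores({("a", "l", "x"): 2, ("b", "l", "y"): 2}) assigns both tuples an expected count of 1 and hence the score 2 * log(2). Because the ratio is multiplied by the observed count, LMI favors frequent tuples more strongly than plain pointwise mutual information does.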

3 License

All dm.hr files made available by TakeLab, FER are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

4 Acknowledgment

This work was supported by the Croatian Science Foundation under the grant "02.03/162: Derivational Semantic Models for Information Retrieval".