DerivBase.hr - A Large-Coverage Derivational Morphology Resource for Croatian

Version: 1.0
Release date: October 24, 2013

1 Description

DerivBase.hr is a large-coverage morphological resource for Croatian that groups lemmas into clusters of derivationally related lemmas. The resource covers 98K lemmas acquired from hrWaC, a large web corpus of Croatian compiled by Ljubešić and Erjavec (2011). DerivBase.hr is inspired by DErivBase, a similar resource for German.

For details, please check out the following paper:

Jan Šnajder. DerivBase.hr: A High-Coverage Derivational Morphology Resource for Croatian. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014). In press.

2 Dataset

Download DerivBase.hr: DerivBase.hr.v1.tar.gz (~670 KB, MD5 checksum: f70559675d396faf1998e4a013b1dfcc).
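
To confirm that the download is not corrupted, the archive's MD5 digest can be compared with the checksum given above. The following is a minimal sketch using only the Python standard library; the local file path passed to the functions is an assumption about where you saved the archive, and the helper names are illustrative.

```python
import hashlib

# MD5 checksum published on this page for DerivBase.hr.v1.tar.gz
EXPECTED_MD5 = "f70559675d396faf1998e4a013b1dfcc"


def md5_of_file(path, chunk_size=1 << 16):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, archive_is_intact("DerivBase.hr.v1.tar.gz") should return True for an uncorrupted copy saved in the current directory. MD5 is suitable here only as an integrity check against transfer errors, not as a security guarantee.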

The archive contains two versions of DerivBase.hr: (1) a knowledge-based version induced using derivational patterns based on an HOFM grammar for Croatian and (2) an unsupervised version induced using string-similarity clustering. We recommend the knowledge-based version because its quality is higher. The knowledge-based version contains 55K derivational clusters (on average 1.8 lemmas per cluster), of which 15K (27%) are non-singleton clusters. The quality of the knowledge-based version, measured on a manually annotated sample of lemma pairs, is P=80.8% and R=76.6%. The annotated sample of 2000 lemma pairs used for the evaluation is included in the archive.
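
As an illustration of how precision and recall over lemma pairs can be computed from a clustering, here is a minimal sketch. It assumes that a pair counts as predicted-related when both lemmas fall in the same cluster and that each annotated pair carries a yes/no human judgment; the exact evaluation protocol and the file formats in the archive are not described on this page, so the data structures, function names, and toy lemmas below are hypothetical and not taken from the resource.

```python
def build_cluster_index(clusters):
    """Map each lemma to the index of the cluster that contains it."""
    index = {}
    for cluster_id, cluster in enumerate(clusters):
        for lemma in cluster:
            index[lemma] = cluster_id
    return index


def pairwise_precision_recall(clusters, annotated_pairs):
    """Compute precision and recall over annotated lemma pairs.

    `annotated_pairs` holds (lemma_a, lemma_b, is_related) triples, where
    is_related is the human judgment (assumed format). A pair is predicted
    as related when both lemmas are in the same cluster.
    """
    index = build_cluster_index(clusters)
    tp = fp = fn = 0
    for lemma_a, lemma_b, gold in annotated_pairs:
        predicted = (
            lemma_a in index
            and lemma_b in index
            and index[lemma_a] == index[lemma_b]
        )
        if predicted and gold:
            tp += 1
        elif predicted and not gold:
            fp += 1
        elif gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from DerivBase.hr.
    toy_clusters = [{"pisati", "pisac", "pisanje"}, {"raditi", "radnik"},
                    {"rad"}, {"rat", "ratar"}]
    toy_pairs = [("pisati", "pisac", True), ("raditi", "radnik", True),
                 ("raditi", "rad", True), ("rat", "ratar", False),
                 ("pisac", "radnik", False)]
    print(pairwise_precision_recall(toy_clusters, toy_pairs))
```

Applied to the figures reported above, f1_score(0.808, 0.766) gives a harmonic mean of roughly 0.786; this value is derived here from the stated P and R and is not quoted from the source.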

3 License

All DerivBase.hr files made available by TakeLab, FER are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

4 Acknowledgment

This work was supported by the Croatian Science Foundation under the grant "02.03/162: Derivational Semantic Models for Information Retrieval".