DerivBase.hr - A Large-Coverage Derivational Morphology Resource for Croatian
Version: 1.0
Release date: October 24, 2013
1 Description
DerivBase.hr is a large-coverage morphological resource for Croatian that groups lemmas into clusters of derivationally related lemmas. The resource covers 98K lemmas acquired from hrWaC, a large web corpus of Croatian compiled by Ljubešić and Erjavec (2011). DerivBase.hr is inspired by DErivBase, a similar resource for German.
For details, please check out the following paper:
Jan Šnajder.
DerivBase.hr: A High-Coverage Derivational Morphology Resource for
Croatian. In Proceedings of the 9th Language Resources and Evaluation
Conference (LREC 2014). In press.
2 Dataset
Download DerivBase.hr from here: DerivBase.hr.v1.tar.gz (~670KB, MD5 checksum: f70559675d396faf1998e4a013b1dfcc).
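To confirm that the download is intact, you can compare the archive's MD5 digest against the checksum given above. The following is a minimal sketch using only the Python standard library; the helper name md5_of_file and the assumption that the archive sits in the current directory are illustrative and not part of the release.

```python
import hashlib
from pathlib import Path

# Checksum as published above for DerivBase.hr.v1.tar.gz.
EXPECTED_MD5 = "f70559675d396faf1998e4a013b1dfcc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumption: the archive was saved to the current directory.
    archive = Path("DerivBase.hr.v1.tar.gz")
    if archive.exists():
        ok = md5_of_file(archive) == EXPECTED_MD5
        print("checksum OK" if ok else "checksum MISMATCH")
    else:
        print("archive not found; download it first")
```

A mismatch usually means an incomplete or corrupted download, in which case downloading the archive again is the simplest remedy.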
The archive contains two versions of DerivBase.hr: (1) a knowledge-based version induced using derivational patterns based on a HOFM grammar for Croatian and (2) an unsupervised version induced using string-similarity clustering.
We recommend that you use the knowledge-based version because its quality is higher.
The knowledge-based version contains 55K derivational clusters (an average of 1.8 lemmas per cluster), of which 15K (27%) are non-singleton clusters. The quality of the knowledge-based version, as measured on a manually annotated sample of lemma pairs, is 80.8% precision (P) and 76.6% recall (R).
The annotated sample of 2000 lemma pairs, which was used for the evaluation, is included in the archive.
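As an illustration of how precision and recall can be computed over labeled lemma pairs, here is a minimal sketch of a pairwise evaluation. It is not the evaluation script used for the paper, and it does not read the files in the archive: the in-memory data layout (clusters as sets of lemmas, pairs as (lemma, lemma, related) tuples), the function name, and the toy lemmas are all assumptions made for the example. A pair counts as predicted positive when both lemmas fall into the same cluster.

```python
def pairwise_precision_recall(clusters, labeled_pairs):
    """Score clusters against lemma pairs labeled as related or not.

    clusters: iterable of sets of lemmas (hypothetical in-memory format).
    labeled_pairs: iterable of (lemma_a, lemma_b, related) tuples,
        where related is True if annotators judged the pair
        derivationally related.
    """
    # Map every lemma to the index of the cluster that contains it.
    cluster_of = {}
    for index, cluster in enumerate(clusters):
        for lemma in cluster:
            cluster_of[lemma] = index

    tp = fp = fn = 0
    for lemma_a, lemma_b, related in labeled_pairs:
        id_a = cluster_of.get(lemma_a)
        id_b = cluster_of.get(lemma_b)
        same_cluster = id_a is not None and id_a == id_b
        if same_cluster and related:
            tp += 1
        elif same_cluster and not related:
            fp += 1
        elif related:
            fn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data, invented for this example only.
toy_clusters = [
    {"pisati", "pisac"},
    {"pismo"},
    {"čitati", "čitatelj"},
    {"voda", "vodič"},
]
toy_pairs = [
    ("pisati", "pisac", True),      # same cluster, related: true positive
    ("pisati", "pismo", True),      # split apart, related: false negative
    ("čitati", "čitatelj", True),   # same cluster, related: true positive
    ("voda", "vodič", False),       # same cluster, unrelated: false positive
    ("pisac", "čitatelj", False),   # split apart, unrelated: true negative
]

precision, recall = pairwise_precision_recall(toy_clusters, toy_pairs)
print(f"P={precision:.3f} R={recall:.3f}")
```

Lemmas missing from every cluster are treated as unclustered here, so a related pair involving such a lemma counts against recall; a different convention would change the scores.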
3 License
All DerivBase.hr files made available by TakeLab, FER are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
4 Acknowledgment
This work was supported by the Croatian Science Foundation under the grant "02.03/162: Derivational Semantic Models for Information Retrieval".