fHrWaC is a filtered version of hrWaC, the Croatian web corpus compiled by Ljubešić and Erjavec (2011). In fHrWaC, much of the non-textual content (e.g., code snippets and formatting structure), encoding errors, and foreign-language content has been removed. fHrWaC is suitable for NLP tasks in which linguistic quality is more important than coverage (e.g., parsing).
The filtering was done heuristically on a per-document and per-sentence basis. The exact parameter settings of the filtering procedure can be found in the source code (see below). For details, please refer to the following paper:
Jan Šnajder, Sebastian Padó, Željko Agić (2013). Building and Evaluating a Distributional Memory for Croatian. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia: Association for Computational Linguistics, 784-789.
If you use fHrWaC, please cite the paper. The BibTeX entry is:
@InProceedings{snajder2013building,
title={Building and Evaluating a Distributional Memory for Croatian},
author={{\v S}najder, Jan and Pad{\'o}, Sebastian and Agi{\'c}, {\v Z}eljko},
booktitle={51st Annual Meeting of the Association for Computational Linguistics},
year={2013},
pages={784-789}
}
fHrWaC is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
To compile the filter with GHC, issue:
ghc --make hrwac-filter.hs
To filter a tokenized and sentence-segmented file, issue:
./hrwac-filter input.txt > output.txt
The input file needs to be tokenized, one sentence per line, with documents enclosed within <document ...> ... </document> tags.
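The expected input layout can be illustrated with a small sample file. The sketch below is not part of the fHrWaC distribution: the file name sample-input.txt, the id attribute, and the placement of each tag on its own line are assumptions made for illustration, since the description above only specifies <document ...> ... </document>. It writes two short documents and then checks that opening and closing tags are balanced.

```shell
# Hypothetical sample input: two documents, one tokenized sentence per line.
# The `id` attribute and one-tag-per-line layout are assumptions, not taken
# from the fHrWaC source.
cat > sample-input.txt <<'EOF'
<document id="1">
Ovo je prva rečenica .
Ovo je druga rečenica .
</document>
<document id="2">
Još jedna rečenica .
</document>
EOF

# Count opening and closing document tags; the two counts should match.
opened=$(grep -c '^<document[ >]' sample-input.txt)
closed=$(grep -c '^</document>' sample-input.txt)
echo "documents opened: $opened, closed: $closed"
```

If the two counts agree, a file laid out this way can be passed to ./hrwac-filter in place of input.txt.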