fHrWaC - Filtered Croatian Web Corpus (hrWaC)

Version: 1.0
Release date: July 27, 2013

1 Description

fHrWaC is a filtered version of hrWaC, the Croatian web corpus compiled by Ljubešić and Erjavec (2011). In fHrWaC, much of the non-textual content (e.g., code snippets and formatting structure), encoding errors, and foreign-language content has been removed. fHrWaC is suitable for NLP tasks in which linguistic quality is more important than coverage (e.g., parsing).

The filtering was done heuristically on a per-document and per-sentence basis. The exact parameter settings of the filtering procedure can be read from the source code (see Section 3). For details, please refer to the following paper:

Jan Šnajder, Sebastian Padó, Željko Agić (2013). Building and Evaluating a Distributional Memory for Croatian. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia: Association for Computational Linguistics, 784-789.

Should you decide to use fHrWaC, please cite the paper. The BibTeX entry is:

@InProceedings{snajder2013building,
title={Building and Evaluating a Distributional Memory for Croatian},
author={{\v S}najder, Jan and Pad{\'o}, Sebastian and Agi{\'c}, {\v Z}eljko},
booktitle={51st Annual Meeting of the Association for Computational Linguistics},
year={2013},
pages={784-789}
}

2 Dataset

Download fHrWaC here: fhrwac-parsed.conll.zip (9.3 GB compressed, MD5 checksum: a22f43a392064a3c85586a4c9abdd0ca; 40 GB uncompressed, MD5 checksum: ide4e996cba9352ffdce5d4025ff8080e).
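
Because the archive is large, it may be worth checking the download against the published MD5 checksum before unpacking. The following is a minimal Python sketch, not part of the fHrWaC distribution; the function names and the local file path in the usage note are assumptions made for illustration. The file is read in chunks so that it never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Assuming the archive was saved in the current directory, matches_checksum("fhrwac-parsed.conll.zip", "a22f43a392064a3c85586a4c9abdd0ca") should return True for an intact download.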

fHrWaC is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

3 Filtering tool

You can download the filtering tool from here: hrwac-filter.hs (Haskell source code). To compile the source code, you need GHC (the Glasgow Haskell Compiler), which you can get as a part of the Haskell Platform. Then issue:
ghc --make hrwac-filter.hs
To filter a tokenized and sentence-segmented file, issue:
./hrwac-filter input.txt > output.txt
The input file needs to be tokenized, one sentence per line, with documents enclosed within <document ...> ... </document> tags.
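
As an illustration of the expected input layout, the following Python sketch writes already tokenized documents in the format described above. It is not part of the filtering tool. The function name, the id attribute on the document tag, and the use of single spaces between tokens are assumptions; the description above only requires one tokenized sentence per line inside <document ...> ... </document> tags.

```python
def format_for_filter(documents):
    """Render documents as text, one sentence per line inside document tags.

    `documents` is an iterable of (doc_id, sentences) pairs, where each
    sentence is a list of token strings.
    """
    lines = []
    for doc_id, sentences in documents:
        # The "id" attribute is hypothetical; the tool's README only shows
        # "<document ...>" without naming the attributes.
        lines.append('<document id="%s">' % doc_id)
        for tokens in sentences:
            # Assumption: tokens are separated by single spaces.
            lines.append(" ".join(tokens))
        lines.append("</document>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("1", [["Ovo", "je", "rečenica", "."], ["Još", "jedna", "."]])]
    with open("input.txt", "w", encoding="utf-8") as out:
        out.write(format_for_filter(sample))
```

The resulting input.txt can then be passed to ./hrwac-filter as shown above.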