Cropinion - Opinion Mining from Croatian User Reviews Dataset

Version: 1.0
Release date: August 8, 2013

1 Description

The Opinion Mining from Croatian User Reviews Dataset is a dataset of user reviews annotated with linguistic data.
The reviews were downloaded from the pauza.hr website. Spelling errors were corrected with GNU Aspell before annotation.
The language of the reviews is simple, informal, and domain-specific.
The dataset accompanies the paper:

Goran Glavaš, Damir Korenčić, Jan Šnajder (2013). Aspect-Oriented Opinion Mining from User Reviews in Croatian.
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),
Sofia: Association for Computational Linguistics, 2013.

The paper describes a method for aspect-based opinion mining.
First, a lexicon of aspects (product features) and opinion clues (aspect attributes) is constructed.
Then, the pairing of aspects and clues is solved as a supervised classification problem.
Finally, prediction of overall review scores is performed using supervised classification and regression.
Among other features, the extracted aspect-clue pairs are used for score prediction.

Should you decide to use the dataset, please cite the paper. The BibTeX format is:

@InProceedings{glavas2013cropinion,
  title={Aspect-Oriented Opinion Mining from User Reviews in Croatian},
  author={Glava{\v s}, Goran and Koren{\v c}i{\' c}, Damir and {\v S}najder, Jan},
  booktitle={51st Annual Meeting of the Association for Computational Linguistics},
  year={2013},
  pages={in press}
}

2 Dataset

The data is organized as follows.

Dictionary

Dictionaries of product aspects and of positive and negative opinion clues.

Original reviews

Preprocessed reviews

Each user review contains the original text, the review score and the URL of the original comment.
The text is segmented into spellchecked sentences, and the sentences are tokenized.
For each token, the following linguistic data is provided:

<Word>token</Word> 
<Lemma>word lemma</Lemma> 
<MolexLemmas> - list of (possible) lemmas constructed by MOLEX
	<string>lemma1</string>
	...
</MolexLemmas> 
<POSTag>part-of-speech tag</POSTag> 
<BasicStem>word stem</BasicStem> 
<MSDs> - list of morphosyntactic descriptors
	<string>descriptor1</string>
	...
</MSDs>
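
As a rough illustration, the per-token fields listed above could be read with Python's standard xml.etree.ElementTree module. This is only a sketch: the wrapper element name "Token" and all sample values are assumptions made for the example, and only the child element names are taken from the description above. The actual files may nest tokens differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the wrapper element "Token" and every value below are
# assumptions for illustration; only the child element names come from the README.
SAMPLE = """
<Token>
  <Word>token</Word>
  <Lemma>lemma</Lemma>
  <MolexLemmas>
    <string>lemma1</string>
    <string>lemma2</string>
  </MolexLemmas>
  <POSTag>tag</POSTag>
  <BasicStem>stem</BasicStem>
  <MSDs>
    <string>descriptor1</string>
  </MSDs>
</Token>
"""


def parse_token(elem):
    """Collect the per-token fields of one element into a plain dictionary."""

    def strings(tag):
        # Return the text of every <string> child under the given list element.
        parent = elem.find(tag)
        if parent is None:
            return []
        return [s.text or "" for s in parent.findall("string")]

    return {
        "word": elem.findtext("Word", default=""),
        "lemma": elem.findtext("Lemma", default=""),
        "molex_lemmas": strings("MolexLemmas"),
        "pos_tag": elem.findtext("POSTag", default=""),
        "basic_stem": elem.findtext("BasicStem", default=""),
        "msds": strings("MSDs"),
    }


token = parse_token(ET.fromstring(SAMPLE.strip()))
print(token["word"], token["molex_lemmas"], token["msds"])
```

Missing elements fall back to an empty string or an empty list, so the same function can be applied to tokens that lack some of the fields.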

The sequence of tagged words is followed by a sequence of dependency relations. Each relation contains the following linguistic data:

<DependencyRelation>
    <Governor>
        <Word>governor word</Word>
	[same data as for sentence tokens]
    </Governor>
    <Dependent>
        <Word>dependent word</Word>
	[same data as for sentence tokens]
    </Dependent>
    <Relation>type of dependency relation</Relation>
</DependencyRelation>
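
A dependency relation of the form shown above could be reduced to a (governor, relation, dependent) triple along the following lines. This is a minimal sketch using Python's standard library; the sample values are placeholders rather than data from the dataset, and the function name parse_relation is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder fragment that follows the element names described in the README;
# the values are invented for illustration only.
SAMPLE = """
<DependencyRelation>
  <Governor>
    <Word>governor word</Word>
    <Lemma>governor lemma</Lemma>
  </Governor>
  <Dependent>
    <Word>dependent word</Word>
    <Lemma>dependent lemma</Lemma>
  </Dependent>
  <Relation>relation type</Relation>
</DependencyRelation>
"""


def parse_relation(elem):
    """Return (governor word, relation type, dependent word) for one relation."""
    return (
        elem.findtext("Governor/Word", default=""),
        elem.findtext("Relation", default=""),
        elem.findtext("Dependent/Word", default=""),
    )


relation = parse_relation(ET.fromstring(SAMPLE.strip()))
print(relation)
```

The remaining governor and dependent fields (lemma, part-of-speech tag, and so on) can be read the same way with paths such as "Governor/Lemma".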

For more details, please see the references provided at the beginning of Section 3 of the paper.

Annotated pairs

This folder contains reviews with manually annotated aspect-clue pairs.
The data is used for training and testing the pairing classifier.
For each sentence, all possible pairs are listed. Where a pairing exists, the value of 'link' is set to '+'.

3 License

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.