Event-Centered Simplification of News

Version: 1.0
Release date: August 13, 2013

1 Description

Newswire texts are difficult to comprehend for non-native speakers, people with low literacy or intellectual disabilities, and language-impaired people (e.g., autistic, aphasic, or congenitally deaf readers). Making news equally accessible to people with reading disabilities helps their integration into society. This work proposes an event-centered simplification of news stories that performs content reduction at the semantic level, unlike most previous approaches to text simplification, which rely predominantly on simplifications at the lexical or syntactic level. The datasets provided here accompany the following publication:
Glavaš G., Štajner S.: Event-Centered Simplification of News Stories.
Proceedings of the Student Research Workshop at the International Conference on Recent Advances in Natural Language Processing (RANLP 2013),
Hissar, Bulgaria (7-13 September 2013).

If you use the datasets, please cite the publication. The BibTeX entry is:

@inproceedings{glavas2013recognizing,
  title={Event-Centered Simplification of News Stories},
  author={Glava\v{s}, Goran and {\v{S}}tajner, Sanja},
  booktitle={Proceedings of the Student Research Workshop at the International Conference on Recent Advances in Natural Language Processing {(RANLP 2013)}},
  month = {7-13 September},
  year={2013},
  address = {Hissar, Bulgaria},
}

2 Datasets

The archive with two datasets can be downloaded from here. The two available datasets are as follows:
1. Document-level dataset used for automated evaluation of the readability of simplified text (DocumentsReadability.rar)
2. Sentence-level dataset with human judgements of grammaticality, meaning, and simplicity (SentencesHumanAnnotation.rar)

The document-level dataset consists of 100 news stories. Besides the original text, for each story we provide four simplified versions, corresponding to the baseline and the three proposed simplification schemes (baseline, sentence, basic-event-wise, and entitycoref-pronominal-anaphora).

The sentence-level dataset contains 280 pairs of original and simplified sentences, annotated with human scores for the grammaticality, meaning preservation, and simplicity of the simplified version. Each pair of original and simplified text comes with the following information:
1. Pair ID (not relevant)
2. Group ID (not relevant)
3. Method with which the simplification was obtained (B - baseline, S - sentence-wise, E - event-wise, C - pronominal anaphora)
4. The original text
5. The simplified text
6. Grammaticality score (assigned by human)
7. Score for meaning preservation (assigned by human)
8. Score for simplicity (assigned by human)
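
The eight fields listed above can be held in a small record type once a file has been read. The following Python sketch is only an illustration and is not part of the distributed archive: the class name SentencePair, its attribute names, and the helper mean_scores_by_method are hypothetical, and the sketch assumes the records have already been parsed, since the on-disk file format is not described here. Scores are stored as plain numbers because the rating scale is not stated in this document.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Method codes as listed in field 3 of the sentence-level dataset.
METHODS = {
    "B": "baseline",
    "S": "sentence-wise",
    "E": "event-wise",
    "C": "pronominal anaphora",
}


@dataclass
class SentencePair:
    """One annotated pair; attribute names are hypothetical, order follows the field list."""
    pair_id: str
    group_id: str
    method: str            # one of the keys of METHODS
    original: str
    simplified: str
    grammaticality: float  # human-assigned score
    meaning: float         # human-assigned score for meaning preservation
    simplicity: float      # human-assigned score


def mean_scores_by_method(pairs):
    """Average the three human scores separately for each simplification method."""
    grouped = defaultdict(list)
    for pair in pairs:
        if pair.method not in METHODS:
            raise ValueError(f"unknown method code: {pair.method!r}")
        grouped[pair.method].append(pair)
    return {
        method: {
            "grammaticality": mean(p.grammaticality for p in group),
            "meaning": mean(p.meaning for p in group),
            "simplicity": mean(p.simplicity for p in group),
        }
        for method, group in grouped.items()
    }
```

Given a list of SentencePair objects, mean_scores_by_method(pairs) returns one dictionary of averaged scores per method code, which allows a quick comparison of the four methods against each other.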

3 License

Datasets for Event-Centered Simplification of News Stories by TakeLab (in collaboration with the University of Wolverhampton) are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.