PreTox — a Dataset for Preemptive Toxic Comment Detection on Wikipedia

Version: 1.0
Release date: September 24, 2019

This is the dataset used in the following paper:

Mladen Karan and Jan Šnajder (2019). Preemptive Toxic Language Detection in Wikipedia Comments Using Thread-Level Context. In Proceedings of the Third Workshop on Abusive Language Online, pages 129–134, Florence. Association for Computational Linguistics.

If you use this dataset for your own work, please cite the above paper. The BibTeX citation is:

@inproceedings{karan2019preemptive,
  title={Preemptive Toxic Language Detection in Wikipedia Comments Using Thread-Level Context},
  author={Karan, Mladen and {\v{S}}najder, Jan},
  booktitle={Proceedings of the Third Workshop on Abusive Language Online},
  pages={129--134},
  year={2019},
  address={Florence},
  publisher={Association for Computational Linguistics}
}

The dataset is distributed as a single gzip-compressed tar archive: pretox-wiki-data.tar.gz.
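
Once downloaded, the archive can be unpacked with any standard tar tool. The following is a minimal Python sketch using only the standard library. The internal layout of the archive is not described in this README, so the sketch simply extracts everything and lists the files it finds. The function name unpack_dataset and the default target directory are illustrative assumptions, not part of the dataset release.

```python
import tarfile
from pathlib import Path


def unpack_dataset(archive_path, target_dir="pretox-wiki-data"):
    """Extract a .tar.gz archive and return the sorted paths of the extracted files.

    Both the function name and the default target_dir are hypothetical;
    the README does not prescribe where the data should be unpacked.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Use the safer "data" extraction filter where this Python version supports it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(path=target, filter="data")
        else:
            archive.extractall(path=target)
    return sorted(p for p in target.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumes the archive has been downloaded to the current directory.
    for path in unpack_dataset("pretox-wiki-data.tar.gz"):
        print(path)
```

Listing the extracted files first is a quick way to see which files the release actually contains before writing any loading code.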

License


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.