The dataset contains preprocessed news articles, a list of feed URLs from which
the articles were downloaded, and a list of themes describing article topics.
The dataset accompanies the paper:
Damir Korenčić, Strahil Ristov, Jan Šnajder (2015). Getting the Agenda Right: Measuring Media Agenda using Topic Models.
In Proceedings of the Topic Models: Post-Processing and Applications Workshop (TM’15),
October 19, 2015, Melbourne, Australia.
The paper describes a method for measuring the media agenda (a set of issues of interest).
First, a number of topic models are trained on the corpus articles, and their topics are labeled with themes:
conceptual topics that facilitate issue definition and measurement.
Next, a list of seed words is defined for each issue, and customized topic models
with topics corresponding to the issues are built and used to tag documents with issues.
Finally, experiments with document taggers are performed. The experiments evaluate tagging
performance and compare taggers based on topic models with supervised taggers.
If you use the dataset, please cite the paper. The BibTeX entry is:
@InProceedings{korencic2015agenda,
title={Getting the Agenda Right: Measuring Media Agenda using Topic Models},
author={Koren{\v c}i{\' c}, Damir and Ristov, Strahil and {\v S}najder, Jan},
booktitle={Topic Models: Post-Processing and Applications Workshop},
year={2015},
pages={in press}
}
Each article in the corpus has the following fields:
ID: The corpus-unique article ID
TITLE: Original title from the news outlet
URL: URL used to download the article
DATE SAVED: Timestamp of when the article was saved to the database, in Central European Time
STEMMED WORDS: The list of word stems produced by corpus preprocessing
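
The fields above could be read into a simple record type along the following lines. This is only a sketch: the field names come from the list above, but the on-disk layout is not described here, so the "FIELD: value" line format, the timestamp format string, and the whitespace-separated stems are all assumptions, and Article and parse_article are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Field names are taken from the field list; everything about the
# storage layout below is an assumption, not a documented format.
FIELDS = ("ID", "TITLE", "URL", "DATE SAVED", "STEMMED WORDS")


@dataclass
class Article:
    id: str
    title: str
    url: str
    date_saved: datetime  # naive timestamp, understood as Central European Time
    stems: List[str]


def parse_article(text: str, date_format: str = "%Y-%m-%d %H:%M:%S") -> Article:
    """Parse one article given as 'FIELD: value' lines (assumed layout)."""
    values = {}
    for line in text.splitlines():
        # Split on the first colon only, so URLs and titles keep their colons.
        name, sep, value = line.partition(":")
        if sep and name.strip() in FIELDS:
            values[name.strip()] = value.strip()
    missing = [f for f in FIELDS if f not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return Article(
        id=values["ID"],
        title=values["TITLE"],
        url=values["URL"],
        date_saved=datetime.strptime(values["DATE SAVED"], date_format),
        # Assumes stems are separated by whitespace.
        stems=values["STEMMED WORDS"].split(),
    )
```

If the actual files use a different layout or timestamp format, only parse_article needs to change; the Article record mirrors the documented fields.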
The dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.