Dataset accompanying the paper
Getting the Agenda Right: Measuring Media Agenda using Topic Models

Version: 1.0
Release date: August 15, 2015

1 Description

The dataset contains preprocessed news articles, a list of feed URLs from which the articles were downloaded, and a list of themes describing article topics.
The dataset accompanies the paper:

Damir Korenčić, Strahil Ristov, Jan Šnajder (2015). Getting the Agenda Right: Measuring Media Agenda using Topic Models
Proceedings of the Topic Models: Post-Processing and Applications Workshop
TM’15, October 19, 2015, Melbourne, Australia

The paper describes a method for measuring the media agenda (a set of issues of interest).
First, a number of topic models are trained on the corpus articles, and the topics are then labeled with themes, conceptual topics that facilitate issue definition and measurement.
After that, a list of seed words is defined for each issue, and customized topic models with topics corresponding to the issues are built and used to tag the documents with issues.
Finally, experiments with document taggers are performed. The experiments evaluate tagging performance and compare taggers based on topic models with supervised taggers.

Should you decide to use the dataset, please cite the paper. The BibTeX format is:

@InProceedings{korencic2015agenda,
  title={Getting the Agenda Right: Measuring Media Agenda using Topic Models},
  author={Koren{\v c}i{\' c}, Damir and Ristov, Strahil and {\v S}najder, Jan},
  booktitle={Topic Models: Post-Processing and Applications Workshop},
  year={2015},
  pages={in press}
}

2 Dataset

The dataset consists of the following three parts.

Corpus

A set of news articles. Each article resides in a UTF-8 text file named article_ID.txt.
Each file contains the following data:

ID: The corpus-unique article ID
TITLE: Original title from the news outlet
URL: URL used to download the article
DATE SAVED: Timestamp of when the article was saved to the database, in Central European Time
STEMMED WORDS: A list of stems produced by corpus preprocessing
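
The fields above could be read with a short script. The following is a minimal sketch only: it assumes that each field starts a line as "FIELD: value", that a value may continue on the following lines, and that the stems are separated by whitespace. None of this is specified in this README, so check a real file first. The function name parse_article is hypothetical.

```python
from typing import Dict, List, Union

# Field names as listed in this README.
FIELDS = ("ID", "TITLE", "URL", "DATE SAVED", "STEMMED WORDS")


def parse_article(text: str) -> Dict[str, Union[str, List[str]]]:
    """Parse the content of one article file.

    ASSUMPTION: each field starts a line as 'FIELD: value' and may
    continue on the following lines; stems are whitespace-separated.
    """
    raw: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        for name in FIELDS:
            prefix = name + ":"
            if line.startswith(prefix):
                current = name
                raw[name] = line[len(prefix):]
                break
        else:
            # Not a field header: treat as continuation of the last field.
            if current is not None and line.strip():
                raw[current] += " " + line.strip()

    record: Dict[str, Union[str, List[str]]] = {
        name: value.strip() for name, value in raw.items()
    }
    record["STEMMED WORDS"] = raw.get("STEMMED WORDS", "").split()
    return record
```

To read a file from disk, open it with encoding="utf-8" and pass the content to parse_article.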

The text preprocessing consisted of non-word and stop-word removal, followed by lemmatization and stemming.
After preprocessing, very frequent tokens (found in more than 10% of the documents) and infrequent tokens (found in fewer than five documents) were removed.
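
The frequency filter described above can be illustrated as follows. This is a sketch of the stated thresholds, not the code used to build the corpus; the function name and parameters are hypothetical. The released files are already filtered, so this is useful mainly for applying the same rule to new documents.

```python
from collections import Counter
from typing import List


def filter_by_document_frequency(
    docs: List[List[str]],
    max_fraction: float = 0.10,  # drop tokens found in more than 10% of documents
    min_docs: int = 5,           # drop tokens found in fewer than five documents
) -> List[List[str]]:
    """Remove very frequent and infrequent tokens from tokenized documents."""
    doc_freq: Counter = Counter()
    for doc in docs:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc))

    limit = max_fraction * len(docs)
    keep = {
        token
        for token, count in doc_freq.items()
        if min_docs <= count <= limit
    }
    return [[token for token in doc if token in keep] for doc in docs]
```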

Themes

The list of themes with corresponding model topics. Each topic is encoded as uspolM[model_label].[topic_index].
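
A topic identifier can be split into its model label and topic index. The sketch below assumes that the square brackets in the pattern above are placeholders rather than literal characters, that the model label contains no dots or whitespace, and that the topic index is a non-negative integer. The example identifier uspolM0.12 is made up for illustration, and parse_topic_id is a hypothetical name.

```python
import re
from typing import Tuple

# ASSUMPTION: identifiers look like 'uspolM<model_label>.<topic_index>'.
TOPIC_ID = re.compile(r"^uspolM(?P<model>[^.\s]+)\.(?P<topic>\d+)$")


def parse_topic_id(topic_id: str) -> Tuple[str, int]:
    """Return (model_label, topic_index) for one encoded topic."""
    match = TOPIC_ID.match(topic_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized topic identifier: {topic_id!r}")
    return match.group("model"), int(match.group("topic"))
```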

Feeds

The list of XML feeds from which the article URLs were fetched.

3 License

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
