MBTI9k: Corpus of Reddit comments and posts labeled with MBTI personality types
Version: 1.0
Release date: June 6, 2018
1 Description
MBTI9k is a dataset of Reddit posts and comments labeled with MBTI personality types.
It consists of several datasets:
- Reddit posts labeled with MBTI types (366087 rows × 34 columns)
- Reddit comments labeled with MBTI types (22934193 rows × 21 columns)
Columns:
- u'name'
- u'author'
- u'author_flair_text'
- u'downs'
- u'created_utc'
- u'subreddit_id'
- u'link_id'
- u'parent_id'
- u'score'
- u'controversiality'
- u'gilded'
- u'id'
- u'subreddit'
- u'ups'
- u'type'
- u'word_count'
- u'word_count_quoteless'
- u'quote_to_text_ratio'
- u'is_mbti_related'
- u'comment'
- u'lang'
- MBTI9K - subset of comments by authors who wrote more than 1000 words outside of MBTI-related subreddits (9149 rows × 7 columns)
Columns:
- u'author'
- u'comment'
- u'type'
- u'subreddits_commented'
- u'mbti_subreddits_commented'
- u'wc'
- u'comments_num'
- MBTI9K with extracted features (9111 rows × 33499 columns)
Columns:
- 'global':[7,10], #subreddits_commented, subreddits_commented_mbti, num_comments
- 'liwc':[10,103], #liwc
- 'word':[103,3938], #top1000 word ngram (1,2,3) per dimension based on chi2
- 'char':[3938,7243], #top1000 char ngrams (2,3) per dimension based on chi2
- 'sub':[7243,12228], #number of comments in each subreddit
- 'ent':[12228,12229], #entropy
- 'subtf':[12229,17214], #tf-idf on subreddits
- 'subcat':[17214,17249], #manually crafted subreddit categories
- 'lda50':[17249,17299], #50 LDA topics
- 'posts':[17299,17319], #posts statistics
- 'lda100':[17319,17419], #100 LDA topics
- 'psy':[17419,17443], #psycholinguistic features
- 'en':[17443,17444], #ratio of english comments
- 'ttr':[17444,17445], #type token ratio
- 'meaning':[17445,17447], #additional psycholinguistic features
- 'time_diffs':[17447,17453], #commenting time diffs
- 'month':[17453,17465], #monthly distribution
- 'hour':[17465,17489], #hourly distribution
- 'day_of_week':[17489,17496], #daily distribution
- 'word_an':[17496,21496], #word ngrams selected by F-score
- 'word_an_tf':[21496,25496], #tf-idf ngrams selected by F-score
- 'char_an':[25496,29496], #char ngrams selected by F-score
- 'char_an_tf':[29496,33496], #tf-idf char ngrams selected by F-score
- 'brit_amer':[33496,33499], #british vs american english ratio
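The column ranges above can be used to pick out individual feature groups. The following is a minimal sketch, not part of the released dataset: it assumes that each pair is a half-open [start, end) range of 0-based column positions (the ranges are contiguous and the last one ends at 33499, the stated number of columns), and that the first seven columns are not covered by any feature group. The names FEATURE_GROUPS, group_columns, select_features, and check_contiguous are hypothetical helpers introduced for illustration.

```python
# Column ranges copied from the feature list above.
# Assumption: each pair is a half-open [start, end) range of 0-based positions.
FEATURE_GROUPS = {
    "global": (7, 10),
    "liwc": (10, 103),
    "word": (103, 3938),
    "char": (3938, 7243),
    "sub": (7243, 12228),
    "ent": (12228, 12229),
    "subtf": (12229, 17214),
    "subcat": (17214, 17249),
    "lda50": (17249, 17299),
    "posts": (17299, 17319),
    "lda100": (17319, 17419),
    "psy": (17419, 17443),
    "en": (17443, 17444),
    "ttr": (17444, 17445),
    "meaning": (17445, 17447),
    "time_diffs": (17447, 17453),
    "month": (17453, 17465),
    "hour": (17465, 17489),
    "day_of_week": (17489, 17496),
    "word_an": (17496, 21496),
    "word_an_tf": (21496, 25496),
    "char_an": (25496, 29496),
    "char_an_tf": (29496, 33496),
    "brit_amer": (33496, 33499),
}


def group_columns(*groups):
    """Return the column positions covered by the named groups, in the order given."""
    positions = []
    for name in groups:
        start, end = FEATURE_GROUPS[name]
        positions.extend(range(start, end))
    return positions


def select_features(row, *groups):
    """Pick the values of the named feature groups from one row (any indexable sequence)."""
    return [row[i] for i in group_columns(*groups)]


def check_contiguous(groups=FEATURE_GROUPS, first=7, total=33499):
    """Check that the ranges tile columns first..total with no gaps or overlaps."""
    expected = first
    for start, end in sorted(groups.values()):
        if start != expected or end <= start:
            return False
        expected = end
    return expected == total
```

For example, group_columns("lda50", "lda100") returns the positions of both sets of LDA topic features. If the feature table is loaded into a pandas DataFrame, the same list of positions can be passed to DataFrame.iloc to select those columns.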
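The dataset acquisition process is described in:
Matej Gjurković and Jan Šnajder. 2018. Reddit: A Gold Mine for Personality Prediction. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pages 87-97, New Orleans, Louisiana, USA. Association for Computational Linguistics.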
If you use the MBTI dataset for your own work, please cite the above paper. The BibTeX citation is:
@inproceedings{gjurkovic2018reddit,
title={Reddit: A Gold Mine for Personality Prediction},
author={Gjurkovi{\'c}, Matej and {\v{S}}najder, Jan},
booktitle={Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media},
pages={87--97},
month={June},
year={2018},
address={New Orleans, Louisiana, USA},
url={http://aclweb.org/anthology/W18-1112},
doi={10.18653/v1/W18-1112},
publisher={Association for Computational Linguistics}
}
2 Dataset
The datasets are available on request. Please contact me at matej.gjurkovic@fer.hr.
3 License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.