1 Description

MBTI9k is a dataset of Reddit posts and comments labeled with MBTI personality types. It consists of several datasets:

Reddit posts labeled with MBTI types (366087 rows × 34 columns)

Reddit comments labeled with MBTI types (22934193 rows × 21 columns)

Columns:

u'name'
u'author'
u'author_flair_text'
u'downs'
u'created_utc'
u'subreddit_id'
u'link_id'
u'parent_id'
u'score'
u'controversiality'
u'gilded'
u'id'
u'subreddit'
u'ups'
u'type'
u'word_count'
u'word_count_quoteless'
u'quote_to_text_ratio'
u'is_mbti_related'
u'comment'
u'lang'

MBTI9K: a subset containing the comments of authors who wrote more than 1000 words outside of MBTI-related subreddits (9149 rows × 7 columns)

Columns:

u'author'
u'comment'
u'type'
u'subreddits_commented'
u'mbti_subreddits_commented'
u'wc'
u'comments_num'
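The selection rule for this subset (more than 1000 words commented outside of MBTI-related subreddits) can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing code. It assumes that each comment is available as a dict keyed by the column names listed above, that `is_mbti_related` is a boolean flag, and that `word_count` (rather than `word_count_quoteless`) is the quantity being summed. The function name `authors_over_threshold` is hypothetical.

```python
from collections import defaultdict


def authors_over_threshold(comments, min_words=1000):
    """Return the set of authors whose comments outside MBTI-related
    subreddits add up to more than `min_words` words.

    `comments` is an iterable of dicts with the keys 'author',
    'word_count' and 'is_mbti_related' (assumed to be a boolean).
    """
    words_outside = defaultdict(int)
    for comment in comments:
        # Only comments outside MBTI-related subreddits count.
        if not comment["is_mbti_related"]:
            words_outside[comment["author"]] += comment["word_count"]
    return {author for author, total in words_outside.items() if total > min_words}


# Toy data for illustration only; these are not rows from the dataset.
sample = [
    {"author": "a", "word_count": 700, "is_mbti_related": False},
    {"author": "a", "word_count": 400, "is_mbti_related": False},
    {"author": "b", "word_count": 5000, "is_mbti_related": True},
    {"author": "b", "word_count": 200, "is_mbti_related": False},
]
selected = authors_over_threshold(sample)
```

In the toy data, author `a` is kept (1100 words outside MBTI-related subreddits), while author `b` is dropped because most of their words come from MBTI-related subreddits.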

MBTI9K with extracted features (9111 rows × 33499 columns)

Columns:

'global':[7,10], #subreddits_commented, subreddits_commented_mbti, num_comments
'liwc':[10,103], #liwc
'word':[103,3938], #top1000 word ngram (1,2,3) per dimension based on chi2
'char':[3938,7243], #top1000 char ngrams (2,3) per dimension based on chi2
'sub':[7243,12228], #number of comments in each subreddit
'ent':[12228,12229], #entropy
'subtf':[12229,17214], #tf-idf on subreddits
'subcat':[17214,17249], #manually crafted subreddit categories
'lda50':[17249,17299], #50 LDA topics
'posts':[17299,17319], #posts statistics
'lda100':[17319,17419], #100 LDA topics
'psy':[17419,17443], #psycholinguistic features
'en':[17443,17444], #ratio of english comments
'ttr':[17444,17445], #type token ratio
'meaning':[17445,17447], #additional psycholinguistic features
'time_diffs':[17447,17453], #commenting time diffs
'month':[17453,17465], #monthly distribution
'hour':[17465,17489], #hourly distribution
'day_of_week':[17489,17496], #daily distribution
'word_an':[17496,21496], #word ngrams selected by F-score
'word_an_tf':[21496,25496], #tf-idf ngrams selected by F-score
'char_an':[25496,29496], #char ngrams selected by F-score
'char_an_tf':[29496,33496], #tf-idf char ngrams selected by F-score
'brit_amer':[33496,33499], #british vs american english ratio
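The ranges above can be used to pick feature groups out of a row of the feature table. The sketch below assumes that each pair is a half-open `[start, end)` range of zero-based column positions. The listing supports this reading (the ranges are contiguous, `'ent'` spans a single column, and `'lda50'` spans exactly 50), but it is not stated explicitly. The names `FEATURE_GROUPS`, `column_indices` and `select_features` are hypothetical helpers, not part of the dataset.

```python
# Column ranges copied from the listing above, read as half-open [start, end).
FEATURE_GROUPS = {
    "global": (7, 10),
    "liwc": (10, 103),
    "word": (103, 3938),
    "char": (3938, 7243),
    "sub": (7243, 12228),
    "ent": (12228, 12229),
    "subtf": (12229, 17214),
    "subcat": (17214, 17249),
    "lda50": (17249, 17299),
    "posts": (17299, 17319),
    "lda100": (17319, 17419),
    "psy": (17419, 17443),
    "en": (17443, 17444),
    "ttr": (17444, 17445),
    "meaning": (17445, 17447),
    "time_diffs": (17447, 17453),
    "month": (17453, 17465),
    "hour": (17465, 17489),
    "day_of_week": (17489, 17496),
    "word_an": (17496, 21496),
    "word_an_tf": (21496, 25496),
    "char_an": (25496, 29496),
    "char_an_tf": (29496, 33496),
    "brit_amer": (33496, 33499),
}


def column_indices(names):
    """Return the column positions covered by the named feature groups,
    in the order the groups are given."""
    indices = []
    for name in names:
        start, end = FEATURE_GROUPS[name]
        indices.extend(range(start, end))
    return indices


def select_features(row, names):
    """Pick the values of the named feature groups out of one row
    (any sequence indexable by column position)."""
    return [row[i] for i in column_indices(names)]


# Example: positions of the three temporal distribution groups.
temporal = column_indices(["month", "hour", "day_of_week"])
```

If the table is loaded as a NumPy array `X`, the same selection for a single group is the slice `X[:, start:end]`.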

Matej Gjurković and Jan Šnajder (2018). Reddit: A Gold Mine for Personality Prediction. Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media.

If you use the MBTI dataset for your own work, please cite the above paper. The BibTeX citation is:

@inproceedings{gjurkovic2018reddit,
  title={Reddit: A Gold Mine for Personality Prediction},
  author={Gjurkovi{\'c}, Matej and {\v{S}}najder, Jan},
  booktitle={Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media},
  pages={87--97},
  month={June},
  year={2018},
  address={New Orleans, Louisiana, USA},
  url={http://aclweb.org/anthology/W18-1112},
  doi={10.18653/v1/W18-1112},
  publisher={Association for Computational Linguistics}
}

MBTI9k: Corpus of Reddit comments and posts labeled with MBTI personality types
