Personality profiling has been utilised by companies for targeted
advertising, political campaigns and vaccine campaigns. However, the accuracy
and versatility of such models remain relatively unknown. Consequently,
we explore the extent to which people's online digital footprints can be
used to profile their Myers-Briggs personality type. We analyse and compare the
results of four models: logistic regression, naive Bayes, support vector
machines (SVMs) and random forests. We find that an SVM model achieves the
best accuracy, 20.95%, for predicting someone's complete personality type.
However, logistic regression models perform only marginally worse and are
significantly faster both to train and to make predictions. We discover that many
labelled datasets of personal characteristics on social media, including our
own, exhibit substantial class imbalances. As a result, we highlight
the need for care when reporting model performance on these
datasets and compare several methods for addressing the class-imbalance
problem. Moreover, we develop a statistical framework for assessing the
importance of different sets of features in our models. We discover some
features to be more informative than others in the Intuitive/Sensory (p =
0.032) and Thinking/Feeling (p = 0.019) models. While we apply these methods to
Myers-Briggs personality profiling, they could be more generally used for any
labelling of individuals on social media.

Comment: 8 pages, 6 figures. Dataset available at
https://figshare.com/articles/dataset/Self-Reported_Myers-Briggs_Personality_Types_on_Twitter/2362055