Using a set of over 70,000 records from the journal PLOS ONE, each described by 37
lexical, sentiment and bibliographic variables, we perform an analysis backed by
machine learning methods to predict the popularity class of scientific
papers, defined by the number of times they have been viewed. Our study shows
correlations among the features and identifies the view-count threshold
that yields the best predictions in terms of the Matthews
correlation coefficient. Moreover, by creating a variable importance plot for
a random forest classifier, we are able to reduce the number of features while
maintaining similar predictive performance and to determine the crucial factors responsible for
the popularity.

Comment: 13 pages, 6 figures
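
As an illustration of the evaluation metric mentioned above, the following minimal sketch computes the Matthews correlation coefficient for a binary popularity label obtained by thresholding view counts. It is not the authors' code: the two-class split, the strict "greater than" rule, the helper names `binarize` and `mcc`, and all numbers are hypothetical and chosen only for the example.

```python
import math


def binarize(views, threshold):
    """Label a paper 1 (popular) if its view count exceeds the threshold, else 0.

    The strict '>' rule is an assumption made for this sketch.
    """
    return [1 if v > threshold else 0 for v in views]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up view counts and made-up classifier output, for illustration only.
views = [120, 3500, 800, 9000, 40, 2600]
y_true = binarize(views, threshold=1000)
y_pred = [0, 1, 1, 1, 0, 0]
score = mcc(y_true, y_pred)
```

Repeating this computation over a grid of candidate thresholds, retraining the classifier for each, and keeping the threshold with the highest score would be one way to carry out the threshold search described in the abstract.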