Social curation platforms like Reddit are rich with user interactions such as comments, upvotes, and downvotes. Predicting these interactions before they happen is an interesting computational challenge and can be used for a variety of tasks, ranging from content moderation to personality prediction. Given the vast amount of information posted on these sites, it's important to develop models that can simplify this prediction task. In this paper, we present a simple clustering algorithm that helps predict the controversiality of a Reddit post using the user's profile information, their past contributions on Reddit, and the sentiment expressed in their post. On average, introducing the cluster to the prediction task improved the accuracy of the prediction by over 20 percent, with F1 scores of 0.95 (micro) and 0.7 (macro). The classifier performs better than a majority predictor. The results also show that the overwhelming majority of users are inactive and when they do post, they post non-controversial content.

Dara, Abenezer Daniel

English

Dartmouth Digital Commons (Dartmouth College)

A Clustering Algorithm for Early Prediction of Controversial Reddit Posts

A Thesis Submitted to the Faculty in partial fulfillment of the requirement for the degree of Bachelor of Arts in Computer Science
By Abenezer Daniel Dara
Advisor: Professor Soroush Vosoughi
Dartmouth College, Hanover, New Hampshire, May 2020
Dartmouth Computer Science Technical Report TR2020-891

Recommended Citation: Dara, Abenezer Daniel, "A Clustering Algorithm for Early Prediction of Controversial Reddit Posts" (2020). Dartmouth College Undergraduate Theses. 157. https://digitalcommons.dartmouth.edu/senior_theses/157

Contents
1 Introduction
1.1 Problem Statement
1.2 Controversial Posts
2 Data and Methodology
2.1 Data
2.1.1 Organizing the Data
2.2 Clustering
2.3 Prediction Task
3 Results
4 Discussion
4.1 Limitation and Future Work
References

Chapter 1: Introduction

Section 1.1: Problem Statement

Social media sites have become common opinion-sharing platforms (Addawood and Bashir 2016). As a result, academics have long been interested in understanding the interactions between users on these sites. From personality prediction to understanding how misinformation spreads online, studying the interactions between social media users has been informative to researchers (Hessel and Lee 2019; Addawood and Bashir 2016; Angeletou et al. 2011). In this paper, we present a clustering algorithm that helps detect the controversiality of posts on Reddit, a social news aggregation and discussion website.

Detecting these controversial posts has several advantages. Prior research has shown that the controversiality of posts can be indicative of anti-social behavior by the individual posting them (Smith et al. 2013).
Detecting these posts early, before they receive any comments, can help identify content that needs to be moderated (Morrison and Hayes 2013). It can also be used to build a system in which the social media site warns users when they are about to post controversial content, giving them a chance to modify the post if that was not their intent. Furthermore, detecting controversiality can be used to identify discussion threads where debate is happening, which in turn can be used to sort “hot” discussion threads and engage the online community in a discussion. Detecting controversial posts can also help auto-detect bullying behavior (Medvedev et al. 2017).

However, predicting the responses a post receives is an expensive and complicated computational task (Burlutskiy et al. 2016). Given the large volume of user activity on platforms such as Reddit, training models to recognize patterns of activity is costly (Burlutskiy et al. 2016; Lim et al. 2017). On top of that, we also need to account for context-specific variables. Whether a particular post will be controversial may depend on the subreddit in which it is made rather than on its content. For instance, posts on the topic of breakups are controversial in the relationships subreddit, while they do not generate controversy in the AskWomen subreddit (Hessel and Lee 2019). Moreover, other researchers have shown that the controversiality of a post is determined by the first few comments it receives rather than by the original content of the post (Chang and Danescu-Niculescu-Mizil 2019; Hessel and Lee 2019).

This paper presents a simple clustering algorithm for detecting controversial posts well before they receive any responses from the subreddit community. The algorithm is trained on data from over 700,000 Reddit users in 50 subreddits for the period spanning 2004-2018.
We selected Reddit because it is one of the most commonly used social media networks and ranks in the top 10 most visited websites in the US (alexa.com/siteinfo/reddit.com). Because signing up for Reddit does not require an email address, it attracts a large number of users, giving us more content to work with. Because all posts on Reddit are public, we can download all submissions made within the timeframe of our interest as long as they are not deleted. Reddit is divided into “communities” known as subreddits, where users talk about a specific topic related to the subreddit. For this paper, we focus on 50 subreddits where Redditors discuss economics and finance.

Subreddits examined: r/finance, r/economy, r/AskEconomics, r/jobs, r/workonline, r/forhire, r/PersonalFinance, r/Entrepreneur, r/startups, r/financialindependence, r/realestate, r/flipping, r/antimlm, r/ripple, r/Iota, r/stellar, r/investing, r/wallstreetbets, r/millionairemakers, r/weedstocks, r/frugal, r/EatCheapAndHealthy, r/frugalmalefashion, r/budgetfood, r/cheap_meals, r/Frugal_Jerk, r/povertyfinance, r/shutupandtakemymoney, r/BuyItForLife, r/crappyoffbrands, r/shouldibuythisgame, r/Anticonsumption, r/sbubby, r/Wellworn, r/ineeeedit, r/didntknowiwantedthat, r/Bitcoin, r/dogecoin, r/CryptoCurrency, r/ethereum, r/ethtrade, r/litecoin, r/btc, r/garlicoin, r/cardano, r/Vechain

Section 1.2: Controversial Posts

Prior researchers have defined controversial posts as those that cause polarization, receiving both significant positive and negative comments (Hessel and Lee 2019). These posts are more likely to invite heated debate, attracting responses that span a wide variety of emotions. The controversiality of a post can be measured in several ways.

One common method is to examine the sentiment expressed in the post and the comments it attracts. In several studies, researchers have used sentiment analysis to predict whether comments on a post are supportive, neutral, or against the post (Smith et al. 2013), and then used these categorizations to calculate the controversiality of the post. However, this method fails to capture controversiality that is not expressed in words. The upvotes and downvotes a post receives are also strong indicators of “community” opinion, but this method ignores them. On top of that, many replies to a post are links to outside sites or memes that do not lend themselves easily to sentiment analysis.

As a result, we need to take upvotes and downvotes into account to measure the controversiality of posts. However, Reddit no longer provides upvote/downvote counts for posts. Instead, it provides the upvote and downvote counts for comments made under posts. We use these to calculate a controversiality score for each comment and use it as a proxy for the controversiality of the post. The formula for controversiality is $(u + d)^{\min(u/d,\ d/u)}$, where $u$ and $d$ correspond to the number of upvotes and downvotes, respectively. The exponent approaches 1 when the votes are evenly split, so heavily voted, evenly split comments score highest.

The data we obtained shows that the vast majority of posts are not controversial. There are two main reasons for this. First, the vast majority of posts on Reddit do not get any replies. Second, the Reddit algorithm shows the most popular content to users on the first pages of the subreddits. Because controversial posts are not categorized as popular, they are less visible to users (Morrison and Hayes 2013).

Chapter 2: Data and Methodology

Section 2.1: Data

We downloaded posts and comments from 50 subreddits spanning the period between 2004 and 2018.
The data was obtained from Google's BigQuery database (https://console.cloud.google.com/bigquery?project=fh-bigquery&p=fh-bigquery&d=reddit&page=dataset). User account data was obtained using PRAW, the Python Reddit API Wrapper. Our dataset contains 718,732 users.

2.1.1 Organizing the Data

To organize the data, we represented each user as a 58-dimensional vector. The first eight dimensions contained the information we collected about each user through PRAW (https://praw.readthedocs.io/en/latest/code_overview/models/redditor.html). The fields are given in Table 2.1.

[Table 2.1: Data collected for each user]

Some accounts from the initial dataset were deleted, so we could not obtain any of the above data for them. To account for this, we added another feature (hasPrawData) that indicates whether or not we were able to download the user's information. The next 50 dimensions captured the user's contributions to the subreddits we are working with: each field represented the total number of times the user commented or posted in one of our subreddits. Prior research has shown that user contribution captures important qualities of Redditors, such as their level of expertise on the topics discussed in their subreddit communities (Lim et al. 2017).

To reduce the dimensionality of the data, we employ Principal Component Analysis (PCA), an unsupervised feature extraction method that summarizes a large dataset with a smaller number of variables (https://www.sciencedirect.com/topics/medicine-and-dentistry/principal-component-analysis). By fitting a PCA with 58 components, we observe that the first two components explain more than 99.99% of the variance in our data. Consequently, using a two-component PCA, we reduce our data to two dimensions.

Section 2.2: Clustering

Once we obtained our two-dimensional data, we used K-means to cluster the users into separate buckets. K-means is an unsupervised clustering algorithm that partitions a dataset into k distinct, non-overlapping clusters. It minimizes the sum of squared distances between the data points and the mean of their cluster while keeping the cluster means far apart (https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a). We used the elbow method to choose the number of clusters (k=3) for the two-dimensional data. The clustering partitioned the users into three buckets that represented 37%, 10%, and 53% of the users in the initial dataset.

[Figure 2.1: Number of users per cluster]

[Figure 2.2: Percentage of users in each bucket by user feature]

The Clusters. The descriptive statistics show that the first cluster contains the vast majority of the users who have verified their email, have a high reputation (karma) score, and are moderators or gold users. The second cluster contains Redditors whose accounts have been deleted or suspended by Reddit. Even though their accounts were deleted or suspended, their contributions represented about 10% of our dataset, so we decided to keep them in our analysis. Finally, the third cluster contains the rest of the users, and this is where the majority of the users fall.

Using the same method as above, we further partitioned each bucket into three clusters.
We reduced the dimensionality using the top principal components, which accounted for 99.9% of the variance in the data. We then ran K-means (k=3) within each bucket to obtain nine clusters for the entire user dataset.

Section 2.3: Prediction Task

After obtaining the controversiality score for each post, we turn to the prediction task. To detect the sentiment expressed in each post, we used VADER sentiment analysis, which is specifically attuned to sentiments expressed in social media posts and comments. VADER provides scores for how negative, positive, or neutral a certain text is. For each post in our dataset, we used it to calculate a single compound score between -1 (very negative) and 1 (very positive) indicating the intensity and direction of the sentiment expressed in the post. Posts that did not contain any words were not included in this step.

Reddit no longer provides the number of downvotes on posts. Therefore, to determine the controversiality of posts for our training data, we measured the controversy generated in the comments section as a proxy. We were able to obtain the upvotes and downvotes for each comment, which allowed us to calculate the controversiality score using the formula $(u + d)^{\min(u/d,\ d/u)}$, where $u$ and $d$ correspond to the number of upvotes and downvotes, respectively.

Chapter 3: Results

For our baseline, we train a Gaussian Naïve Bayes (GNB) classifier to predict the controversiality score of the Reddit posts using the sentiment scores from VADER. GNB is a variation of the Naïve Bayes classification algorithm that can be applied to continuous, normally distributed data. The algorithm computes conditional class probabilities and predicts the most likely value of the target feature, in our case the controversiality score (https://towardsdatascience.com/naive-bayes-classifier-explained-50f9723571ed). For our second prediction task, we added the nine user clusters obtained from our previous computation and retrained the Gaussian model. We then measured whether adding the buckets improved the quality of our prediction.

Our results show an improvement in F1 scores with the inclusion of the nine user clusters. On average, the macro F1 score improved from 0.5 to 0.7, and the micro F1 score improved from 0.86 to 0.95. The micro F1 score starts from a higher value because of the imbalance in the data: the overwhelming majority of posts in our dataset (> 90%) have very low controversiality scores and received fewer than 10 comments.

[Figure 3.1: Micro F1 score improvements before and after including the user clusters]

[Figure 3.2: Macro F1 score improvements before and after including the user clusters]

Chapter 4: Discussion

Section 4.1: Limitation and Future Work

We show how to more accurately predict the controversiality of Reddit posts by clustering users into different buckets based on their prior contributions to Reddit and basic information from their user accounts. Including these user clusters in our prediction task improved the F1 score of our controversiality prediction, indicating that the clusters capture important qualities of the users. Further work can explore other characteristics that these clusters capture, for instance by predicting the popularity of users or predicting posts that are likely to be ignored.

However, the work has several limitations that future work needs to address. Many posts are links to other websites or just images, which do not lend themselves easily to sentiment analysis. Those posts were not used in our model due to this limitation. In addition, further work needs to account for the technical challenges of mapping comment replies to the original posts when the parent comments are deleted. We had to exclude these replies from our dataset.
Additionally, increasing the number of buckets could improve the prediction quality and needs to be explored.

References

Addawood, Aseel, and Masooda Bashir. "'What is Your Evidence?' A Study of Controversial Topics on Social Media." Proceedings of the Third Workshop on Argument Mining (ArgMining2016). 2016.
Alexa. https://www.alexa.com/siteinfo/reddit.com#section_traffic
Angeletou, Sofia, Matthew Rowe, and Harith Alani. "Modelling and analysis of user behaviour in online communities." International Semantic Web Conference. Springer, Berlin, Heidelberg, 2011.
Burlutskiy, Nikolay, et al. "An investigation on online versus batch learning in predicting user behaviour." International Conference on Innovative Techniques and Applications of Artificial Intelligence. Springer, Cham, 2016.
Chang, Jonathan P., and Cristian Danescu-Niculescu-Mizil. "Trouble on the Horizon: Forecasting the Derailment of Online Conversations as they Develop." arXiv preprint arXiv:1909.01362 (2019).
Hessel, Jack, and Lillian Lee. "Something's Brewing! Early Prediction of Controversy-causing Posts from Discussion Features." arXiv preprint arXiv:1904.07372 (2019).
Lim, Wern Han, Mark James Carman, and Sze-Meng Jojo Wong. "Estimating relative user expertise for content quality prediction on Reddit." Proceedings of the 28th ACM Conference on Hypertext and Social Media. 2017.
Medvedev, Alexey N., Renaud Lambiotte, and Jean-Charles Delvenne. "The anatomy of Reddit: An overview of academic research." Dynamics on and of Complex Networks. Springer, Cham, 2017.
Morrison, Donn, and Conor Hayes. "Here, have an upvote: Communication behaviour and karma on Reddit." INFORMATIK 2013–Informatik angepasst an Mensch, Organisation und Umwelt (2013).
Smith, Laura M., et al. "The role of social media in the discussion of controversial topics." 2013 International Conference on Social Computing. IEEE, 2013.

https://digitalcommons.dartmouth.edu/cgi/viewcontent.cgi?article=1156&context=senior_theses
