A sentiment based approach to pattern discovery and classification in social media

Abstract

Social media allows people to participate, express opinions, mediate their own content and interact with other users. As such, sentiment information has become an integral part of social media. This thesis presents a sentiment-based approach to analyse content and social relationships in social media.

First, this thesis aims to construct building blocks for sentiment analysis in social media, using sentiment in the form of mood. To that end, the problem of supervised mood classification is investigated. This line of work provides insights into which features in a generic document classification problem can be transferred to a mood classification problem in social media. As data in social media is normally large scale, novel scalable feature sets are introduced for this task. In particular, a novel set of psycholinguistic features is proposed and validated, which does not require a supervised feature selection phase and can therefore be applied to mood analysis at a large scale. Next, under an unsupervised setting, this thesis explores the new problem of pattern discovery in social media using sentiment information. The result is the discovery of intrinsic patterns of moods, each of which can be considered as a group of moods similar to a basic emotion studied in psychology. This provides valuable empirical, data-driven evidence about the structure of human emotion in the social media domain.

The second major contribution of this thesis explores the use of sentiment information conveyed in on-line social diaries for the detection of real-world events in a large-scale setting. In particular, this thesis introduces the novel concept of 'sentiment burst' and employs a stochastic model for the detection, and subsequent extraction, of events in social media. The resultant model is a powerful burst detection algorithm suitable for on-line deployment on ever-growing datasets such as social media. An additional contribution in this line of work is an effective method for evaluating and ranking events using Google Timeline. This offers an objective measure by which to evaluate event detection, a topic that is largely under-explored in the current literature due to a general lack of human ground truth.

Next, under an egocentric analysis, sentiment information is used to study the impact of the demographics and personalities of users on the messages they create. In particular, we examine how the age and social connectivity of on-line users correlate with the affective, topical and psycholinguistic features of the texts they author. Using a large, ground-truthed dataset of millions of users and on-line diaries, we investigate various important questions posed in social media analysis, psychology and sociology. For example, is there a difference with regard to topic, psycholinguistic features and mood in the messages written by old versus young users? What features are predictive of a user's personality? Of extraversion and introversion? Are there features that are predictive of influence? The results obtained by our sentiment-based approach are encouraging, and the approach does not require an expensive feature selection phase, which suggests a new and promising direction for egocentric analysis in the social media domain.

Finally, the sentiment information conveyed in media content is investigated with respect to the networking and interaction aspects of a social media system. Sentiment information is studied in parallel with two other common aspects of social media content: topics and linguistic styles. Sentiment information is shown in this thesis to provide additional insights into the process of community formation. It is also shown to be a powerful predictor of community membership for a message or a user at a lighter computational cost.
