Probing Spurious Correlations in Popular Event-Based Rumor Detection Benchmarks
As social media becomes a hotbed for the spread of misinformation, the
crucial task of rumor detection has witnessed promising advances fostered by
open-source benchmark datasets. We find that these datasets, despite being
widely used, suffer from spurious correlations, which are ignored by existing
studies and lead to severe overestimation of existing rumor detection
performance. The spurious correlations stem from three causes: (1) event-based
data collection and labeling schemes assign the same veracity label to multiple
highly similar posts from the same underlying event; (2) merging multiple data
sources spuriously relates source identities to veracity labels; and (3)
labeling bias. In this paper, we closely investigate three of the most popular
rumor detection benchmark datasets (i.e., Twitter15, Twitter16 and PHEME), and
propose event-separated rumor detection as a solution to eliminate spurious
cues. Under the event-separated setting, we observe that the accuracy of
existing state-of-the-art models drops significantly by over 40%, becoming only
comparable to a simple neural classifier. To better address this task, we
propose Publisher Style Aggregation (PSA), a generalizable approach that
aggregates publisher posting records to learn writing style and veracity
stance. Extensive experiments demonstrate that our method outperforms existing
baselines in terms of effectiveness, efficiency and generalizability. Comment: Accepted to ECML-PKDD 202
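
The event-separated setting described in the abstract above can be illustrated with a minimal sketch. It assumes each sample carries an event identifier; the function name `event_separated_split`, the dictionary keys, and the toy data are hypothetical and are not taken from the paper. The idea shown is only that whole events, rather than individual posts, are assigned to train or test, so near-duplicate posts from one event cannot leak across the split.

```python
import random
from collections import defaultdict


def event_separated_split(samples, test_fraction=0.25, seed=0):
    """Split samples so that no event appears in both train and test.

    `samples` is a list of dicts with the (hypothetical) keys
    "event" and "label".
    """
    # Group posts by the event they belong to.
    by_event = defaultdict(list)
    for sample in samples:
        by_event[sample["event"]].append(sample)

    # Shuffle the events (not the posts) with a fixed seed.
    events = sorted(by_event)
    random.Random(seed).shuffle(events)

    # Hold out whole events for testing; always keep at least one.
    n_test = max(1, round(len(events) * test_fraction))
    test_events = set(events[:n_test])

    train = [s for e in events if e not in test_events for s in by_event[e]]
    test = [s for e in events if e in test_events for s in by_event[e]]
    return train, test


# Toy data, invented for illustration only.
samples = [
    {"event": "e1", "text": "post a", "label": "rumor"},
    {"event": "e1", "text": "post b", "label": "rumor"},
    {"event": "e2", "text": "post c", "label": "non-rumor"},
    {"event": "e3", "text": "post d", "label": "rumor"},
    {"event": "e4", "text": "post e", "label": "non-rumor"},
    {"event": "e4", "text": "post f", "label": "non-rumor"},
]
train, test = event_separated_split(samples)
train_events = {s["event"] for s in train}
test_events = {s["event"] for s in test}
```

A random post-level split of the same data could place "post a" in train and "post b" in test, which is the kind of overlap the abstract identifies as a source of spurious cues.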
Doctor of Philosophy
Dissertation. Due to the popularity of Web 2.0 and social media in the last decade, the spread of user-generated content (UGC) has rapidly increased. In the financial realm, this has resulted in the emergence of virtual investing communities (VIC) open to the investing public. There is an ongoing debate among scholars and practitioners on whether such UGC contains valuable investing information or mainly noise. My dissertation comprises two major studies. First, I examine the relationship between peer influence and information quality in the context of individual characteristics in stock microblogging. Surprisingly, I discover that the set of individual characteristics that relate to peer influence is not the same as the set that relates to high information quality. With respect to information quality, influentials who are frequently mentioned by peers due to their name value are likely to possess higher information quality, while those who are better at diffusing information via retweets are likely to be associated with lower information quality. Second, I explore how well stock microblog dimensions and features predict stock price directional movements, using data mining classification techniques. I find that the author-ticker-day dimension produces the highest predictive accuracy, which suggests that this dimension captures both relevant author and ticker information better than the author-day and ticker-day dimensions do. In addition to these two studies, I also explore two topics: the network structure of co-tweeted tickers and sentiment annotation via crowdsourcing. I do this in order to uncover new features and new outcome indicators, with the objective of improving the predictive accuracy of the classification models or the saliency of the explanatory models. My dissertation work extends the frontier in understanding the relationship between financial UGC, specifically stock microblogging, and relevant phenomena as well as predictive outcomes.
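
The three aggregation dimensions named in the abstract above (author-ticker-day, author-day, ticker-day) can be read as different grouping keys over the same posts. The sketch below is an assumption about that reading, not the dissertation's actual pipeline; the helper `group_posts`, the field names, and the sample posts are hypothetical.

```python
from collections import defaultdict


def group_posts(posts, keys):
    """Group posts by the tuple of values found under `keys`."""
    groups = defaultdict(list)
    for post in posts:
        groups[tuple(post[k] for k in keys)].append(post)
    return dict(groups)


# Invented example posts; field names are assumptions.
posts = [
    {"author": "a1", "ticker": "AAPL", "day": "2012-01-03", "sentiment": 1},
    {"author": "a1", "ticker": "AAPL", "day": "2012-01-03", "sentiment": -1},
    {"author": "a2", "ticker": "AAPL", "day": "2012-01-03", "sentiment": 1},
    {"author": "a1", "ticker": "MSFT", "day": "2012-01-03", "sentiment": 1},
]

# The same posts aggregated along each of the three dimensions.
author_ticker_day = group_posts(posts, ("author", "ticker", "day"))
author_day = group_posts(posts, ("author", "day"))
ticker_day = group_posts(posts, ("ticker", "day"))
```

Under this reading, the author-ticker-day grouping is the finest of the three: it keeps apart posts that the other two groupings merge across tickers or across authors.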
VIRAL TOPIC PREDICTION AND DESCRIPTION IN MICROBLOG SOCIAL NETWORKS
Ph.D. DOCTOR OF PHILOSOPHY
Knowledge Discovery and Complex Network Dynamics in Social Media Space
Pattern discovery and correlation in text data have been a research hotbed in recent times. However, a composite model that captures patterns and correlations as a quantitative measure in social media space has yet to receive much research attention. The paper therefore analyzes social media data from Twitter about the 2014 FIFA World Cup both as lexical text and as a complex network system. Quantitatively, it is found that the 140-character upper bound in Twitter does not have a negative impact on the formation of ideas. Treating the data as lexical text, the following key statistics were confirmed: the distribution of the words in the corpus obeys Zipf's law, 3-character words account for almost 22% of the corpus, and the distribution of the article "the" also follows Zipf's law, that is, a power law. Moreover, the three most frequent terms related to the World Cup event (url, worldcup, rt) account for about 14.5% of the corpus. In particular, the corpus is modeled as a network G = (V, E), where V is the set of vocabulary words in the corpus and E is the set of bigrams (two-word phrases). An algorithm is developed and implemented in Python to obtain the bigrams from the corpus. Using concepts from graph theory, the bigram network is analyzed, and the results show compelling facts about the text network. Firstly, all the characteristics of complex networks known in the literature are observed in the bigram network. These include the degree distribution, which is observed to follow a power law with a degree exponent of 2.14. Secondly, the average path length between words is observed to be 4.78, which is within the "small-world" range. Thirdly, other complex network metrics, such as eigenvector and betweenness centrality, are observed within the bigram network, both having weak power-law distributions as observed in other complex networks in the literature.
These findings call for studying the topological characteristics of text data and comparing their structural properties with known complex network metrics in the literature. The results will be of great importance in studying complex systems. The application areas of these findings are also numerous, ranging from information retrieval and data compression to information security. To the best of our knowledge, this is the first work that studies the textual and topological structure of text from a social media platform as a complex network and analyzes important topological properties of complex networks on it. Keywords: complex network, bigram, media space, Twitter, information science
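
The bigram network described in the abstract above can be sketched roughly as follows. This is not the authors' algorithm, which the abstract does not give; the tokenizer, the function names, the choice to treat edges as undirected when counting degree, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bigram_network(documents):
    """Return (V, E): the vocabulary set and a Counter of adjacent word pairs."""
    vocabulary = set()
    edges = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        vocabulary.update(tokens)
        # Each pair of neighbouring tokens is one bigram, i.e. one edge.
        edges.update(zip(tokens, tokens[1:]))
    return vocabulary, edges


def degree_counts(edges):
    """Number of distinct neighbours per word, treating edges as undirected."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return {word: len(adjacent) for word, adjacent in neighbors.items()}


# Invented sample tweets, not taken from the paper's corpus.
tweets = [
    "rt worldcup final tonight",
    "worldcup final was great",
    "rt great goal",
]
V, E = bigram_network(tweets)
degrees = degree_counts(E)
```

On a real corpus, a histogram of the values in `degrees` would be the starting point for checking the power-law degree distribution that the abstract reports.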
Microbloggers’ motivations in participatory journalism: A cross-cultural study of America and China
This phenomenological study focuses on the motivations of participatory journalists contributing on microblogs such as Twitter and Weibo. Although online user behavior and motivations have been studied before, few studies have examined the motivations of participatory journalists from their own perspective. Moreover, this study is one of the few to explore participatory journalists across different cultures (U.S. and China). The author conducted a total of 13 in-depth interviews with participatory journalists on microblogs from both countries and used a qualitative analysis method to identify the themes and patterns that emerged. Motivations such as earning respect, early technology adoption, self-expression, relationship building, self-enhancement, branding and image building, and financial gain were discussed. De-motivational factors such as time constraints and self-censorship were presented. Motivational differences between the two groups of participants, including what the microblog account represents and the role of participatory journalists, were explained by cultural differences in collectivism versus individualism and in power distance. Limitations and future research were also discussed.