330 research outputs found

    Probing Spurious Correlations in Popular Event-Based Rumor Detection Benchmarks

    As social media becomes a hotbed for the spread of misinformation, the crucial task of rumor detection has witnessed promising advances fostered by open-source benchmark datasets. Although these datasets are widely used, we find that they suffer from spurious correlations, which are ignored by existing studies and lead to severe overestimation of existing rumor detection performance. The spurious correlations stem from three causes: (1) event-based data collection and labeling schemes assign the same veracity label to multiple highly similar posts from the same underlying event; (2) merging multiple data sources spuriously relates source identities to veracity labels; and (3) labeling bias. In this paper, we closely investigate three of the most popular rumor detection benchmark datasets (i.e., Twitter15, Twitter16 and PHEME), and propose event-separated rumor detection as a solution to eliminate spurious cues. Under the event-separated setting, we observe that the accuracy of existing state-of-the-art models drops significantly by over 40%, becoming only comparable to a simple neural classifier. To better address this task, we propose Publisher Style Aggregation (PSA), a generalizable approach that aggregates publisher posting records to learn writing style and veracity stance. Extensive experiments demonstrate that our method outperforms existing baselines in terms of effectiveness, efficiency and generalizability. Comment: Accepted to ECML-PKDD 202
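
The event-separated setting described in this abstract can be illustrated with a minimal sketch. It assumes each post carries an event identifier and splits at the event level, so that no event contributes posts to both the training and the test set. The function name `event_separated_split`, the dictionary keys, and the toy posts are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def event_separated_split(posts, test_fraction=0.25, seed=0):
    """Split posts so that no event appears in both train and test.

    `posts` is a list of dicts with hypothetical keys "event", "text", "label".
    """
    by_event = defaultdict(list)
    for post in posts:
        by_event[post["event"]].append(post)

    events = sorted(by_event)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(events)

    # Hold out whole events, never individual posts.
    n_test = max(1, round(len(events) * test_fraction))
    held_out = set(events[:n_test])

    train = [p for e in events if e not in held_out for p in by_event[e]]
    test = [p for e in events if e in held_out for p in by_event[e]]
    return train, test


# Toy data (illustrative only): two near-duplicate posts share event "e1".
posts = [
    {"event": "e1", "text": "claim A", "label": "false"},
    {"event": "e1", "text": "claim A, reworded", "label": "false"},
    {"event": "e2", "text": "claim B", "label": "true"},
    {"event": "e3", "text": "claim C", "label": "unverified"},
    {"event": "e4", "text": "claim D", "label": "true"},
]
train, test = event_separated_split(posts)
train_events = {p["event"] for p in train}
test_events = {p["event"] for p in test}
```

A random post-level split would instead let the two "e1" posts land on opposite sides, which is the kind of leakage the abstract attributes to event-based collection.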

    Doctor of Philosophy

    Dissertation. Due to the popularity of Web 2.0 and social media over the last decade, the volume of user-generated content (UGC) has increased rapidly. In the financial realm, this has led to the emergence of virtual investing communities (VIC) open to the investing public. There is an ongoing debate among scholars and practitioners on whether such UGC contains valuable investing information or mainly noise. My dissertation comprises two major studies. First, I examine the relationship between peer influence and information quality in the context of individual characteristics in stock microblogging. Surprisingly, I discover that the set of individual characteristics that relate to peer influence is not the same as the set that relates to high information quality. With respect to information quality, influentials who are frequently mentioned by peers due to their name value are likely to possess higher information quality, while those who are better at diffusing information via retweets are likely to be associated with lower information quality. Second, I explore how well stock microblog dimensions and features predict the direction of stock price movements, using data mining classification techniques. I find that the author-ticker-day dimension produces the highest predictive accuracy, which suggests that this dimension captures both relevant author and ticker information better than the author-day and ticker-day dimensions. In addition to these two studies, I also explore two topics: the network structure of co-tweeted tickers and sentiment annotation via crowdsourcing. I do this to uncover new features and new outcome indicators, with the objective of improving the predictive accuracy of the classification models or the saliency of the explanatory models. My dissertation work extends the frontier in understanding the relationship between financial UGC, specifically stock microblogging, and relevant phenomena as well as predictive outcomes.
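
To make the three aggregation dimensions named in this abstract concrete, the sketch below groups toy microblog records by author-day, ticker-day, and author-ticker-day. The record layout, the sentiment score, and the averaging step are assumptions chosen for illustration; the abstract does not state which features the dissertation actually aggregates.

```python
from collections import defaultdict

# Hypothetical records: (author, ticker, day, sentiment score).
tweets = [
    ("alice", "AAPL", "2013-01-02", 1.0),
    ("alice", "AAPL", "2013-01-02", 0.0),
    ("alice", "MSFT", "2013-01-02", -1.0),
    ("bob", "AAPL", "2013-01-02", 1.0),
]


def aggregate(records, key_fn):
    """Average the sentiment score of all records that share a key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[key_fn(record)].append(record[3])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


author_day = aggregate(tweets, lambda r: (r[0], r[2]))
ticker_day = aggregate(tweets, lambda r: (r[1], r[2]))
author_ticker_day = aggregate(tweets, lambda r: (r[0], r[1], r[2]))
```

In this toy data the author-ticker-day grouping keeps alice's AAPL and MSFT posts apart, whereas author-day merges them and ticker-day merges alice's and bob's AAPL posts.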

    VIRAL TOPIC PREDICTION AND DESCRIPTION IN MICROBLOG SOCIAL NETWORKS

    Ph.D., Doctor of Philosophy

    Knowledge Discovery and Complex Network Dynamics in Social Media Space

    Pattern discovery and correlation in text data have been a research hotbed in recent times. However, a composite model that captures patterns and correlations as a quantitative measure in the social media space has yet to receive much research attention. The paper therefore analyzes social media data from Twitter about the 2014 FIFA World Cup both as lexical text and as a complex network system. Quantitatively, it finds that the 140-character upper bound on Twitter does not have a negative impact on the formation of ideas. As lexical text, the following key statistics were confirmed: the distribution of the words in the corpus obeys Zipf's law, 3-character words account for almost 22% of the corpus, and the distribution of the article "the" also follows a Zipf or power law. Moreover, the three most frequent terms related to the World Cup event (url, worldcup, rt) account for about 14.5% of the corpus. In particular, the corpus is modeled as a network G = (V, E), where V is the set of vocabulary words in the corpus and E is the set of bigrams (two-word phrases). An algorithm is developed and implemented in Python to obtain the bigrams from the corpus. Using concepts from graph theory, the bigram network is analyzed, and the results show compelling facts about text networks. Firstly, all the characteristics of complex networks known in the literature are observed in the bigram network. These include the degree distribution, which is observed to follow a power law with a degree exponent of 2.14. Secondly, the average path length between words is observed to be 4.78, which is within the "small world" category. Thirdly, other complex network characteristics, such as the eigenvector and betweenness centrality metrics, are observed within the bigram network, both having weak power-law distributions, as observed in other complex networks in the literature.
These findings call for studying the topological characteristics of text data and comparing their structural properties to those of known complex network metrics in the literature. The results will be of great importance in studying complex systems. The application areas of these findings are also numerous, ranging from information retrieval and data compression to information security. To the best of our knowledge, this is the first work to study the textual and topological structure of text from a social media platform as a complex network and to analyze important topological properties of complex networks on it. Keywords: complex network, bigram, media space, Twitter, information science
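
As a rough illustration of the bigram network this abstract describes, the sketch below builds a vocabulary set and a set of weighted bigram edges from a few toy tweets, then computes each word's degree. The whitespace tokenizer, the undirected treatment of edges, and the toy corpus are assumptions; the paper's own algorithm and preprocessing are not given in the abstract.

```python
from collections import Counter, defaultdict


def bigram_network(tweets):
    """Return (vocabulary set, Counter of adjacent word pairs)."""
    vocab = set()
    edges = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()  # naive whitespace tokenizer (assumption)
        vocab.update(tokens)
        edges.update(zip(tokens, tokens[1:]))  # consecutive word pairs
    return vocab, edges


def degrees(edges):
    """Undirected degree of each word: its number of distinct neighbours."""
    neighbours = defaultdict(set)
    for left, right in edges:
        if left != right:
            neighbours[left].add(right)
            neighbours[right].add(left)
    return {word: len(ns) for word, ns in neighbours.items()}


# Toy corpus (illustrative only).
corpus = [
    "rt worldcup final tonight",
    "rt worldcup goal",
    "worldcup final goal",
]
V, E = bigram_network(corpus)
deg = degrees(E)
degree_histogram = Counter(deg.values())  # degree -> number of words
```

On a real corpus, `degree_histogram` is the quantity one would plot on log-log axes to check for the power-law degree distribution the abstract reports.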

    Unsupervised learning on social data

    Microbloggers’ motivations in participatory journalism: A cross-cultural study of America and China

    This phenomenological study focuses on the motivations of participatory journalists contributing on microblogs such as Twitter and Weibo. Although online user behavior and motivations have been studied before, few studies have examined the motivations of participatory journalists from their own perspective. Moreover, this study is one of the few to explore participatory journalists across different cultures (U.S. and China). The author conducted a total of 13 in-depth interviews with participatory journalists on microblogs from both countries and used a qualitative analysis method to identify the themes and patterns that emerged. Motivations such as earning respect, early technology adoption, self-expression, relationship building, self-enhancement, branding and image building, and financial gain were discussed. De-motivational factors such as time constraints and self-censorship were presented. Motivational differences between the two groups of participants, including what the microblog account represents and the role of participatory journalists, were explained by cultural differences in collectivism versus individualism and in power distance. Limitations and future research were also discussed.