Search CORE

59 research outputs found

When is it Biased? Assessing the Representativeness of Twitter's Streaming API

Author: Liu Huan
Morstatter Fred
Pfeffer Jürgen
Publication venue
Publication date: 30/01/2014
Field of study

Twitter has captured the interest of the scientific community not only for its massive user base and content, but also for its openness in sharing its data. Twitter shares a free 1% sample of its tweets through the "Streaming API", a service that returns a sample of tweets according to a set of parameters set by the researcher. Recently, research has pointed to evidence of bias in the data returned through the Streaming API, raising concern in the integrity of this data service for use in research scenarios. While these results are important, the methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. In this work we tackle the problem of finding sample bias without the need for "gold standard" Firehose data. Namely, we focus on finding time periods in the Streaming API data where the trend of a hashtag is significantly different from its trend in the true activity on Twitter. We propose a solution that focuses on using an open data source to find bias in the Streaming API. Finally, we assess the utility of the data source in sparse data situations and for users issuing the same query from different regions

arXiv.org e-Print Archive

CiteSeerX

Debiasing Community Detection: The Importance of Lowly-Connected Nodes

Author: Mehrabi Ninareh
Morstatter Fred
Peng Nanyun
Galstyan Aram
Publication venue
Publication date: 01/10/1940
Field of study

Community detection is an important task in social network analysis, allowing us to identify and understand the communities within the social structures. However, many community detection approaches either fail to assign low degree (or lowly-connected) users to communities, or assign them to trivially small communities that prevent them from being included in analysis. In this work, we investigate how excluding these users can bias analysis results. We then introduce an approach that is more inclusive for lowly-connected users by incorporating them into larger groups. Experiments show that our approach outperforms the existing state-of-the-art in terms of F1 and Jaccard similarity scores while reducing the bias towards low-degree users

arXiv.org e-Print Archive

Trinity College

Finding Eyewitness Tweets During Crises

Author: Liu Huan
Lubold Nichola
Morstatter Fred
Pfeffer Jürgen
Pon-Barry Heather
Publication venue
Publication date: 01/01/2014
Field of study

Disaster response agencies have started to incorporate social media as a source of fast-breaking information to understand the needs of people affected by the many crises that occur around the world. These agencies look for tweets from within the region affected by the crisis to get the latest updates of the status of the affected region. However only 1% of all tweets are geotagged with explicit location information. First responders lose valuable information because they cannot assess the origin of many of the tweets they collect. In this work we seek to identify non-geotagged tweets that originate from within the crisis region. Towards this, we address three questions: (1) is there a difference between the language of tweets originating within a crisis region and tweets originating outside the region, (2) what are the linguistic patterns that can be used to differentiate within-region and outside-region tweets, and (3) for non-geotagged tweets, can we automatically identify those originating within the crisis region in real-time

arXiv.org e-Print Archive

CiteSeerX

Crossref

Aggressive, Repetitive, Intentional, Visible, and Imbalanced: Refining Representations for Cyberbullying Classification

Author: Morstatter Fred
Vigfusson Ymir
Ziems Caleb
Publication venue
Publication date: 03/04/2020
Field of study

Cyberbullying is a pervasive problem in online communities. To identify cyberbullying cases in large-scale social networks, content moderators depend on machine learning classifiers for automatic cyberbullying detection. However, existing models remain unfit for real-world applications, largely due to a shortage of publicly available training data and a lack of standard criteria for assigning ground truth labels. In this study, we address the need for reliable data using an original annotation framework. Inspired by social sciences research into bullying behavior, we characterize the nuanced problem of cyberbullying using five explicit factors to represent its social and linguistic aspects. We model this behavior using social network and language-based features, which improve classifier performance. These results demonstrate the importance of representing and modeling cyberbullying as a social phenomenon.Comment: 12 pages, 5 figures, 22 tables, Accepted to the 14th International AAAI Conference on Web and Social Media, ICWSM'2

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications