When is it Biased? Assessing the Representativeness of Twitter's Streaming API
Twitter has captured the interest of the scientific community not only for
its massive user base and content, but also for its openness in sharing its
data. Twitter shares a free 1% sample of its tweets through the "Streaming
API", a service that returns a sample of tweets according to a set of
parameters set by the researcher. Recently, research has pointed to evidence of
bias in the data returned through the Streaming API, raising concern about the
integrity of this data service for use in research scenarios. While these
results are important, the methodologies proposed in previous work rely on the
restrictive and expensive Firehose to find the bias in the Streaming API data.
In this work we tackle the problem of finding sample bias without the need for
"gold standard" Firehose data. Namely, we focus on finding time periods in the
Streaming API data where the trend of a hashtag is significantly different from
its trend in the true activity on Twitter. We propose a solution that focuses
on using an open data source to find bias in the Streaming API. Finally, we
assess the utility of the data source in sparse data situations and for users
issuing the same query from different regions.
Debiasing Community Detection: The Importance of Lowly-Connected Nodes
Community detection is an important task in social network analysis, allowing
us to identify and understand the communities within the social structures.
However, many community detection approaches either fail to assign low degree
(or lowly-connected) users to communities, or assign them to trivially small
communities that prevent them from being included in analysis. In this work, we
investigate how excluding these users can bias analysis results. We then
introduce an approach that is more inclusive for lowly-connected users by
incorporating them into larger groups. Experiments show that our approach
outperforms the existing state-of-the-art in terms of F1 and Jaccard similarity
scores while reducing the bias towards low-degree users.
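The abstract above reports F1 and Jaccard similarity scores without defining them. As a minimal sketch, assuming each community is represented as a set of user IDs and a detected community is compared against one ground-truth community, the two scores could be computed as follows. The function names and the example user IDs are hypothetical and are not taken from the paper.

```python
def jaccard(detected, truth):
    """Jaccard similarity: size of the intersection over size of the union."""
    detected, truth = set(detected), set(truth)
    union = detected | truth
    # Two empty communities are treated as identical (an assumption of this sketch).
    return len(detected & truth) / len(union) if union else 1.0


def f1(detected, truth):
    """F1 score: harmonic mean of precision and recall over community members."""
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)  # share of detected members that are correct
    recall = overlap / len(truth)        # share of true members that were detected
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one detected community and one ground-truth community.
detected = {"u1", "u2", "u3", "u4"}
ground_truth = {"u2", "u3", "u4", "u5", "u6"}

print(jaccard(detected, ground_truth))
print(f1(detected, ground_truth))
```

Under this set-based view, a low-degree user left out of every detected community lowers recall, and therefore F1, for the ground-truth community that user belongs to.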
Finding Eyewitness Tweets During Crises
Disaster response agencies have started to incorporate social media as a
source of fast-breaking information to understand the needs of people affected
by the many crises that occur around the world. These agencies look for tweets
from within the region affected by the crisis to get the latest updates of the
status of the affected region. However, only 1% of all tweets are geotagged with
explicit location information. First responders lose valuable information
because they cannot assess the origin of many of the tweets they collect. In
this work we seek to identify non-geotagged tweets that originate from within
the crisis region. Towards this, we address three questions: (1) is there a
difference between the language of tweets originating within a crisis region
and tweets originating outside the region, (2) what are the linguistic patterns
that can be used to differentiate within-region and outside-region tweets, and
(3) for non-geotagged tweets, can we automatically identify those originating
within the crisis region in real time?
Aggressive, Repetitive, Intentional, Visible, and Imbalanced: Refining Representations for Cyberbullying Classification
Cyberbullying is a pervasive problem in online communities. To identify
cyberbullying cases in large-scale social networks, content moderators depend
on machine learning classifiers for automatic cyberbullying detection. However,
existing models remain unfit for real-world applications, largely due to a
shortage of publicly available training data and a lack of standard criteria
for assigning ground truth labels. In this study, we address the need for
reliable data using an original annotation framework. Inspired by social
sciences research into bullying behavior, we characterize the nuanced problem
of cyberbullying using five explicit factors to represent its social and
linguistic aspects. We model this behavior using social network and
language-based features, which improve classifier performance. These results
demonstrate the importance of representing and modeling cyberbullying as a
social phenomenon.
Comment: 12 pages, 5 figures, 22 tables, Accepted to the 14th International
AAAI Conference on Web and Social Media, ICWSM'2