When is it Biased? Assessing the Representativeness of Twitter's Streaming API
Twitter has captured the interest of the scientific community not only for
its massive user base and content, but also for its openness in sharing its
data. Twitter shares a free 1% sample of its tweets through the "Streaming
API", a service that returns a sample of tweets according to a set of
parameters set by the researcher. Recently, research has pointed to evidence of
bias in the data returned through the Streaming API, raising concern about the
integrity of this data service for use in research scenarios. While these
results are important, the methodologies proposed in previous work rely on the
restrictive and expensive Firehose to find the bias in the Streaming API data.
In this work we tackle the problem of finding sample bias without the need for
"gold standard" Firehose data. Namely, we focus on finding time periods in the
Streaming API data where the trend of a hashtag is significantly different from
its trend in the true activity on Twitter. We propose a solution that focuses
on using an open data source to find bias in the Streaming API. Finally, we
assess the utility of the data source in sparse data situations and for users
issuing the same query from different regions.
Debiasing Community Detection: The Importance of Lowly-Connected Nodes
Community detection is an important task in social network analysis, allowing
us to identify and understand the communities within the social structures.
However, many community detection approaches either fail to assign low degree
(or lowly-connected) users to communities, or assign them to trivially small
communities that prevent them from being included in analysis. In this work, we
investigate how excluding these users can bias analysis results. We then
introduce an approach that is more inclusive for lowly-connected users by
incorporating them into larger groups. Experiments show that our approach
outperforms the existing state-of-the-art in terms of F1 and Jaccard similarity
scores while reducing the bias towards low-degree users.
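The two scores named in this abstract can be illustrated for a single detected community compared against a single ground-truth community, each treated as a set of user IDs. This is a minimal sketch, not the paper's evaluation procedure: the function names and the example user IDs are hypothetical, and how communities are matched and averaged across a whole network is not stated in the abstract.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def f1(predicted: set, truth: set) -> float:
    """Harmonic mean of precision and recall for one predicted community."""
    overlap = len(predicted & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # share of predicted members that are correct
    recall = overlap / len(truth)         # share of true members that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical communities: 3 shared users, 6 users in the union.
predicted = {"u1", "u2", "u3", "u4"}
truth = {"u2", "u3", "u4", "u5", "u6"}

print(jaccard(predicted, truth))  # 3 / 6
print(f1(predicted, truth))       # precision 3/4, recall 3/5
```

A low-degree user left out of every predicted community lowers recall for the true community it belongs to, which is one way the exclusion described above would show up in these scores.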
Finding Eyewitness Tweets During Crises
Disaster response agencies have started to incorporate social media as a
source of fast-breaking information to understand the needs of people affected
by the many crises that occur around the world. These agencies look for tweets
from within the region affected by the crisis to get the latest updates of the
status of the affected region. However, only 1% of all tweets are geotagged with
explicit location information. First responders lose valuable information
because they cannot assess the origin of many of the tweets they collect. In
this work we seek to identify non-geotagged tweets that originate from within
the crisis region. Towards this, we address three questions: (1) is there a
difference between the language of tweets originating within a crisis region
and tweets originating outside the region, (2) what are the linguistic patterns
that can be used to differentiate within-region and outside-region tweets, and
(3) for non-geotagged tweets, can we automatically identify those originating
within the crisis region in real time?
Identifying and Analyzing Cryptocurrency Manipulations in Social Media
Interest surrounding cryptocurrencies, digital or virtual currencies that are
used as a medium for financial transactions, has grown tremendously in recent
years. The anonymity surrounding these currencies makes investors particularly
susceptible to fraud---such as "pump and dump" scams---where the goal is to
artificially inflate the perceived worth of a currency, luring victims into
investing before the fraudsters can sell their holdings. Because of the speed
and relative anonymity offered by social platforms such as Twitter and
Telegram, social media has become a preferred platform for scammers who wish to
spread false hype about the cryptocurrency they are trying to pump. In this
work we propose and evaluate a computational approach that can automatically
identify pump and dump scams as they unfold by combining information across
social media platforms. We also develop a multi-modal approach for predicting
whether a particular pump attempt will succeed or not. Finally, we analyze the
prevalence of bots in cryptocurrency related tweets, and observe a significant
increase in bot activity during the pump attempts.
Comment: Section 4. Prediction tasks: The training setup and algorithm
revised. The details of the training algorithm added. More features added to
the feature set. Section 5. Botometer score added as the likelihood of a user
being a bot. More analysis added on bot activity in cluster
Feature Selection: A Data Perspective
Feature selection, as a data preprocessing strategy, has been proven to be
effective and efficient in preparing data (especially high-dimensional data)
for various data mining and machine learning problems. The objectives of
feature selection include: building simpler and more comprehensible models,
improving data mining performance, and preparing clean, understandable data.
The recent proliferation of big data has presented some substantial challenges
and opportunities to feature selection. In this survey, we provide a
comprehensive and structured overview of recent advances in feature selection
research. Motivated by current challenges and opportunities in the era of big
data, we revisit feature selection research from a data perspective and review
representative feature selection algorithms for conventional data, structured
data, heterogeneous data and streaming data. Methodologically, to emphasize the
differences and similarities of most existing feature selection algorithms for
conventional data, we categorize them into four main groups: similarity based,
information theoretical based, sparse learning based and statistical based
methods. To facilitate and promote the research in this community, we also
present an open-source feature selection repository that consists of most of
the popular feature selection algorithms
(http://featureselection.asu.edu/). Also, we use it as an example to show
how to evaluate feature selection algorithms. At the end of the survey, we
present a discussion about some open problems and challenges that require more
attention in future research.
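As a rough illustration of the "information theoretical based" group mentioned in this abstract, the sketch below ranks discrete features by their mutual information with the class label. It is a generic filter written with the Python standard library only; it is not taken from the repository referenced above, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_features(columns, labels):
    """Return feature names sorted from most to least informative."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: one feature duplicates the label, the other is independent of it.
labels = [0, 0, 1, 1]
columns = {
    "copies_label": [0, 0, 1, 1],
    "noise": [0, 1, 0, 1],
}

ranking = rank_features(columns, labels)
print(ranking)
```

Scoring each feature on its own keeps the filter cheap, but it ignores redundancy between features, so two near-identical features would both rank highly.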
