218 research outputs found

    When is it Biased? Assessing the Representativeness of Twitter's Streaming API

    Full text link
    Twitter has captured the interest of the scientific community not only for its massive user base and content, but also for its openness in sharing its data. Twitter shares a free 1% sample of its tweets through the "Streaming API", a service that returns a sample of tweets according to a set of parameters set by the researcher. Recently, research has pointed to evidence of bias in the data returned through the Streaming API, raising concern in the integrity of this data service for use in research scenarios. While these results are important, the methodologies proposed in previous work rely on the restrictive and expensive Firehose to find the bias in the Streaming API data. In this work we tackle the problem of finding sample bias without the need for "gold standard" Firehose data. Namely, we focus on finding time periods in the Streaming API data where the trend of a hashtag is significantly different from its trend in the true activity on Twitter. We propose a solution that focuses on using an open data source to find bias in the Streaming API. Finally, we assess the utility of the data source in sparse data situations and for users issuing the same query from different regions

    Debiasing Community Detection: The Importance of Lowly-Connected Nodes

    Get PDF
    Community detection is an important task in social network analysis, allowing us to identify and understand the communities within the social structures. However, many community detection approaches either fail to assign low degree (or lowly-connected) users to communities, or assign them to trivially small communities that prevent them from being included in analysis. In this work, we investigate how excluding these users can bias analysis results. We then introduce an approach that is more inclusive for lowly-connected users by incorporating them into larger groups. Experiments show that our approach outperforms the existing state-of-the-art in terms of F1 and Jaccard similarity scores while reducing the bias towards low-degree users

    Finding Eyewitness Tweets During Crises

    Full text link
    Disaster response agencies have started to incorporate social media as a source of fast-breaking information to understand the needs of people affected by the many crises that occur around the world. These agencies look for tweets from within the region affected by the crisis to get the latest updates of the status of the affected region. However only 1% of all tweets are geotagged with explicit location information. First responders lose valuable information because they cannot assess the origin of many of the tweets they collect. In this work we seek to identify non-geotagged tweets that originate from within the crisis region. Towards this, we address three questions: (1) is there a difference between the language of tweets originating within a crisis region and tweets originating outside the region, (2) what are the linguistic patterns that can be used to differentiate within-region and outside-region tweets, and (3) for non-geotagged tweets, can we automatically identify those originating within the crisis region in real-time

    Identifying and Analyzing Cryptocurrency Manipulations in Social Media

    Full text link
    Interest surrounding cryptocurrencies, digital or virtual currencies that are used as a medium for financial transactions, has grown tremendously in recent years. The anonymity surrounding these currencies makes investors particularly susceptible to fraud---such as "pump and dump" scams---where the goal is to artificially inflate the perceived worth of a currency, luring victims into investing before the fraudsters can sell their holdings. Because of the speed and relative anonymity offered by social platforms such as Twitter and Telegram, social media has become a preferred platform for scammers who wish to spread false hype about the cryptocurrency they are trying to pump. In this work we propose and evaluate a computational approach that can automatically identify pump and dump scams as they unfold by combining information across social media platforms. We also develop a multi-modal approach for predicting whether a particular pump attempt will succeed or not. Finally, we analyze the prevalence of bots in cryptocurrency related tweets, and observe a significant increase in bot activity during the pump attempts.Comment: Section 4. Prediction tasks: The training setup and algorithm revised. The details of the training algorithm added. More features added to the feature set. Section 5. Botometer score added as the likelihood of a user being bot. More analysis added on bot activity in cluster

    Feature Selection: A Data Perspective

    Full text link
    Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity based, information theoretical based, sparse learning based and statistical based methods. To facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (\url{http://featureselection.asu.edu/}). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research
    corecore