56 research outputs found

    Should We Use the Sample? Analyzing Datasets Sampled from Twitter's Stream API

    National Research Foundation (NRF) Singapore under International Research Centre @ Singapore Funding Initiativ

    Data Preparation for Social Network Mining and Analysis

    A Path Toward the Use of Trail Users’ Tweets to Assess Effectiveness of the Environmental Stewardship Scheme: An Exploratory Analysis of the Pennine Way National Trail

    Large and unofficial data sets, for instance those gathered from social media, are increasingly being used in geographical research and explored as decision support tools for policy development. Social media data have the potential to provide new insight into phenomena about which there is little information from conventional sources. Within this context, this paper explores the potential of social media data to evaluate the aesthetic management of landscape. Specifically, this project utilises the perceptions of visitors to the Pennine Way National Trail, which passes through land managed under the Environmental Stewardship Scheme (ESS). The method analyses sentiment in trail users’ public Twitter messages (tweets) with the aim of assessing the extent to which the ESS maintains landscape character within the trail corridor. The method demonstrates the importance of filtering social media data to convert it into useful information. After filtering, the results are based on 161 messages directly related to the trail. Although small, this sample illustrates the potential for social media to be used as a cheap and increasingly abundant source of information. We suggest that social media data in this context should be seen as a resource that can complement, rather than replace, conventional data sources such as questionnaires and interviews. Furthermore, we provide guidance on how social media could be effectively used by conservation bodies, such as Natural England, which are charged with the management of areas of environmental value worldwide

    Examining Canada’s Scientific Literacy Through COVID-19 Tweets

    Scientific misinformation spread on social media is a concern for science communicators, health communicators, and science educators alike. During the COVID-19 pandemic, the World Health Organization (WHO) released a statement that modern technology has created an infodemic, undermining the COVID-19 response effort. Misinformation spread online threatens public health and can endanger lives. So how do we combat it? The leading solution is education, in particular, equipping individuals with scientific literacy. Scientific literacy, or the ability to critically evaluate, understand, and make decisions regarding scientific information, is the goal of science curriculums globally. There has been much research over the past couple of decades regarding the usage of scientific literacy in formal learning environments. In contrast, the relationship between scientific literacy and online informal learning environments such as social media is not well understood. Our case study sought to help fill this gap in the research by exploring how Canadians employ scientific literacy on Twitter, a popular social media site, when discussing the COVID-19 pandemic. We conducted an exploratory qualitative case study of 2,600 tweets originating from accounts with user locations in Canada and shared on Twitter during the first ten months of the pandemic (March 2020 to December 2020) to see whether and how they displayed scientific literacy. In addition, we examined the trends and factors that affect the usage of scientific literacy online. Using qualitative content analysis techniques and supplemental statistical analysis, we found that 10% of tweets sampled displayed scientific literacy, while 2% did not exhibit scientific literacy. There were no interprovincial differences in how Canadians displayed scientific literacy, with all provinces sampled exhibiting scientific literacy in approximately 10% of tweets. Furthermore, scientific literacy was not affected by how often the user tweeted, how many followers they had, or the month the tweet was shared. We discovered a strong relationship between the tweet's topic and whether it displayed scientific literacy or a lack of scientific literacy. Our study provides more insight into how scientific literacy is displayed online. Future researchers can use this as a starting point to conduct studies exploring how scientific literacy is employed in online spaces in different locations and contexts globally.

    Detecting Abnormal Behavior in Web Applications

    The rapid advance of web technologies has made the Web an essential part of our daily lives. However, network attacks have exploited vulnerabilities of web applications and caused substantial damage to Internet users. Detecting network attacks is the first and an important step in network security. A major branch in this area is anomaly detection. This dissertation concentrates on detecting abnormal behaviors in web applications by employing the following methodology. For a web application, we conduct a set of measurements to reveal the existence of abnormal behaviors in it. We observe the differences between normal and abnormal behaviors. By applying a variety of methods in information extraction, such as heuristic algorithms, machine learning, and information theory, we extract features useful for building a classification system to detect abnormal behaviors. In particular, we have studied four detection problems in web security. The first is detecting unauthorized hotlinking behavior that plagues hosting servers on the Internet. We analyze a group of common hotlinking attacks and the web resources targeted by them. Then we present an anti-hotlinking framework for protecting materials on hosting servers. The second problem is detecting aggressive behavior of automation on Twitter. Our work determines whether a Twitter user is a human, bot, or cyborg based on the degree of automation. We observe the differences among the three categories in terms of tweeting behavior, tweet content, and account properties. We propose a classification system that uses the combination of features extracted from an unknown user to determine the likelihood of being a human, bot, or cyborg. Furthermore, we shift the detection perspective from automation to spam, and introduce the third problem, namely detecting social spam campaigns on Twitter. Evolved from individual spammers, spam campaigns manipulate and coordinate multiple accounts to spread spam on Twitter, and display some collective characteristics. We design an automatic classification system based on machine learning, and apply multiple features to classify spam campaigns. Complementary to conventional spam detection methods, our work brings efficiency and robustness. Finally, we extend our detection research into the blogosphere to capture blog bots. In this problem, detecting human presence is an effective defense against the automatic posting ability of blog bots. We introduce behavioral biometrics, mainly mouse and keyboard dynamics, to distinguish between human and bot. By passively monitoring user browsing activities, this detection method does not require any direct user participation, and improves the user experience.

    Tuning in to Terrorist Signals

    Mining and Managing User-Generated Content and Preferences

    In this thesis, we present techniques to manage the results of expressive queries, such as skyline queries, and to mine online content that has been generated by users. Given the numerous scenarios and applications where content mining can be applied, we focus, in particular, on two cases: review mining and social media analysis. More specifically, we focus on preference queries, where users can query a set of items, each associated with an attribute set. For each of the attributes, users can specify their preference on whether to minimize or maximize it, e.g., "minimize price", "maximize performance", etc. Such queries are also known as "Pareto optimal" or "skyline" queries. A drawback of this query type is that the result may become too large for the user to inspect manually. We propose an approach that addresses this issue by selecting a set of diverse skyline results. We provide a formal definition of skyline diversification and present efficient techniques to return such a set of points. The result can then be ranked according to established quality criteria. We also propose an alternative scheme for ranking skyline results, following an information retrieval approach.
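
    As a rough illustration of the skyline (Pareto-optimal) query described in this abstract, the sketch below keeps only the items that no other item dominates. It is a naive quadratic scan, not the thesis's technique, and it does not cover the diversification or ranking steps. The function names, the `maximize` flags, and the sample (price, performance) data are all hypothetical.

```python
from typing import List, Sequence, Tuple

Item = Tuple[float, ...]


def dominates(a: Item, b: Item, maximize: Sequence[bool]) -> bool:
    """Return True if a is at least as good as b on every attribute
    and strictly better on at least one."""
    strictly_better = False
    for x, y, want_max in zip(a, b, maximize):
        if not want_max:
            # Flip the sign so that "larger is better" holds for every attribute.
            x, y = -x, -y
        if x < y:
            return False
        if x > y:
            strictly_better = True
    return strictly_better


def skyline(items: List[Item], maximize: Sequence[bool]) -> List[Item]:
    """Naive O(n^2) skyline: keep items not dominated by any other item."""
    return [p for p in items
            if not any(dominates(q, p, maximize) for q in items)]


# Hypothetical data: (price, performance); minimize price, maximize performance.
laptops = [(900, 7), (1200, 9), (1000, 6), (1500, 9)]
result = skyline(laptops, maximize=(False, True))
print(result)
```

    With this data, (1000, 6) is dominated by (900, 7) and (1500, 9) by (1200, 9), so two items remain. On large inputs the skyline itself can grow large, which is the problem the abstract's diversification step is meant to address.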

    Novel nonparametric method for classifying time series

    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (pages 67-68). In supervised classification, one attempts to learn a model of how objects map to labels by selecting the best model from some model space. The choice of model space encodes assumptions about the problem. We propose a setting for model specification and selection in supervised learning based on a latent source model. In this setting, we specify the model by a small collection of unknown latent sources and posit that there is a stochastic model relating latent sources and observations. With this setting in mind, we propose a nonparametric classification method that is entirely unaware of the structure of these latent sources. Instead, our method relies on the data as a proxy for the unknown latent sources. We perform classification by computing the conditional class probabilities for an observation based on our stochastic model. This approach has an appealing and natural interpretation: that an observation belongs to a certain class if it sufficiently resembles other examples of that class. We extend this approach to the problem of online time series classification. In the binary case, we derive an estimator for online signal detection and an associated implementation that is simple, efficient, and scalable. We demonstrate the merit of our approach by applying it to the task of detecting trending topics on Twitter. Using a small sample of Tweets, our method can detect trends before Twitter does 79% of the time, with a mean early advantage of 1.43 hours, while maintaining a 95% true positive rate and a 4% false positive rate. In addition, our method provides the flexibility to perform well under a variety of tradeoffs between types of error and relative detection time. By Stanislav Nikolov. M. Eng.
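
    The idea that an observation belongs to a class if it sufficiently resembles other examples of that class can be sketched as a similarity-weighted vote over labeled examples. The abstract does not state the distance measure or the weighting, so the squared Euclidean distance, the exponential weight with a `gamma` parameter, the equal-length series, and the sample data below are all assumptions made for illustration; this is not the thesis's actual estimator.

```python
import math
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def class_probabilities(observation: Sequence[float],
                        examples: List[Example],
                        gamma: float = 1.0) -> Dict[str, float]:
    """Estimate class probabilities for an observation by letting every
    labeled example vote with weight exp(-gamma * squared distance).

    Assumes all series have the same length as the observation."""
    scores: Dict[str, float] = {}
    for series, label in examples:
        # Squared Euclidean distance (an assumed choice of similarity).
        dist = sum((a - b) ** 2 for a, b in zip(observation, series))
        scores[label] = scores.get(label, 0.0) + math.exp(-gamma * dist)
    total = sum(scores.values())
    # Normalize the votes so they sum to one.
    return {label: score / total for label, score in scores.items()}


# Hypothetical labeled activity curves.
examples = [
    ([0, 1, 2, 3], "trend"),
    ([0, 2, 4, 6], "trend"),
    ([1, 1, 1, 1], "no_trend"),
    ([2, 2, 1, 2], "no_trend"),
]
probs = class_probabilities([0, 1, 3, 4], examples, gamma=0.5)
print(probs)
```

    In this sketch, a larger `gamma` makes the vote depend more heavily on the closest examples, while a smaller value spreads influence across all of them.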