
    Combining Supervised and Unsupervised Learning to Detect and Semantically Aggregate Crisis-Related Twitter Content

    Twitter is an immediate and almost ubiquitous platform and can therefore be a valuable source of information during disasters. Current methods for identifying and classifying crisis-related content often operate on single tweets, i.e., information already known from the past is neglected. In this paper, the combination of tweet-wise pre-trained neural networks and unsupervised semantic clustering is proposed and investigated. The intention is to (1) enhance the generalization capability of pre-trained models, (2) handle massive amounts of stream data, (3) reduce information overload by identifying potentially crisis-related content, and (4) obtain a semantically aggregated data representation that allows for further automated, manual, and visual analyses. Latent representations of each tweet, based on pre-trained sentence embedding models, are used for both clustering and tweet classification. For fast, robust, and time-continuous processing, subsequent time periods are clustered individually according to a Chinese restaurant process. Clusters without any tweet classified as crisis-related are pruned. Data aggregation over time is ensured by merging semantically similar clusters. A comparison of our hybrid method to a similar clustering approach, as well as first quantitative and qualitative results from experiments with two different labeled data sets, demonstrates the great potential of the method for crisis-related Twitter stream analyses.
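    The pipeline described above lends itself to a compact sketch. The following is a minimal, illustrative Python version, assuming a sentence-embedding model such as Sentence-Transformers; the function names, similarity thresholds, and the simplified CRP-style assignment rule are our own assumptions, not the authors' code.

```python
# Illustrative sketch of the hybrid pipeline: embed tweets, cluster each
# time window with a CRP-style sequential assignment, prune clusters with
# no crisis-related tweet, and merge similar clusters over time.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def crp_cluster(embeddings, alpha=0.3):
    """Sequential CRP-style assignment: join the most similar cluster
    centroid unless similarity is too low, then open a new cluster."""
    centroids, members = [], []
    for i, e in enumerate(embeddings):
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]  # near-cosine similarity
        if sims and max(sims) > 1.0 - alpha:
            k = int(np.argmax(sims))
            members[k].append(i)
            # running mean keeps the centroid of the growing cluster
            centroids[k] = centroids[k] + (e - centroids[k]) / len(members[k])
        else:
            centroids.append(e)
            members.append([i])
    return centroids, members

def prune_and_merge(centroids, members, crisis_flags, merge_sim=0.9):
    """Drop clusters with no crisis-related tweet; merge near-duplicates."""
    kept = [(c, m) for c, m in zip(centroids, members)
            if any(crisis_flags[i] for i in m)]
    merged = []
    for c, m in kept:
        for j, (c2, m2) in enumerate(merged):
            if float(c @ c2) > merge_sim:
                merged[j] = (c2, m2 + m)
                break
        else:
            merged.append((c, m))
    return merged

tweets = ["Flooding on Main St, need sandbags", "Great coffee this morning"]
crisis_flags = [True, False]  # stand-in for the pre-trained classifier output
emb = embedder.encode(tweets)
clusters = prune_and_merge(*crp_cluster(emb), crisis_flags)
```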

    Modern Survey Estimation with Social Media and Auxiliary Data

    Traditional survey methods have been successful for nearly a century, but recently response rates have been declining and costs have been increasing, making the future of survey science uncertain. At the same time, new media sources are generating new forms of data, population data are increasingly readily available, and sophisticated machine learning algorithms are being created. This dissertation uses modern data sources and tools to improve survey estimates and advance the field of survey science.

    We begin by exploring the challenges of using data from new media, demonstrating how relationships between social media data and survey responses can appear deceptively strong. We examine a previously observed relationship between the sentiment of "jobs" tweets and consumer confidence, performing a sensitivity analysis on how tweet sentiment is calculated and sorting "jobs" tweets into categories based on their content, concluding that the originally observed relationship was merely a chance occurrence. Next, we track the relationship between the sentiment of "Trump" tweets and presidential approval. We develop a framework to interpret the strength of this observed relationship by implementing placebo analyses, in which we perform the same analysis with tweets assumed to be unrelated to presidential approval, concluding that our observed relationship is not strong. Failing to find a meaningful signal, we next propose following a set of users over time. For a set of politically active users, we are able to find evidence of a political signal in the frequency and sentiment of their tweets around the 2016 presidential election.

    In a given corpus of tweets, several topics are likely to be present, which has the potential to introduce bias when using the corpus to track survey responses. To help discover and sort tweets into these topics, we create a clustering-based topic modeling algorithm. Using the entire corpus, we compute distances between words based on how often they appear together in the same tweet, compute distances between tweets based on the distances between the words they contain, and perform clustering on the resulting distances. We show that this method is effective using a validation set of tweets and apply it to the corpus of tweets from politically active users and to "jobs" tweets.

    Finally, we use population auxiliary data and machine learning algorithms to improve survey estimates. We develop an imputation-based estimation method that produces an unbiased estimate of the mean response of a finite population from a simple random sample when population auxiliary data are available. Our method allows any prediction function or machine learning algorithm to be used to predict the response for out-of-sample observations, and it can therefore accommodate high-dimensional settings and all covariate types. Exact unbiasedness is guaranteed by estimating the bias of the prediction function using subsamples of the original simple random sample. Importantly, the unbiasedness property does not depend on the accuracy of the imputation method. We apply this estimation method to simulated data, college tuition data, and the American Community Survey.
    Ph.D. dissertation, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163193/1/fergr_1.pd
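    The final estimator admits a short sketch. The version below is a minimal illustration of correcting an imputation estimate with a bias term estimated from subsamples of the simple random sample; the half-sample splitting scheme and the random forest are our assumptions for illustration, not necessarily the dissertation's exact construction.

```python
# Illustrative sketch: imputation-based estimate of a finite-population mean
# with a bias correction estimated from subsamples of the sample itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def debiased_imputation_mean(X_pop, sample_idx, y_sample, n_splits=20, seed=0):
    rng = np.random.default_rng(seed)
    sample_idx = np.asarray(sample_idx)
    y_sample = np.asarray(y_sample)
    n = len(sample_idx)
    # any prediction function or ML algorithm can be plugged in here
    fit = RandomForestRegressor(random_state=0).fit(X_pop[sample_idx], y_sample)
    naive = fit.predict(X_pop).mean()  # imputation estimate over the population
    # estimate the bias of the prediction rule: refit on a subsample and
    # compare predictions with observed responses on the held-out units
    bias = []
    for _ in range(n_splits):
        half = rng.choice(n, size=n // 2, replace=False)
        rest = np.setdiff1d(np.arange(n), half)
        sub = RandomForestRegressor(random_state=0).fit(
            X_pop[sample_idx[half]], y_sample[half])
        bias.append((sub.predict(X_pop[sample_idx[rest]]) - y_sample[rest]).mean())
    return naive - np.mean(bias)

# toy usage on simulated data
rng = np.random.default_rng(1)
X_pop = rng.normal(size=(5000, 4))
y_pop = X_pop @ [1.0, -2.0, 0.5, 0.0] + rng.normal(size=5000)
idx = rng.choice(5000, size=300, replace=False)
print(debiased_imputation_mean(X_pop, idx, y_pop[idx]), y_pop.mean())
```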

    Enhanced web-based summary generation for search.

    After a user types a search query into a major search engine, they are presented with a number of search results, each made up of a title, a brief text summary, and a URL. It is then the user's job to select documents for further review. Our research aims to improve the accuracy with which users select relevant documents by improving the way these web pages are summarized. Improvements in accuracy will in turn save time and improve the user experience. We propose ReClose, a system for generating web document summaries. ReClose generates summary content by combining techniques from query-biased and query-independent summary generation. Query-biased summaries generally present query terms in context, while query-independent summaries focus on summarizing documents as a whole. Combining these techniques led to a 10% improvement in user decision making over Google-generated summaries. Color-coded ReClose summaries provide keyword usage depth at a glance and also alert users to topic departures. Color-coding further enhanced ReClose results and led to a 20% improvement in user decision making over Google-generated summaries. Many online documents include structure and multimedia of various forms, such as tables, lists, forms, and images. We propose including this structure in web page summaries. We found that expert users were slowed only insignificantly in decision making, while the majority of average users made decisions more quickly using summaries that include structure, without any decrease in decision accuracy. We additionally extended ReClose to summarize large numbers of tweets for tracking flu outbreaks on social media. The resulting summaries have variable length and are effective at summarizing flu-related trends. Users of the system achieved an accuracy of 0.86 when labeling multi-tweet summaries. This showed that the basis of ReClose is effective outside of web documents and that variable-length summaries can be more effective than fixed-length ones. Overall, the ReClose system produces unique summaries that contain more informative content than current search engines provide, highlights results in a more meaningful way, and adds structure when meaningful. The applications of ReClose extend far beyond search and have been demonstrated in summarizing pools of tweets.
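    As a rough illustration of the core idea of blending query-biased and query-independent evidence when scoring sentences for a summary, consider the sketch below. The mixing weight, tokenizer, and scoring functions are illustrative assumptions; this is not the published ReClose system.

```python
# Sketch: score sentences with a blend of query-biased evidence (overlap
# with the query) and query-independent evidence (overall term frequency),
# then keep the top-k sentences in document order.
import re
from collections import Counter

def summarize(document, query, k=2, mix=0.5):
    sentences = re.split(r"(?<=[.!?])\s+", document)
    q_terms = set(query.lower().split())
    doc_tf = Counter(re.findall(r"[a-z']+", document.lower()))
    top_tf = max(doc_tf.values())
    scored = []
    for pos, s in enumerate(sentences):
        words = re.findall(r"[a-z']+", s.lower())
        if not words:
            continue
        biased = sum(w in q_terms for w in words) / len(words)
        generic = sum(doc_tf[w] for w in words) / (len(words) * top_tf)
        scored.append((mix * biased + (1 - mix) * generic, pos, s))
    top = sorted(scored, reverse=True)[:k]
    # restore document order so the summary reads naturally
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

doc = ("Flu cases are rising across the county. Schools report absences. "
       "The health department urges vaccination. Local teams won again.")
print(summarize(doc, "flu vaccination"))
```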

    DESCRIBING URGENT EVENT DIFFUSION ON TWITTER USING NETWORK STATISTICS

    In this dissertation, I develop a novel framework to study the diffusion of urgent events through the popular social media platform Twitter. Based on my literature review, this is the first comprehensive study of urgent event diffusion through Twitter. I observe similar diffusion patterns among different data sets and adopt a "cross prediction" mode to handle the early-time prediction problem. I show that statistics from the network of Twitter retweets can not only provide profound insights into event diffusion but can also be used to effectively predict user influence and topic popularity. These findings are consistent across various experimental settings. I also demonstrate that linear models consistently outperform state-of-the-art nonlinear ones in both user and hashtag prediction tasks, possibly implying a strong log-linear relationship between the selected prediction features and the responses, which could potentially be a general phenomenon in urgent event diffusion.
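    The kind of analysis described here can be sketched compactly: derive per-user statistics from a retweet network and fit a linear model on log-transformed features, training on one event and predicting on another ("cross prediction"). The feature choices, toy data, and influence target below are our assumptions for illustration.

```python
# Sketch: retweet-network statistics plus a linear model for user influence.
import networkx as nx
import numpy as np
from sklearn.linear_model import LinearRegression

def retweet_features(edges):
    """Per-user statistics from a directed retweeter -> author graph."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    pr = nx.pagerank(g)
    users = sorted(g.nodes)
    X = np.array([[g.in_degree(u), g.out_degree(u), pr[u]] for u in users])
    return users, X

# toy stand-ins for two events; real inputs would be early retweet edges
event_a = [("u1", "u2"), ("u3", "u2"), ("u4", "u2"), ("u4", "u3")]
event_b = [("v1", "v2"), ("v3", "v2"), ("v3", "v4")]
users_a, X_a = retweet_features(event_a)
y_a = np.array([0, 9, 2, 1])  # toy "eventual influence" per user in users_a
users_b, X_b = retweet_features(event_b)

# cross prediction: train on one event, predict influence in another;
# log transforms mirror the log-linear relationship noted in the abstract
model = LinearRegression().fit(np.log1p(X_a), np.log1p(y_a))
print(dict(zip(users_b, np.expm1(model.predict(np.log1p(X_b))))))
```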

    Review article: Detection of actionable tweets in crisis events

    Messages on social media can be an important source of information during crisis situations. They can frequently provide details about developments much faster than traditional sources (e.g., official news) and can offer personal perspectives on events, such as opinions or specific needs. In the future, these messages could also serve to assess disaster risks. One challenge in utilizing social media in crisis situations is the reliable detection of relevant messages in a flood of data. Researchers have started to address this problem in recent years, beginning with crowdsourced methods. Lately, approaches have shifted towards automatic analysis of messages. A major stumbling block is the question of exactly which messages are considered relevant or informative, as this depends on the specific usage scenario and the role of the user in that scenario. In this review article, we present methods for the automatic detection of crisis-related messages (tweets) on Twitter. We start by showing the varying definitions of importance and relevance relating to disasters, leading into the concept of use-case-dependent actionability that has recently become more popular and is the focal point of this review. This is followed by an overview of existing crisis-related social media data sets for evaluation and training purposes. We then compare approaches to the detection problem based (1) on filtering by characteristics like keywords and location, (2) on crowdsourcing, and (3) on machine learning techniques, and we analyze the suitability and limitations of these approaches with regard to actionability. We then point out particular challenges, such as linguistic issues specific to social media data. Finally, we suggest future avenues of research and show connections to related tasks, such as the subsequent semantic classification of tweets.
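    To make approach (1) concrete, the sketch below shows the kind of cheap keyword-and-location first-pass filter the review refers to; the keyword list, bounding box, and tweet schema are illustrative placeholders, not any specific system from the reviewed literature.

```python
# Sketch of approach (1): filter a tweet stream by crisis keywords and a
# geographic bounding box before any downstream actionability analysis.
CRISIS_KEYWORDS = {"flood", "earthquake", "evacuate", "trapped", "aftershock"}
BBOX = (-98.7, 29.2, -98.2, 29.7)  # (min_lon, min_lat, max_lon, max_lat)

def is_candidate(tweet):
    """Cheap first pass; downstream classifiers decide actionability."""
    text_hit = any(k in tweet["text"].lower() for k in CRISIS_KEYWORDS)
    coords = tweet.get("coords")  # (lon, lat) or None when not geotagged
    geo_hit = (coords is not None
               and BBOX[0] <= coords[0] <= BBOX[2]
               and BBOX[1] <= coords[1] <= BBOX[3])
    return text_hit and geo_hit

tweets = [{"text": "Major flood on 5th Ave, need help", "coords": (-98.5, 29.5)},
          {"text": "Lovely weather today", "coords": (-98.4, 29.4)}]
print([t["text"] for t in tweets if is_candidate(t)])
```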

    A novel data analytic model for mining user insurance demands from microblogs

    This paper proposes a method based on the LDA model and Word2Vec for analyzing Microblog users' insurance demands. First, we use the LDA model to analyze Microblog users' text data and obtain candidate topics. Second, we use the CBOW model to vectorize the topic words and expand them using word-similarity calculations. We then use a K-means model to cluster the expanded words and redefine the topic categories. Next, we use the LDA model to extract keywords for the various kinds of insurance information on the "Pingan Insurance" website and, with the help of word vector similarity, analyze the likelihood that users with different demands will purchase various types of insurance. Finally, the validity of the method is verified on Microblog user information. The experimental results show that the proposed LDA-CBOW expansion method improves accuracy, recall, and F1 score over the traditional LDA model, which demonstrates the method's feasibility. The results of this paper will help insurance companies accurately grasp the preferences of Microblog users, understand users' potential insurance needs in a timely manner, and lay a foundation for personalized recommendation of insurance products.
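    A compact sketch of this LDA-CBOW pipeline, using gensim and scikit-learn on a toy corpus, is shown below. The corpus, topic counts, expansion depth, and cluster count are illustrative assumptions, not the paper's settings.

```python
# Sketch of the LDA-CBOW pipeline: LDA candidate topics, CBOW word vectors,
# similarity-based expansion, then K-means to redefine topic categories.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec
from sklearn.cluster import KMeans
import numpy as np

docs = [["car", "accident", "insurance", "claim"],
        ["health", "hospital", "insurance", "cover"],
        ["travel", "flight", "delay", "insurance"]]  # toy tokenized microblogs

# Step 1: LDA for candidate topic words
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
topic_words = {w for t in range(2) for w, _ in lda.show_topic(t, topn=3)}

# Step 2: CBOW vectors (sg=0 selects CBOW in gensim); expand each topic
# word with its nearest neighbours by vector similarity
w2v = Word2Vec(docs, vector_size=32, min_count=1, sg=0, seed=0)
expanded = set(topic_words)
for w in topic_words:
    expanded.update(sim for sim, _ in w2v.wv.most_similar(w, topn=2))

# Step 3: K-means over the expanded word vectors redefines topic categories
words = sorted(expanded)
vecs = np.array([w2v.wv[w] for w in words])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)
print(dict(zip(words, labels)))
```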

    FUSING PHYSICAL AND SOCIAL SENSORS FOR SITUATION AWARENESS

    Ph.D. (Doctor of Philosophy)

    Neighborhood preserving discrimination for rotor fault feature data set dimensionality reduction

    NPP (Neighborhood Preserving Projections) is an incremental subspace learning method that preserves the local neighborhood geometry of the data. To improve the discriminatory power of NPP, the NPD (Neighborhood Preserving Discrimination) algorithm is proposed for feature dimensionality reduction of rotor system fault data sets. The graph-theoretic Floyd algorithm and MMC (Maximum Margin Criterion) are incorporated into NPP, which allows NPD to avoid the short-circuit problem that occurs in high-curvature, high-dimensional data sets while enhancing discriminative information during dimensionality reduction. In addition, NPD keeps the manifold structure of the data set unchanged. Finally, a rotor-bearing experiment was carried out to verify the effectiveness of the NPD method.
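    Two ingredients named above can be sketched briefly: geodesic distances computed with the Floyd algorithm over a k-nearest-neighbour graph, which is what prevents short-circuiting across a curved manifold, and an MMC-style projection that maximizes tr(W'(Sb - Sw)W). The sketch below is illustrative and does not reproduce the full NPD algorithm.

```python
# Illustrative sketch of two NPD ingredients: Floyd-Warshall geodesic
# distances on a kNN graph, and an MMC-style discriminant projection.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def geodesic_distances(X, k=5):
    """All-pairs shortest paths along the kNN graph (avoids short circuits)."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")
    return shortest_path(knn, method="FW", directed=False)

def mmc_projection(X, y, dim=2):
    """Project onto the top eigenvectors of Sb - Sw (maximum margin criterion)."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
        Sw += (Xc - mc).T @ (Xc - mc)
    vals, vecs = np.linalg.eigh(Sb - Sw)  # symmetric, so eigh is safe
    return vecs[:, np.argsort(vals)[::-1][:dim]]

# toy stand-ins for rotor fault feature vectors and fault classes
X = np.random.default_rng(0).normal(size=(60, 8))
y = np.repeat([0, 1, 2], 20)
D = geodesic_distances(X)         # manifold-aware distances for the graph step
X_low = X @ mmc_projection(X, y)  # discriminative low-dimensional features
```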

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II.

    Visual Analytics Methods for Exploring Geographically Networked Phenomena

    The connections between different entities define different kinds of networks, and many such networked phenomena are influenced by their underlying geographical relationships. By integrating network and geospatial analysis, the goal is to extract information about interaction topologies and their relationships to related geographical constructs. In recent decades, much work has been done on analyzing the dynamics of spatial networks; however, many challenges remain in this field. First, the development of social media and transportation technologies has greatly reshaped the topologies of communication between different geographical regions. Second, the distance metrics used in spatial analysis should be enriched with the underlying network information to develop accurate models. Visual analytics provides methods for data exploration, pattern recognition, and knowledge discovery. However, despite the long history of geovisualization and network visual analytics, little work has been done to develop visual analytics tools that focus specifically on geographically networked phenomena. This thesis develops a variety of visualization methods to present data values and geospatial network relationships, enabling users to interactively explore the data. Users can investigate the connections in both virtual and geospatial networks, and the underlying geographical context can be used to improve knowledge discovery. The focus of this thesis is on social media analysis and geographical hotspot optimization. A framework is proposed for social network analysis to unveil the links between social media interactions and the underlying networked geospatial phenomena. This is combined with a novel hotspot approach to improve hotspot identification and boundary detection using networks extracted from urban infrastructure. Several real-world problems have been analyzed using the proposed visual analytics frameworks. The primary studies and experiments show that visual analytics methods can help analysts explore such data from multiple perspectives and aid the knowledge discovery process.
    Doctoral Dissertation, Computer Science, 201