3,914 research outputs found

    Anomaly Detection on Social Data

    The advent of online social media including Facebook, Twitter, Flickr and YouTube has drawn massive attention in recent years. These online platforms generate massive data capturing the behavior of multiple types of human actors as they interact with one another and with resources such as pictures, books and videos. Unfortunately, the openness of these platforms often leaves them highly susceptible to abuse by suspicious entities such as spammers. It therefore becomes increasingly important to automatically identify these suspicious entities and eliminate their threats. We call these suspicious entities anomalies in social data, as they often hold different agendas from normal entities and manifest anomalous behaviors. In this dissertation, we are interested in two kinds of anomalous behaviors in social data, namely the unusual coalition among a collection of entities and the unusual conflicting opinions among entities. The two kinds of anomalous behaviors lead us to define two types of anomalies, namely, anomaly collections of the same entity type and anomalous nodes of different entity types in bipartite graphs. This dissertation introduces two anomaly collection definitions, namely, Extreme Rank Anomalous Collection (or ERAC) and Coherent Anomaly Collection (or CAC). An ERAC is a set of entities that cluster toward the top or bottom ranks when all entities in the population are ranked on certain features. We propose a statistical model to quantify the anomalousness of an ERAC, and present exact as well as heuristic algorithms for finding top-K ERACs. We then propose the follow-up problem of expanding top-K ERACs to anomalous supersets. We apply the algorithms for ERAC detection and expansion on both synthetic and real-life datasets, including a web spam, an IMDB and a Chinese online forum dataset. Results show that our algorithms achieve higher precision than existing spam and anomaly detection methods.
CAC is defined based on ERAC, emphasizing the coherence among members of an ERAC. As top-K ERACs often overlap with each other, for applications where disjoint anomaly collections are of interest, we propose to find top-K disjoint CACs with exact and heuristic algorithms. Experiments on both synthetic and real-life datasets, including a Twitter, a web spam, and a Chinese online forum dataset, show that our approach discovers not only injected anomaly collections in synthetic datasets but also real-life coherent collections of hashtag spammers, web spammers and opinion spammers, which are hard to detect by clustering-based methods. We detect the second type of anomalies in a bipartite graph, where nodes in one partite set represent human actors, nodes in the other partite set represent resources, and edges carry the agreeing and disagreeing opinions from human actors to resources. The anomalousness of nodes in one partite set depends on that of their connected nodes in the other partite set. Previous studies have shown that this mutual dependency can be positive or negative. We integrate both mutual dependency principles to model the anomalous behavior of nodes. We formulate our principles and design an iterative algorithm to simultaneously compute the anomaly scores of nodes in both partite sets. Our method is applied to synthetic graphs, and the results show that our algorithm outperforms existing ones that use only the positive or only the negative mutual dependency principle. Results on two real-life datasets, namely Goodreads and Buzzcity, show that our method is able to detect suspected spammed books in Goodreads and fraudulent publishers in mobile advertising networks with higher precision than existing approaches.

    Long-Range Correlation Underlying Childhood Language and Generative Models

    Long-range correlation, a property of time series exhibiting long-term memory, is mainly studied in the statistical physics domain and has been reported to exist in natural language. Using a state-of-the-art method for such analysis, long-range correlation is first shown to occur in long CHILDES data sets. To understand why, Bayesian generative models of language, originally proposed in the cognitive scientific domain, are investigated. Among representative models, the Simon model was found to exhibit surprisingly good long-range correlation, whereas the Pitman-Yor model was not. Since the Simon model is known not to correctly reflect the vocabulary growth of natural language, a simple new model is devised as a conjunct of the Simon and Pitman-Yor models, such that long-range correlation holds with a correct vocabulary growth rate. The investigation overall suggests that uniform sampling is one cause of long-range correlation and could thus have a relation with actual linguistic processes.
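    As a rough illustration of the uniform sampling mentioned in this abstract, the following is a minimal sketch of the textbook Simon process, not the authors' code or their combined model. The function name `simon_sequence` and the parameter `new_word_prob` are hypothetical names chosen for this sketch: at each step a new word is introduced with a fixed probability, and otherwise a token is copied uniformly at random from the sequence so far.

```python
import random


def simon_sequence(length, new_word_prob, seed=0):
    """Generate `length` tokens (length >= 1) with the textbook Simon process.

    Words are labelled by integers in order of first appearance.
    `new_word_prob` is the probability of introducing a new word at each
    step (an assumed parameter name, not taken from the paper).
    """
    rng = random.Random(seed)
    sequence = [0]      # start from a single word, labelled 0
    next_word_id = 1
    while len(sequence) < length:
        if rng.random() < new_word_prob:
            # Introduce a word that has never been used before.
            sequence.append(next_word_id)
            next_word_id += 1
        else:
            # Copy a token sampled uniformly from the whole history,
            # so frequent words are proportionally more likely to recur.
            sequence.append(rng.choice(sequence))
    return sequence


seq = simon_sequence(10_000, new_word_prob=0.1)
vocab_size = len(set(seq))
```

    In this sketch the expected vocabulary size grows linearly with the sequence length (about `new_word_prob` times the number of tokens), which is consistent with the abstract's remark that the Simon model does not reflect the vocabulary growth of natural language.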

    A multi-modal machine learning approach to detect extreme rainfall events in Sicily

    In 2021, almost 300 mm of rain, nearly half of the average annual rainfall, fell near Catania (Sicily Island, Italy). Such events took place in just a few hours, with dramatic consequences for the environmental, social, economic, and health systems of the region. These phenomena are now very common in various countries all around the world, which is why detecting local extreme rainfall events is a crucial prerequisite for planning actions able to reverse possibly intensified dramatic future scenarios. In this paper, the Affinity Propagation algorithm, a clustering algorithm grounded in machine learning, was applied, to the best of our knowledge for the first time, to detect extreme rainfall areas in Sicily. This was possible by using a large, high-frequency dataset we collected, ranging from 2009 to 2021, which we named RSE (the Rainfall Sicily Extreme dataset). Weather indicators were then employed to validate the results, thus confirming the presence of recent anomalous rainfall events in eastern Sicily. We believe that easy-to-use and multi-modal data science techniques, such as the one proposed in this study, could give rise to significant improvements in policy-making for successfully countering climate change.

    Novel semi-metrics for multivariate change point analysis and anomaly detection

    This paper proposes a new method for determining similarity and anomalies between time series, most practically effective in large collections of (likely related) time series, by measuring distances between structural breaks within such a collection. We introduce a class of semi-metric distance measures, which we term MJ distances. These semi-metrics provide an advantage over existing options such as the Hausdorff and Wasserstein metrics. We prove they have desirable properties, including better sensitivity to outliers, while experiments on simulated data demonstrate that they uncover similarity within collections of time series more effectively. Semi-metrics carry a potential disadvantage: without the triangle inequality, they may not satisfy a "transitivity property of closeness." We analyse this failure with proof and introduce a computational method to investigate it, demonstrating that our semi-metrics violate transitivity infrequently and mildly. Finally, we apply our methods to cryptocurrency and measles data, introducing a judicious application of eigenvalue analysis. Comment: Accepted manuscript. Minor edits since v2. Equal contribution from first two authors
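    The abstract does not define the MJ distances, so they are not reproduced here. As context only, the sketch below computes the standard Hausdorff distance, one of the existing options the abstract names, between two sets of structural-break times. The break times and variable names are made-up values for illustration, not data from the paper.

```python
def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(abs(x - y) for y in b) for x in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of times."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical structural-break times for three time series.
breaks_a = [10, 50, 90]
breaks_b = [12, 48, 91]          # close to breaks_a everywhere
breaks_c = [12, 48, 91, 200]     # same as breaks_b plus one stray break

d_ab = hausdorff(breaks_a, breaks_b)   # 2
d_ac = hausdorff(breaks_a, breaks_c)   # 110
```

    In this toy case a single stray break at time 200 raises the distance from 2 to 110, because the Hausdorff distance is determined entirely by the worst-matched point. This is the kind of outlier behaviour that motivates looking at alternative distance measures between sets of structural breaks.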

    Anchorage: Visual Analysis of Satisfaction in Customer Service Videos via Anchor Events

    Delivering customer services through video communications has brought new opportunities to analyze customer satisfaction for quality management. However, due to the lack of reliable self-reported responses, service providers are troubled by the inadequate estimation of customer services and the tedious investigation into multimodal video recordings. We introduce Anchorage, a visual analytics system to evaluate customer satisfaction by summarizing multimodal behavioral features in customer service videos and revealing abnormal operations in the service process. We leverage the semantically meaningful operations to introduce structured event understanding into videos, which helps service providers quickly navigate to events of interest. Anchorage supports a comprehensive evaluation of customer satisfaction at the service and operation levels and efficient analysis of customer behavioral dynamics via multifaceted visualization views. We extensively evaluate Anchorage through a case study and a carefully designed user study. The results demonstrate its effectiveness and usability in assessing customer satisfaction using customer service videos. We found that introducing event contexts in assessing customer satisfaction can enhance its performance without compromising annotation precision. Our approach can be adapted to situations where unlabelled and unstructured videos are collected along with sequential records. Comment: 13 pages. A preprint version of a publication at IEEE Transactions on Visualization and Computer Graphics (TVCG), 202