A Survey on Social Media Anomaly Detection
Social media anomaly detection is of critical importance to prevent malicious
activities such as bullying, terrorist attack planning, and fraud information
dissemination. With the recent popularity of social media, new types of
anomalous behaviors arise, causing concerns from various parties. While a large
amount of work has been dedicated to traditional anomaly detection problems,
we observe a surge of research interest in the new realm of social media
anomaly detection. In this paper, we present a survey of existing approaches to
address this problem. We focus on the new types of anomalous phenomena in
social media and review the recently developed techniques to detect those special
types of anomalies. We provide a general overview of the problem domain, common
formulations, existing methodologies and potential directions. With this work,
we hope to draw the research community's attention to this
challenging problem and open up new directions to which we can contribute in the
future. Comment: 23 page
Spatio-Temporal Data Mining: A Survey of Problems and Methods
Large volumes of spatio-temporal data are increasingly collected and studied
in diverse domains including climate science, social sciences, neuroscience,
epidemiology, transportation, mobile health, and Earth sciences.
Spatio-temporal data differ from relational data, for which computational
approaches have been developed in the data mining community for multiple decades, in
that both spatial and temporal attributes are available in addition to the
actual measurements/attributes. The presence of these attributes introduces
additional challenges that need to be dealt with. Approaches for mining
spatio-temporal data have been studied for over a decade in the data mining
community. In this article we present a broad survey of this relatively young
field of spatio-temporal data mining. We discuss different types of
spatio-temporal data and the relevant data mining questions that arise in the
context of analyzing each of these datasets. Based on the nature of the data
mining problem studied, we classify literature on spatio-temporal data mining
into six major categories: clustering, predictive learning, change detection,
frequent pattern mining, anomaly detection, and relationship mining. We discuss
the various forms of spatio-temporal data mining problems in each of these
categories. Comment: Accepted for publication at ACM Computing Survey
Inferring Multilateral Relations from Dynamic Pairwise Interactions
Correlations between anomalous activity patterns can yield pertinent
information about complex social processes: a significant deviation from normal
behavior, exhibited simultaneously by multiple pairs of actors, provides
evidence for some underlying relationship involving those pairs---i.e., a
multilateral relation. We introduce a new nonparametric Bayesian latent
variable model that explicitly captures correlations between anomalous
interaction counts and uses these shared deviations from normal activity
patterns to identify and characterize multilateral relations. We showcase our
model's capabilities using the newly curated Global Database of Events,
Location, and Tone, a dataset that has seen considerable interest in the social
sciences and the popular press, but which is largely unexplored by the
machine learning community. We provide a detailed analysis of the latent
structure inferred by our model and show that the multilateral relations
correspond to major international events and long-term international
relationships. These findings lead us to recommend our model for any
data-driven analysis of interaction networks where dynamic interactions over
the edges provide evidence for latent social structure. Comment: NIPS 2013 Workshop on Frontiers of Network Analysi
Finding Likely Errors with Bayesian Specifications
We present a Bayesian framework for learning probabilistic specifications
from large, unstructured code corpora, and a method to use this framework to
statically detect anomalous, hence likely buggy, program behavior. The
distinctive insight here is to build a statistical model that correlates all
specifications hidden inside a corpus with the syntax and observed behavior of
programs that implement these specifications. During the analysis of a
particular program, this model is conditioned into a posterior distribution
that prioritizes specifications that are relevant to this program. This allows
accurate program analysis even if the corpus is highly heterogeneous. The
problem of finding anomalies is now framed quantitatively, as a problem of
computing a distance between a "reference distribution" over program behaviors
that our model expects from the program, and the distribution over behaviors
that the program actually produces.
We present a concrete embodiment of our framework that combines a topic model
and a neural network model to learn specifications, and queries the learned
models to compute anomaly scores. We evaluate this implementation on the task
of detecting anomalous usage of Android APIs. Our encouraging experimental
results show that the method can automatically discover subtle errors in
Android applications in the wild, and has high precision and recall compared to
competing probabilistic approaches.
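The idea of scoring a program by the distance between a reference distribution and the distribution of behaviors it actually produces can be illustrated with a small sketch. The abstract does not say which distance is used; the Kullback-Leibler divergence below is an assumption chosen for illustration, and the function name `kl_divergence`, the smoothing constant `eps`, and the toy call-sequence behaviors are all hypothetical.

```python
import math

def kl_divergence(observed, reference, eps=1e-9):
    """KL(observed || reference) over discrete behaviors.

    KL divergence is an assumed stand-in for the paper's unspecified distance.
    Both arguments map a behavior label to its probability.
    """
    score = 0.0
    for behavior in set(observed) | set(reference):
        p = observed.get(behavior, 0.0)
        # Smooth the reference so the logarithm stays defined for unseen behaviors.
        q = max(reference.get(behavior, 0.0), eps)
        if p > 0.0:
            score += p * math.log(p / q)
    return score

# Hypothetical distributions over API call sequences.
reference = {"open;read;close": 0.9, "open;read": 0.1}
typical = {"open;read;close": 0.9, "open;read": 0.1}
suspicious = {"open;read;close": 0.1, "open;read": 0.9}

typical_score = kl_divergence(typical, reference)        # matches the reference
suspicious_score = kl_divergence(suspicious, reference)  # deviates strongly
```

A program whose behaviors match the reference scores zero, while one that mostly omits the closing call receives a clearly larger score and would be ranked first for inspection.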
dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems
High performance computing (HPC) facilities consist of a large number of
interconnected computing units (or nodes) that execute highly complex
scientific simulations to support scientific research. Monitoring such
facilities, in real-time, is essential to ensure that the system operates at
peak efficiency. Such systems are typically monitored using a variety of
measurement and log data which capture the state of the various components
within the system at regular intervals of time. As modern HPC systems grow in
capacity and complexity, the data produced by current resource monitoring tools
has reached a scale at which visual monitoring by analysts is no longer
feasible. We propose a method that transforms the multi-dimensional output of
resource monitoring tools to a low-dimensional representation that facilitates
the understanding of the behavior of an HPC system.
The proposed method automatically extracts the low-dimensional signal in the
data which can be used to track the system efficiency and identify performance
anomalies. The method models the resource usage data as a three dimensional
tensor (capturing resource usage of all compute nodes for different resources
over time). A dynamic matrix factorization algorithm, called dynamicMF, is
proposed to extract a low-dimensional temporal signal for each node, which is
subsequently fed into an anomaly detector. Results on resource usage data
collected from the Lonestar 4 system at the Texas Advanced Computing Center
show that the identified anomalies are correlated with actual anomalous events
reported in the system log messages. Comment: 11 page
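The pipeline described in this abstract (reduce each node's multi-resource usage to a low-dimensional temporal signal, then pass that signal to an anomaly detector) can be sketched as follows. This is not the dynamicMF algorithm: as a stand-in it projects one node's time-by-resource slice onto its leading direction using power iteration, and it uses a z-score rule as the detector. The function names, the threshold, and the toy usage data are all assumptions.

```python
import math

def leading_direction(rows, iters=100):
    """Power iteration for the top right-singular vector of `rows`."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Compute w = R^T (R v) without forming R^T R explicitly.
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def temporal_signal(node_usage):
    """Collapse one node's (time x resource) matrix to one value per time step."""
    dim = len(node_usage[0])
    means = [sum(r[j] for r in node_usage) / len(node_usage) for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in node_usage]
    v = leading_direction(centered)
    return [sum(r[j] * v[j] for j in range(dim)) for r in centered]

def flag_anomalies(signal, threshold=3.0):
    """Return time indices whose absolute z-score exceeds the threshold."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((s - mean) ** 2 for s in signal) / len(signal))
    if std == 0.0:
        return []
    return [t for t, s in enumerate(signal) if abs(s - mean) / std > threshold]

# Hypothetical [cpu, memory] usage for one node over 20 time steps.
usage = [[0.5 + 0.01 * (t % 3), 0.4 + 0.01 * (t % 2)] for t in range(20)]
usage[12] = [0.99, 0.95]  # injected spike

signal = temporal_signal(usage)
anomalies = flag_anomalies(signal)
```

In the setting of the abstract the same reduction would be applied to every node's slice of the three-dimensional tensor, giving one temporal signal per node.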
Composite Behavioral Modeling for Identity Theft Detection in Online Social Networks
In this work, we aim at building a bridge from poor behavioral data to an
effective, quick-response, and robust behavior model for online identity theft
detection. We concentrate on this issue in online social networks (OSNs) where
users usually have composite behavioral records, consisting of
multi-dimensional low-quality data, e.g., offline check-ins and online user
generated content (UGC). As an insightful result, we find that there is a
complementary effect among different dimensions of records for modeling users'
behavioral patterns. To deeply exploit such a complementary effect, we propose
a joint model to capture both online and offline features of a user's composite
behavior. We evaluate the proposed joint model by comparing it with several typical
models on two real-world datasets: Foursquare and Yelp. In the widely-used
setting of theft simulation (simulating thefts via behavioral replacement), the
experimental results show that our model outperforms the existing ones, with
the AUC values in Foursquare and in Yelp, respectively.
Particularly, the recall (True Positive Rate) can reach up to in
Foursquare and in Yelp with the corresponding disturbance rate (False
Positive Rate) below . It is worth mentioning that these performances can
be achieved by examining only one composite behavior (visiting a place and
posting a tip online simultaneously) per authentication, which guarantees the
low response latency of our method. This study would give the cybersecurity
community new insights into whether and how a real-time online identity
authentication can be improved via modeling users' composite behavioral
patterns.
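The evaluation metrics named in this abstract (AUC, recall or True Positive Rate, and disturbance rate or False Positive Rate) can be computed as in the sketch below. The scores, labels, and threshold are made-up illustrations and are not the paper's results; the function names are hypothetical.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when scores >= threshold are flagged as theft."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

def auc(scores, labels):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical theft scores; label 1 marks a simulated theft, 0 a genuine user.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

tpr, fpr = rates_at_threshold(scores, labels, threshold=0.5)
area = auc(scores, labels)
```

Lowering the threshold raises recall at the cost of a higher disturbance rate, which is why the abstract reports recall together with a bound on the False Positive Rate.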
Sleep Analytics and Online Selective Anomaly Detection
We introduce a new problem, the Online Selective Anomaly Detection (OSAD), to
model a specific scenario emerging from research in sleep science. Scientists
have segmented sleep into several stages and stage two is characterized by two
patterns (or anomalies) in the EEG time series recorded on sleep subjects.
These two patterns are sleep spindle (SS) and K-complex. The OSAD problem was
introduced to design a residual system, where all anomalies (known and unknown)
are detected but the system only triggers an alarm when non-SS anomalies
appear. The solution of the OSAD problem required us to combine techniques from
both machine learning and control theory. Experiments on data from real
subjects attest to the effectiveness of our approach. Comment: Submitted to 20th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining 201
Energy-based Models for Video Anomaly Detection
Automated detection of abnormalities in data has been widely studied in
recent years because of its diverse practical applications, including
video surveillance, industrial damage detection and network intrusion
detection. However, building an effective anomaly detection system is a
non-trivial task, since it requires tackling the shortage of annotated
data, the inability to define anomalous objects explicitly, and the
high cost of feature engineering. Unlike existing approaches
which only partially solve these problems, we develop a unique framework to
cope with the problems above simultaneously. Instead of handling the ambiguous
definition of anomaly objects, we propose to work with regular patterns whose
unlabeled data is abundant and usually easy to collect in practice. This allows
our system to be trained in a completely unsupervised manner and liberates
us from the need for costly data annotation. By learning a generative model that
captures the normality distribution in data, we can isolate abnormal data points
that result in low normality scores (high abnormality scores). Moreover, by
leveraging the power of generative networks, i.e. energy-based models, we are
also able to learn the feature representation automatically rather than
relying on hand-crafted features that have been dominating anomaly detection
research over many decades. We demonstrate our proposal on the specific
application of video anomaly detection and the experimental results indicate
that our method performs better than baselines and is comparable with
state-of-the-art methods on many benchmark video anomaly detection datasets.
Detection of Review Abuse via Semi-Supervised Binary Multi-Target Tensor Decomposition
Product reviews and ratings on e-commerce websites provide customers with
detailed insights about various aspects of the product such as quality,
usefulness, etc. Since they influence customers' buying decisions, product
reviews have become a fertile ground for abuse by sellers (colluding with
reviewers) to promote their own products or to tarnish the reputation of
competitors' products. In this paper, our focus is on detecting such abusive
entities (both sellers and reviewers) by applying tensor decomposition to the
product review data. While tensor decomposition is mostly unsupervised, we
formulate our problem as a semi-supervised binary multi-target tensor
decomposition, to take advantage of currently known abusive entities. We
empirically show that our multi-target semi-supervised model achieves higher
precision and recall in detecting abusive entities as compared to unsupervised
techniques. Finally, we show that our proposed stochastic partial natural
gradient inference for our model empirically achieves faster convergence than
stochastic gradient and Online-EM with sufficient statistics. Comment: Accepted to the 25th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, 2019. Contains supplementary material. arXiv admin note: text
overlap with arXiv:1804.0383
The New Abnormal: Network Anomalies in the AI Era
Anomaly detection aims at finding unexpected patterns in data. It has been used in several problems in computer networks, from the detection of port scans and DDoS attacks to the monitoring of time series collected from Internet monitoring systems. Data-driven approaches and machine learning have seen widespread application in anomaly detection too, and this trend has been accelerated by recent developments in Artificial Intelligence research. This chapter summarizes recent progress in anomaly detection research. In particular, we evaluate how developments in AI algorithms bring new possibilities for anomaly detection. We cover new representation learning techniques such as Generative Adversarial Networks and Autoencoders, as well as techniques that can be used to improve models learned with machine learning algorithms, such as reinforcement learning. We survey both research works and tools implementing AI algorithms for anomaly detection. We found that the novel algorithms, while successful in other fields, have hardly been applied to networking problems. We conclude the chapter with a case study that illustrates a possible research direction.