Semi-supervised Embedding in Attributed Networks with Outliers
In this paper, we propose a novel framework, called Semi-supervised Embedding
in Attributed Networks with Outliers (SEANO), to learn a low-dimensional vector
representation that systematically captures the topological proximity,
attribute affinity and label similarity of vertices in a partially labeled
attributed network (PLAN). Our method is designed to work in both transductive
and inductive settings while explicitly alleviating noise effects from
outliers. Experimental results on various datasets drawn from the web, text and
image domains demonstrate the advantages of SEANO over state-of-the-art methods
in semi-supervised classification under transductive as well as inductive
settings. We also show that a subset of parameters in SEANO is interpretable as
an outlier score and can significantly outperform baseline methods when applied
to detecting network outliers. Finally, we present the use of SEANO in a
challenging real-world setting -- flood mapping of satellite images -- and show
that it is able to outperform modern remote sensing algorithms for this task.
Comment: in Proceedings of SIAM International Conference on Data Mining
(SDM'18)
A Survey of Graph-based Deep Learning for Anomaly Detection in Distributed Systems
Anomaly detection is a crucial task in complex distributed systems. A
thorough understanding of the requirements and challenges of anomaly detection
is pivotal to the security of such systems, especially for real-world
deployment. While there are many works and application domains that deal with
this problem, few have attempted to provide an in-depth look at such systems.
In this survey, we explore the potential of graph-based algorithms to identify
anomalies in distributed systems. These systems can be heterogeneous or
homogeneous, which can result in distinct requirements. One of our objectives
is to provide an in-depth look at graph-based approaches to conceptually
analyze their capability to handle real-world challenges such as heterogeneity
and dynamic structure. This study gives an overview of the State-of-the-Art
(SotA) research articles in the field and compares and contrasts their
characteristics. To facilitate a more comprehensive understanding, we present
three systems with varying abstractions as use cases. We examine the specific
challenges involved in anomaly detection within such systems. Subsequently, we
elucidate the efficacy of graphs in such systems and explicate their
advantages. We then delve into the SotA methods and highlight their strengths
and weaknesses, pointing out areas for possible improvement and future
work.
Comment: The first two authors (A. Danesh Pazho and G. Alinezhad Noghre)
contributed equally. The article is accepted by IEEE Transactions on Knowledge
and Data Engineering
A Survey of Imbalanced Learning on Graphs: Problems, Techniques, and Future Directions
Graphs represent interconnected structures prevalent in a myriad of
real-world scenarios. Effective graph analytics, such as graph learning
methods, enables users to gain profound insights from graph data, underpinning
various tasks including node classification and link prediction. However, these
methods often suffer from data imbalance, a common issue in graph data where
certain segments possess abundant data while others are scarce, thereby leading
to biased learning outcomes. This necessitates the emerging field of imbalanced
learning on graphs, which aims to correct these data distribution skews for
more accurate and representative learning outcomes. In this survey, we embark
on a comprehensive review of the literature on imbalanced learning on graphs.
We begin by providing a definitive understanding of the concept and related
terminologies, establishing a strong foundational understanding for readers.
Following this, we propose two comprehensive taxonomies: (1) the problem
taxonomy, which describes the forms of imbalance we consider, the associated
tasks, and potential solutions; (2) the technique taxonomy, which details key
strategies for addressing these imbalances, and aids readers in their method
selection process. Finally, we suggest prospective future directions for both
problems and techniques within the sphere of imbalanced learning on graphs,
fostering further innovation in this critical area.
Comment: The collection of awesome literature on imbalanced learning on
graphs: https://github.com/Xtra-Computing/Awesome-Literature-ILoG
Unsupervised Anomaly Detection of High Dimensional Data with Low Dimensional Embedded Manifold
Anomaly detection techniques are meant to identify anomalies in large volumes of seemingly homogeneous data; doing so can lead to timely, pivotal and actionable decisions and prevent potential human, financial and informational loss. In anomaly detection, an often encountered situation is the absence of prior knowledge about the nature of anomalies. Such circumstances call for ‘unsupervised’ learning-based anomaly detection techniques. Compared to their ‘supervised’ counterparts, which have the luxury of a labeled training dataset containing both normal and anomalous samples, unsupervised problems are far more difficult. Moreover, high dimensional streaming data from the many interconnected sensors present in modern industries make the task more challenging. Investigating how to address these challenges is the overarching theme of this dissertation.
In this dissertation, the fundamental issue of the similarity measure among observations, which is a central piece of any anomaly detection technique, is reassessed. The manifold hypothesis suggests the possibility of a low dimensional manifold structure embedded in high dimensional data. In the presence of such a structured space, traditional similarity measures fail to measure the true intrinsic similarity. In light of this, reevaluating the notion of similarity measure seems more pressing than providing incremental improvements over any of the existing techniques. A graph theoretic similarity measure is proposed to differentiate, and thus identify, the anomalies from normal observations. Specifically, the minimum spanning tree (MST), a graph-based approach, is proposed to approximate the similarities among data points in the presence of a high dimensional structured space. It can track the structure of the embedded manifold better than the existing measures and helps to distinguish the anomalies from normal observations. This dissertation further investigates three different aspects of the anomaly detection problem and develops three sets of solution approaches, all of them revolving around the newly proposed MST based similarity measure.
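As a rough illustration of why an MST-based measure can follow an embedded manifold where Euclidean distance cannot, the sketch below measures pairwise distance along the Euclidean MST of points sampled from a half circle. It is a minimal sketch only, not the dissertation's actual similarity measure: the helper name `mst_path_distances`, the toy data, and the choice of path length along the tree as the distance are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def mst_path_distances(X):
    """Pairwise distances measured along the Euclidean MST of X.

    Hypothetical helper for illustration; assumes no duplicate points,
    because SciPy treats zero entries in a dense graph as missing edges.
    """
    dense = squareform(pdist(X))          # full Euclidean distance matrix
    mst = minimum_spanning_tree(dense)    # sparse matrix holding n - 1 edges
    # Path length between every pair of points, walking only along MST edges.
    return shortest_path(mst, directed=False)


# Toy manifold (assumed data): 20 points on a half circle of radius 1.
theta = np.linspace(0.0, np.pi, 20)
X = np.column_stack([np.cos(theta), np.sin(theta)])

D_euc = squareform(pdist(X))
D_mst = mst_path_distances(X)

# The two end points are 2.0 apart in a straight line, but the MST path
# follows the arc, so the MST-based distance is larger.
print(D_euc[0, -1], D_mst[0, -1])
```

On this toy example the straight-line distance cuts across the half circle, while the MST path stays on the sampled curve, which is the intuition behind using the tree to approximate intrinsic similarity.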
In the first part of the dissertation, a local MST (LoMST) based anomaly detection approach is proposed to detect anomalies using the data in the original space. A two-step procedure is developed to detect both cluster and point anomalies. The next two sets of methods, proposed in the subsequent two parts of the dissertation, address anomaly detection in a reduced data space. In the second part of the dissertation, a neighborhood structure assisted version of the nonnegative matrix factorization approach (NS-NMF) is proposed. To detect anomalies, it uses the neighborhood information captured by a sparse MST similarity matrix along with the original attribute information. To meet industry demands, online versions of both LoMST and NS-NMF are also developed for real-time anomaly detection. In the last part of the dissertation, a graph regularized autoencoder is proposed which uses an MST regularizer in addition to the original loss function and is thus capable of maintaining the local invariance property. All of the approaches proposed in the dissertation are tested on 20 benchmark datasets and one real-life hydropower dataset. When compared with state-of-the-art approaches, all three approaches produce statistically significantly better outcomes.
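The idea of scoring each observation by a local MST can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's LoMST procedure: the function name `local_mst_scores`, the neighborhood size `k`, and the use of the total edge weight of the local tree as the score are hypothetical choices, and the two-step handling of cluster and point anomalies is not reproduced.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist


def local_mst_scores(X, k=5):
    """Score each point by the total edge weight of the MST built over
    the point and its k nearest neighbors (hypothetical, simplified score).

    Assumes no duplicate points, because SciPy treats zero entries in a
    dense graph as missing edges.
    """
    D = cdist(X, X)
    scores = np.empty(len(X))
    for i in range(len(X)):
        # Indices of the point itself (distance 0) and its k nearest neighbors.
        nbrs = np.argsort(D[i])[: k + 1]
        local = D[np.ix_(nbrs, nbrs)]
        # A point far from its neighbors forces a long edge into the local tree.
        scores[i] = minimum_spanning_tree(local).sum()
    return scores


# Toy data (assumed): one Gaussian cluster plus a single distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[20.0, 20.0]]])

scores = local_mst_scores(X, k=5)
print(int(np.argmax(scores)))  # index of the highest-scoring point
```

A point whose local tree is much heavier than those of its neighbors is a candidate anomaly; how the scores are thresholded or compared is left open here.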
“Industry 4.0” is a reality now, and it calls for anomaly detection techniques capable of processing large amounts of high dimensional data generated in real time. The proposed MST based similarity measure, together with the individual techniques developed in this dissertation, is equipped to tackle each of these issues and to provide an effective and reliable real-time anomaly identification platform.