727 research outputs found
A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can identify errors and remove their contaminating effect on the data set and as such to purify the data for processing. The original outlier detection methods were arbitrary but now, principled and systematic techniques are used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review
Optimally fast incremental Manhattan plane embedding and planar tight span construction
We describe a data structure, a rectangular complex, that can be used to
represent hyperconvex metric spaces that have the same topology (although not
necessarily the same distance function) as subsets of the plane. We show how to
use this data structure to construct the tight span of a metric space given as
an n x n distance matrix, when the tight span is homeomorphic to a subset of
the plane, in time O(n^2), and to add a single point to a planar tight span in
time O(n). As an application of this construction, we show how to test whether
a given finite metric space embeds isometrically into the Manhattan plane in
time O(n^2), and add a single point to the space and re-test whether it has
such an embedding in time O(n).Comment: 39 pages, 15 figure
An O(n^{2.75}) algorithm for online topological ordering
We present a simple algorithm which maintains the topological order of a
directed acyclic graph with n nodes under an online edge insertion sequence in
O(n^{2.75}) time, independent of the number of edges m inserted. For dense
DAGs, this is an improvement over the previous best result of O(min(m^{3/2}
log(n), m^{3/2} + n^2 log(n)) by Katriel and Bodlaender. We also provide an
empirical comparison of our algorithm with other algorithms for online
topological sorting. Our implementation outperforms them on certain hard
instances while it is still competitive on random edge insertion sequences
leading to complete DAGs.Comment: 20 pages, long version of SWAT'06 pape
Search Through Systematic Set Enumeration
In many problem domains, solutions take the form of unordered sets. We present the Set-Enumerations (SE)-tree - a vehicle for representing sets and/or enumerating them in a best-first fashion. We demonstrate its usefulness as the basis for a unifying search-based framework for domains where minimal (maximal) elements of a power set are targeted, where minimal (maximal) partial instantiations of a set of variables are sought, or where a composite decision is not dependent on the order in which its primitive component-decisions are taken. Particular instantiations of SE-tree-based algorithms for some AI problem domains are used to demonstrate the general features of the approach. These algorithms are compared theoretically and empirically with current algorithms
On dualization in products of forests, in
Abstract. Let P = P1 Ă...ĂPn be the product of n partially ordered sets, each with an acyclic precedence graph in which either the in-degree or the out-degree of each element is bounded. Given a subset AâP,it is shown that the set of maximal independent elements of A in P can be incrementally generated in quasi-polynomial time. We discuss some applications in data mining related to this dualization problem
Efficient Loop Detection in Forwarding Networks and Representing Atoms in a Field of Sets
The problem of detecting loops in a forwarding network is known to be
NP-complete when general rules such as wildcard expressions are used. Yet,
network analyzer tools such as Netplumber (Kazemian et al., NSDI'13) or
Veriflow (Khurshid et al., NSDI'13) efficiently solve this problem in networks
with thousands of forwarding rules. In this paper, we complement such
experimental validation of practical heuristics with the first provably
efficient algorithm in the context of general rules. Our main tool is a
canonical representation of the atoms (i.e. the minimal non-empty sets) of the
field of sets generated by a collection of sets. This tool is particularly
suited when the intersection of two sets can be efficiently computed and
represented. In the case of forwarding networks, each forwarding rule is
associated with the set of packet headers it matches. The atoms then correspond
to classes of headers with same behavior in the network. We propose an
algorithm for atom computation and provide the first polynomial time algorithm
for loop detection in terms of number of classes (which can be exponential in
general). This contrasts with previous methods that can be exponential, even in
simple cases with linear number of classes. Second, we introduce a notion of
network dimension captured by the overlapping degree of forwarding rules. The
values of this measure appear to be very low in practice and constant
overlapping degree ensures polynomial number of header classes. Forwarding loop
detection is thus polynomial in forwarding networks with constant overlapping
degree
A higher-dimensional homologically persistent skeleton
Real data is often given as a point cloud, i.e. a finite set of points with pairwise distances between them. An important problem is to detect the topological shape of data â for example, to approximate a point cloud by a low-dimensional non-linear subspace such as an embedded graph or a simplicial complex. Classical clustering methods and principal component analysis work well when data points split into good clusters or lie near linear subspaces of a Euclidean space. Methods from topological data analysis in general metric spaces detect more complicated patterns such as holes and voids that persist for a large interval in a 1-parameter family of shapes associated to a cloud. These features can be visualized in the form of a 1-dimensional homologically persistent skeleton, which optimally extends a minimum spanning tree of a point cloud to a graph with cycles. We generalize this skeleton to higher dimensions and prove its optimality among all complexes that preserve topological features of data at any scale
- âŠ