    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Detecting them can reveal system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

    Optimally fast incremental Manhattan plane embedding and planar tight span construction

    We describe a data structure, a rectangular complex, that can be used to represent hyperconvex metric spaces that have the same topology (although not necessarily the same distance function) as subsets of the plane. We show how to use this data structure to construct the tight span of a metric space given as an n x n distance matrix, when the tight span is homeomorphic to a subset of the plane, in time O(n^2), and to add a single point to a planar tight span in time O(n). As an application of this construction, we show how to test whether a given finite metric space embeds isometrically into the Manhattan plane in time O(n^2), and how to add a single point to the space and re-test whether it has such an embedding in time O(n). Comment: 39 pages, 15 figures

    An O(n^{2.75}) algorithm for online topological ordering

    We present a simple algorithm which maintains the topological order of a directed acyclic graph with n nodes under an online edge insertion sequence in O(n^{2.75}) time, independent of the number of edges m inserted. For dense DAGs, this is an improvement over the previous best result of O(min(m^{3/2} log(n), m^{3/2} + n^2 log(n))) by Katriel and Bodlaender. We also provide an empirical comparison of our algorithm with other algorithms for online topological sorting. Our implementation outperforms them on certain hard instances while remaining competitive on random edge insertion sequences leading to complete DAGs. Comment: 20 pages, long version of SWAT'06 paper
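
    To make the problem setting concrete, here is a minimal sketch of maintaining a topological order under online edge insertions. It is not the O(n^{2.75}) algorithm of the abstract; it uses the simpler affected-region reordering in the style of Pearce and Kelly, and the class and method names are illustrative assumptions.

```python
class OnlineTopoOrder:
    """Maintain a topological order of a DAG under edge insertions.

    Illustrative only: affected-region reordering (Pearce-Kelly style),
    not the algorithm described in the abstract.
    """

    def __init__(self, n):
        self.succ = [set() for _ in range(n)]
        self.pred = [set() for _ in range(n)]
        self.pos = list(range(n))  # pos[v] = index of v in the current order

    def _reach(self, start, adj, lo, hi):
        # Depth-first search restricted to nodes whose position is in [lo, hi].
        seen = {start}
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and lo <= self.pos[y] <= hi:
                    seen.add(y)
                    stack.append(y)
        return seen

    def add_edge(self, u, v):
        if u == v:
            raise ValueError("self-loop would create a cycle")
        if v in self.succ[u]:
            return
        lo, hi = self.pos[v], self.pos[u]
        if lo < hi:  # v currently precedes u, so the order must be repaired
            fwd = self._reach(v, self.succ, lo, hi)
            if u in fwd:
                raise ValueError("edge would create a cycle")
            bwd = self._reach(u, self.pred, lo, hi)
            # Everything that reaches u must precede everything reachable from v.
            nodes = sorted(bwd, key=self.pos.__getitem__) + \
                    sorted(fwd, key=self.pos.__getitem__)
            slots = sorted(self.pos[x] for x in nodes)
            for x, s in zip(nodes, slots):
                self.pos[x] = s
        self.succ[u].add(v)
        self.pred[v].add(u)

    def order(self):
        return sorted(range(len(self.pos)), key=self.pos.__getitem__)
```

    Only nodes positioned between v and u are touched when an inserted edge (u, v) violates the current order, which is why sparse insertion sequences tend to be cheap in practice.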

    Search Through Systematic Set Enumeration

    In many problem domains, solutions take the form of unordered sets. We present the Set-Enumeration (SE)-tree, a vehicle for representing sets and/or enumerating them in a best-first fashion. We demonstrate its usefulness as the basis for a unifying search-based framework for domains where minimal (maximal) elements of a power set are targeted, where minimal (maximal) partial instantiations of a set of variables are sought, or where a composite decision is not dependent on the order in which its primitive component-decisions are taken. Particular instantiations of SE-tree-based algorithms for some AI problem domains are used to demonstrate the general features of the approach. These algorithms are compared theoretically and empirically with current algorithms.
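
    The idea of enumerating sets systematically and best-first can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's framework: each node is extended only with items that come later in a fixed ordering, so every subset is generated exactly once, and a priority queue picks the cheapest frontier node. The function name and the `cost` callback are hypothetical.

```python
import heapq


def se_tree_best_first(items, cost):
    """Yield each subset of `items` exactly once, cheapest frontier node first.

    `cost` maps a tuple of items to a comparable number (an assumed interface).
    """
    # Heap entries are (cost, index tuple); index tuples are unique,
    # so ties on cost are broken deterministically.
    heap = [(cost(()), ())]
    while heap:
        _, idx = heapq.heappop(heap)
        yield tuple(items[i] for i in idx)
        # Children extend the node only with later-ranked items.
        start = idx[-1] + 1 if idx else 0
        for i in range(start, len(items)):
            child = idx + (i,)
            heapq.heappush(heap, (cost(tuple(items[j] for j in child)), child))


subsets = list(se_tree_best_first(["a", "b", "c"], cost=len))
```

    The output is globally sorted by cost only when the cost of a child is never smaller than that of its parent; with a non-monotone cost, the order is best-first over the current frontier only.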

    On dualization in products of forests

    Let P = P1 × ... × Pn be the product of n partially ordered sets, each with an acyclic precedence graph in which either the in-degree or the out-degree of each element is bounded. Given a subset A ⊆ P, it is shown that the set of maximal independent elements of A in P can be incrementally generated in quasi-polynomial time. We discuss some applications in data mining related to this dualization problem.

    Connectivity in the Presence of an Opponent

    Efficient Loop Detection in Forwarding Networks and Representing Atoms in a Field of Sets

    The problem of detecting loops in a forwarding network is known to be NP-complete when general rules such as wildcard expressions are used. Yet, network analyzer tools such as Netplumber (Kazemian et al., NSDI'13) or Veriflow (Khurshid et al., NSDI'13) efficiently solve this problem in networks with thousands of forwarding rules. In this paper, we complement such experimental validation of practical heuristics with the first provably efficient algorithm in the context of general rules. Our main tool is a canonical representation of the atoms (i.e. the minimal non-empty sets) of the field of sets generated by a collection of sets. This tool is particularly well suited to cases where the intersection of two sets can be efficiently computed and represented. In the case of forwarding networks, each forwarding rule is associated with the set of packet headers it matches. The atoms then correspond to classes of headers with the same behavior in the network. We propose an algorithm for atom computation and provide the first polynomial time algorithm for loop detection in terms of the number of classes (which can be exponential in general). This contrasts with previous methods that can be exponential, even in simple cases with a linear number of classes. Second, we introduce a notion of network dimension captured by the overlapping degree of forwarding rules. The values of this measure appear to be very low in practice, and a constant overlapping degree ensures a polynomial number of header classes. Forwarding loop detection is thus polynomial in forwarding networks with constant overlapping degree.
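
    As a small illustration of what the atoms of a field of sets are, the sketch below groups the elements of an explicit finite universe by which sets of the collection contain them. This brute-force approach is an assumption made for clarity: it enumerates the universe, which is infeasible for real header spaces, and it is not the canonical representation the abstract refers to. The function name is hypothetical.

```python
from collections import defaultdict


def atoms(universe, sets):
    """Return the atoms of the field of sets generated by `sets` over `universe`.

    Two elements lie in the same atom exactly when they belong to the same
    members of `sets`, so grouping by membership signature yields the atoms.
    """
    groups = defaultdict(set)
    for x in universe:
        signature = frozenset(i for i, s in enumerate(sets) if x in s)
        groups[signature].add(x)
    return list(groups.values())


# Two overlapping "rules" over a toy universe of eight headers.
header_classes = atoms(range(8), [{0, 1, 2, 3}, {2, 3, 4, 5}])
```

    In the forwarding setting, each resulting class would stand for headers that every rule treats identically, so a loop check needs to consider one representative per class rather than every header.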

    A higher-dimensional homologically persistent skeleton

    Real data is often given as a point cloud, i.e. a finite set of points with pairwise distances between them. An important problem is to detect the topological shape of data, for example, to approximate a point cloud by a low-dimensional non-linear subspace such as an embedded graph or a simplicial complex. Classical clustering methods and principal component analysis work well when data points split into good clusters or lie near linear subspaces of a Euclidean space. Methods from topological data analysis in general metric spaces detect more complicated patterns such as holes and voids that persist for a large interval in a 1-parameter family of shapes associated to a cloud. These features can be visualized in the form of a 1-dimensional homologically persistent skeleton, which optimally extends a minimum spanning tree of a point cloud to a graph with cycles. We generalize this skeleton to higher dimensions and prove its optimality among all complexes that preserve topological features of data at any scale.
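
    The minimum spanning tree that the 1-dimensional skeleton extends can be computed from a point cloud as sketched below. This is a minimal illustration using Kruskal's algorithm with Euclidean distances, which is an assumption; it does not add the cycles that make up the persistent skeleton itself, and the function name is hypothetical.

```python
import math
from itertools import combinations


def minimum_spanning_tree(points):
    """Return MST edges (i, j, length) of a point cloud via Kruskal's algorithm."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((i, j, length))
    return tree


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
mst = minimum_spanning_tree(square)
```

    Sorting all pairs costs O(n^2 log n) for n points, which is acceptable for small clouds; larger inputs would typically restrict the candidate edges first.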