
    HoloDetect: Few-Shot Learning for Error Detection

    We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.
    Comment: 18 pages

    Efficient Discovery of Ontology Functional Dependencies

    Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors require user input to resolve because domain expertise defines the specific terminology and relationships. For example, in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen', which can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms, and a linear inference procedure. We then develop effective algorithms for discovering OFDs, and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data shows the scalability and accuracy of our algorithms.
    Comment: 12 pages
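The contrast between a purely syntactic FD check and an ontology-aware one can be illustrated with a minimal sketch. This is not the paper's definition or discovery algorithm: it assumes a simplified, synonym-only notion in which two values count as equal when a toy lookup table maps them to the same concept. The table `CANONICAL`, the helper names, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy ontology: surface value -> canonical concept (assumption).
CANONICAL = {
    "Advil": "ibuprofen",
    "ibuprofen": "ibuprofen",
    "Tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

def canon(value):
    """Map a value to its canonical concept; unknown values map to themselves."""
    return CANONICAL.get(value, value)

def violations(rows, lhs, rhs, normalize=lambda v: v):
    """Return the lhs groups whose rhs values disagree after normalization.

    With the identity normalizer this is a plain FD check (lhs -> rhs);
    passing `canon` treats ontology synonyms as equal.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        groups[key].add(normalize(row[rhs]))
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

# Made-up sample data for illustration only.
rows = [
    {"symptom": "headache", "drug": "Advil"},
    {"symptom": "headache", "drug": "ibuprofen"},
    {"symptom": "fever", "drug": "Tylenol"},
    {"symptom": "fever", "drug": "Advil"},
]

plain = violations(rows, ["symptom"], "drug")           # syntactic FD check
aware = violations(rows, ["symptom"], "drug", canon)    # synonym-aware check
```

Under the plain check both groups are flagged, because 'Advil' and 'ibuprofen' differ as strings. Under the synonym-aware check only the 'fever' group remains flagged, since its two drugs map to different concepts.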

    Intrusion Detection Systems Using Adaptive Regression Splines

    The past few years have witnessed a growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. Due to increasing incidents of cyber attacks, building effective intrusion detection systems (IDS) is essential for protecting information systems security, and yet it remains an elusive goal and a great challenge. In this paper, we report a performance comparison of Multivariate Adaptive Regression Splines (MARS), neural networks, and support vector machines. The MARS procedure builds flexible regression models by fitting separate splines to distinct intervals of the predictor variables. A brief comparison of different neural network learning algorithms is also given.
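MARS models are commonly written as a sum of hinge functions of the form max(0, x - t) and max(0, t - x), where t is a knot. The sketch below only evaluates such a model for one predictor to show how the slope changes across intervals. The knot, coefficients, and function names are made up for illustration, and the knot-selection passes that an actual MARS fit performs are not shown.

```python
def hinge(x, knot, direction):
    """Hinge basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum of coef * hinge(x, knot, direction)."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)

# Hypothetical model with a single knot at x = 4 (values are assumptions):
# slope 1 to the left of the knot, slope 3 to the right.
INTERCEPT = 2.0
TERMS = [
    (3.0, 4.0, +1),   # contributes 3 * max(0, x - 4)
    (-1.0, 4.0, -1),  # contributes -1 * max(0, 4 - x)
]

left = mars_predict(2.0, INTERCEPT, TERMS)     # left of the knot
at_knot = mars_predict(4.0, INTERCEPT, TERMS)  # both hinges are zero here
right = mars_predict(6.0, INTERCEPT, TERMS)    # right of the knot
```

Each hinge is zero on one side of its knot, so each coefficient affects only one interval of the predictor. This is what lets the model follow a different linear piece on each side.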

    From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back

    In this work we establish and investigate connections between causes for query answers in databases, database repairs with respect to denial constraints, and consistency-based diagnosis. The first two are relatively new research areas in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes, and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problems of computing actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition (inspired by consistency-based diagnosis) to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for database causality.
    Comment: To appear in Theory of Computing Systems. By invitation to special issue with extended papers from ICDT 2015 (paper arXiv:1412.4311)

    Network Interdiction Using Adversarial Traffic Flows

    Traditional network interdiction refers to the problem of an interdictor trying to reduce the throughput of network users by removing network edges. In this paper, we propose a new paradigm for network interdiction that models scenarios, such as stealth DoS attacks, where the interdiction is performed by injecting adversarial traffic flows. Under this paradigm, we first study the deterministic flow interdiction problem, where the interdictor has perfect knowledge of the operation of network users. We show that the problem is highly inapproximable on general networks and is NP-hard even when the network is acyclic. We then propose an algorithm that achieves a logarithmic approximation ratio and quasi-polynomial time complexity for acyclic networks by harnessing the submodularity of the problem. Next, we investigate the robust flow interdiction problem, which adopts the robust optimization framework to capture the case where definitive knowledge of the operation of network users is not available. We design an approximation framework that integrates the aforementioned algorithm, yielding a quasi-polynomial time procedure with poly-logarithmic approximation ratio for the more challenging robust flow interdiction. Finally, we evaluate the performance of the proposed algorithms through simulations, showing that they can be efficiently implemented and yield near-optimal solutions.