32,031 research outputs found
HoloDetect: Few-Shot Learning for Error Detection
We introduce a few-shot learning framework for error detection. We show that
data augmentation (a form of weak supervision) is key to training high-quality,
ML-based error detection models that require minimal human involvement. Our
framework consists of two parts: (1) an expressive model to learn rich
representations that capture the inherent syntactic and semantic heterogeneity
of errors; and (2) a data augmentation model that, given a small seed of clean
records, uses dataset-specific transformations to automatically generate
additional training data. Our key insight is to learn data augmentation
policies from the noisy input dataset in a weakly supervised manner. We show
that our framework detects errors with an average precision of ~94% and an
average recall of ~93% across a diverse array of datasets that exhibit
different types and amounts of errors. We compare our approach to a
comprehensive collection of error detection methods, ranging from traditional
rule-based methods to ensemble-based and active learning approaches. We show
that data augmentation yields an average improvement of 20 F1 points while it
requires access to 3x fewer labeled examples compared to other ML approaches.Comment: 18 pages
Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data show the
scalability and accuracy of our algorithms.Comment: 12 page
Intrusion Detection Systems Using Adaptive Regression Splines
Past few years have witnessed a growing recognition of intelligent techniques
for the construction of efficient and reliable intrusion detection systems. Due
to increasing incidents of cyber attacks, building effective intrusion
detection systems (IDS) are essential for protecting information systems
security, and yet it remains an elusive goal and a great challenge. In this
paper, we report a performance analysis between Multivariate Adaptive
Regression Splines (MARS), neural networks and support vector machines. The
MARS procedure builds flexible regression models by fitting separate splines to
distinct intervals of the predictor variables. A brief comparison of different
neural network learning algorithms is also given
From Causes for Database Queries to Repairs and Model-Based Diagnosis and Back
In this work we establish and investigate connections between causes for
query answers in databases, database repairs wrt. denial constraints, and
consistency-based diagnosis. The first two are relatively new research areas in
databases, and the third one is an established subject in knowledge
representation. We show how to obtain database repairs from causes, and the
other way around. Causality problems are formulated as diagnosis problems, and
the diagnoses provide causes and their responsibilities. The vast body of
research on database repairs can be applied to the newer problems of computing
actual causes for query answers and their responsibilities. These connections,
which are interesting per se, allow us, after a transition -inspired by
consistency-based diagnosis- to computational problems on hitting sets and
vertex covers in hypergraphs, to obtain several new algorithmic and complexity
results for database causality.Comment: To appear in Theory of Computing Systems. By invitation to special
issue with extended papers from ICDT 2015 (paper arXiv:1412.4311
Network Interdiction Using Adversarial Traffic Flows
Traditional network interdiction refers to the problem of an interdictor
trying to reduce the throughput of network users by removing network edges. In
this paper, we propose a new paradigm for network interdiction that models
scenarios, such as stealth DoS attack, where the interdiction is performed
through injecting adversarial traffic flows. Under this paradigm, we first
study the deterministic flow interdiction problem, where the interdictor has
perfect knowledge of the operation of network users. We show that the problem
is highly inapproximable on general networks and is NP-hard even when the
network is acyclic. We then propose an algorithm that achieves a logarithmic
approximation ratio and quasi-polynomial time complexity for acyclic networks
through harnessing the submodularity of the problem. Next, we investigate the
robust flow interdiction problem, which adopts the robust optimization
framework to capture the case where definitive knowledge of the operation of
network users is not available. We design an approximation framework that
integrates the aforementioned algorithm, yielding a quasi-polynomial time
procedure with poly-logarithmic approximation ratio for the more challenging
robust flow interdiction. Finally, we evaluate the performance of the proposed
algorithms through simulations, showing that they can be efficiently
implemented and yield near-optimal solutions
- …