Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors require user input to resolve because they
involve domain-specific terminology and relationships. For example, in
pharmaceuticals, 'Advil' \emph{is-a} brand name for 'ibuprofen', a relationship
that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering \emph{Ontology Functional Dependencies} (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation on real data shows the
scalability and accuracy of our algorithms.
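As a toy illustration of the distinction the abstract draws, the sketch below contrasts a syntactic FD check with an ontology-aware one. The ontology dictionary, column names, and function names are invented for illustration and are not from the paper.

```python
# Hypothetical sketch: checking a dependency X -> Y where Y-values may be
# equivalent under an ontology (e.g., 'Advil' is-a brand of 'ibuprofen').
from collections import defaultdict

# Toy is-a/synonym ontology: maps a term to its canonical concept.
ONTOLOGY = {
    "Advil": "ibuprofen",
    "Motrin": "ibuprofen",
    "ibuprofen": "ibuprofen",
    "Tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

def canonical(v):
    return ONTOLOGY.get(v, v)

def fd_violations(rows, lhs, rhs, use_ontology=False):
    """Return LHS values whose RHS values disagree.

    With use_ontology=True, RHS values are compared by their canonical
    ontology concept instead of by strict string equality.
    """
    seen = defaultdict(set)
    for row in rows:
        y = row[rhs]
        seen[row[lhs]].add(canonical(y) if use_ontology else y)
    return {x for x, ys in seen.items() if len(ys) > 1}

rows = [
    {"code": "N02BE", "drug": "Advil"},
    {"code": "N02BE", "drug": "ibuprofen"},
]

print(fd_violations(rows, "code", "drug"))                     # -> {'N02BE'}: syntactic FD violated
print(fd_violations(rows, "code", "drug", use_ontology=True))  # -> set(): OFD holds
```

The same rows thus violate the traditional FD code -> drug but satisfy its ontology-aware counterpart, which is exactly the gap OFDs are meant to close.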
Rapidash: Efficient Constraint Discovery via Rapid Verification
Denial Constraint (DC) is a well-established formalism that captures a wide
range of integrity constraints commonly encountered, including candidate keys,
functional dependencies, and ordering constraints, among others. Given their
significance, there has been considerable research interest in achieving fast
verification and discovery of exact DCs within the database community. Despite
the significant advancements in the field, prior work exhibits notable
limitations when confronted with large-scale datasets. The current
state-of-the-art exact DC verification algorithm has worst-case time complexity
quadratic in the number of rows of the dataset. In the
context of DC discovery, existing methodologies rely on a two-step algorithm
that commences with an expensive data structure-building phase, often requiring
hours to complete even for datasets containing only a few million rows.
Consequently, users are left without any insights into the DCs that hold on
their dataset until this lengthy building phase concludes. In this paper, we
introduce Rapidash, a comprehensive framework for DC verification and
discovery. Our work makes a dual contribution. First, we establish a connection
between orthogonal range search and DC verification. We introduce a novel exact
DC verification algorithm that demonstrates near-linear time complexity,
representing a theoretical improvement over prior work. Second, we propose an
anytime DC discovery algorithm that leverages our novel verification algorithm
to gradually provide DCs to users, eliminating the need for the time-intensive
building phase observed in prior work. To validate the effectiveness of our
algorithms, we conduct extensive evaluations on four large-scale production
datasets. Our results reveal that our DC verification algorithm runs up to 40
times faster than state-of-the-art approaches.
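The core idea of replacing the quadratic pairwise check with range-search-style reasoning can be hinted at with a toy example. The sketch below verifies a single two-predicate DC (no tuple may out-earn another while paying less tax) in O(n log n) time by sorting and tracking a running minimum; it is an illustrative simplification, not Rapidash's actual algorithm, and all names are made up.

```python
def violates_dc(rows):
    """True if some pair (s, t) has s.salary > t.salary and s.tax < t.tax."""
    rows = sorted(rows, key=lambda r: r["salary"])
    min_tax_so_far = float("inf")
    i, n = 0, len(rows)
    while i < n:
        # Group rows with equal salary so ties never count as 'greater than'.
        j = i
        while j < n and rows[j]["salary"] == rows[i]["salary"]:
            j += 1
        # Rows in [i, j) out-earn everything before index i, so a violation
        # exists iff one of them pays less tax than the minimum seen so far.
        if i > 0 and any(rows[k]["tax"] < min_tax_so_far for k in range(i, j)):
            return True
        for k in range(i, j):
            min_tax_so_far = min(min_tax_so_far, rows[k]["tax"])
        i = j
    return False

rows = [
    {"salary": 50, "tax": 10},
    {"salary": 60, "tax": 8},   # earns more, pays less tax -> DC violated
]
print(violates_dc(rows))  # -> True
```

The running minimum plays the role of a one-dimensional range query; DCs with more predicates require genuinely multi-dimensional orthogonal range search, which is where the connection drawn by the paper becomes essential.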
Against _Against Intellectual Property_: a Short Refutation of Meme Communism
This essay is intended to be a refutation of the main thesis in Against Intellectual Property, Kinsella 2008 (hereafter, K8). Points of agreement, relatively trivial disagreement, and irrelevant issues will largely be ignored, as will much repetition of errors in K8. Otherwise, the procedure is to go through K8 quoting various significantly erroneous parts as they arise and explaining the errors involved. It will not be necessary to respond at the same length as K8 itself.
PatchIndex: exploiting approximate constraints in distributed databases
Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets because a small set of values violates the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and demonstrate the performance benefits of PatchIndexes in our evaluation.
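As a rough illustration of an approximate constraint with exceptions, the sketch below computes the "patch" set of rows that must be excluded for a uniqueness constraint to hold; the function name and threshold parameter are hypothetical and are not the PatchIndex API.

```python
# Illustrative only: find the exception set that makes an "approximately
# unique" constraint hold on a column, i.e. the row ids that must be
# excluded for uniqueness. Names and threshold are invented.
from collections import defaultdict

def approx_unique_patches(values, max_exception_ratio=0.01):
    """Return row ids violating uniqueness, or None if too many to qualify."""
    positions = defaultdict(list)
    for i, v in enumerate(values):
        positions[v].append(i)
    # For each duplicated value, all but one occurrence go into the patch set.
    patches = [i for pos in positions.values() if len(pos) > 1 for i in pos[1:]]
    if len(patches) > max_exception_ratio * len(values):
        return None  # column is not even approximately unique
    return patches

vals = list(range(100))
vals[50] = 7   # one duplicate of the value at index 7
print(approx_unique_patches(vals))  # -> [50]
```

A structure built over such a patch set lets the optimizer treat the column as unique for the remaining rows, which is the spirit of the approximate constraints described above.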
Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
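The "simpler results" mentioned above can be illustrated in a few lines of Python; the digit-to-'9', letter-to-'a' pattern convention shown here is a common one, used purely as an example.

```python
# Minimal single-column profiler: null count, distinct count, and the most
# frequent value pattern (digits -> '9', letters -> 'a', rest unchanged).
from collections import Counter

def pattern(value):
    return "".join("9" if c.isdigit() else "a" if c.isalpha() else c
                   for c in str(value))

def profile_column(values):
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_pattern": Counter(pattern(v) for v in non_null).most_common(1)[0][0],
    }

col = ["A-12", "B-34", None, "C-56", "XYZ"]
print(profile_column(col))
# -> {'nulls': 1, 'distinct': 4, 'top_pattern': 'a-99'}
```

The multi-column metadata the survey covers (correlations, unique column combinations, functional and inclusion dependencies) require search over column combinations and are correspondingly harder to compute.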
Stuck in Traffic (SiT) Attacks: A Framework for Identifying Stealthy Attacks that Cause Traffic Congestion
Recent advances in wireless technologies have enabled many new applications
in Intelligent Transportation Systems (ITS) such as collision avoidance,
cooperative driving, congestion avoidance, and traffic optimization. Due to the
vulnerable nature of wireless communication against interference and
intentional jamming, ITS face new challenges to ensure the reliability and the
safety of the overall system. In this paper, we expose a class of stealthy
attacks -- Stuck in Traffic (SiT) attacks -- that aim to cause congestion by
exploiting how drivers make decisions based on smart traffic signs. An attacker
mounting a SiT attack solves a Markov Decision Process problem to find
optimal/suboptimal attack policies in which he/she interferes with a
well-chosen subset of signals that are based on the state of the system. We
apply Approximate Policy Iteration (API) algorithms to derive potent attack
policies. We evaluate their performance on a number of systems and compare them
to other attack policies, including random, myopic, and DoS policies. The
generated policies, albeit suboptimal, are shown to significantly outperform
the other policies, as they maximize the expected cumulative reward from the
attacker's standpoint.
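For readers unfamiliar with the machinery, here is a tabular policy-iteration sketch on a two-state, two-action toy MDP. The paper applies *approximate* policy iteration to much larger traffic-signal systems, so the transition matrix, rewards, and discount factor below are invented solely to show the underlying idea.

```python
# Exact policy iteration on a toy MDP (illustrative values only).
import numpy as np

# P[a][s][t] = transition probability, R[s][a] = immediate attacker reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[1.0, 0.0],                  # state 0
              [0.0, 2.0]])                 # state 1
gamma = 0.9

policy = np.zeros(2, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = np.array([P[policy[s], s] for s in range(2)])
    R_pi = np.array([R[s, policy[s]] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to V.
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print(policy)  # -> [1 1]
```

Approximate policy iteration replaces the exact linear solve with a fitted value estimate, which is what makes the approach tractable on the large state spaces of real traffic systems.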