
    Efficient Discovery of Ontology Functional Dependencies

    Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors often require user input to resolve due to domain expertise defining specific terminology and relationships. For example, in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen', a relationship that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms and a linear inference procedure. We then develop effective algorithms for discovering OFDs, and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data shows the scalability and accuracy of our algorithms. Comment: 12 pages
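    A minimal sketch of the core idea follows, using a toy synonym/is-a ontology in Python; the relation, the ontology table, and the helper names are illustrative assumptions, not the paper's actual data structures or discovery algorithm.

```python
# Sketch: checking whether an Ontology FD X -> Y holds on a toy relation.
# The ontology maps terms to a canonical "sense" (e.g., brand -> generic drug).
from collections import defaultdict

ONTOLOGY = {
    "Advil": "ibuprofen",
    "Motrin": "ibuprofen",
    "ibuprofen": "ibuprofen",
    "Tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}

def canonical(term):
    """Resolve a value to its ontology sense; unknown terms map to themselves."""
    return ONTOLOGY.get(term, term)

def holds_ofd(rows, lhs, rhs):
    """True if rows agreeing on `lhs` have `rhs` values with the same ontology sense."""
    senses_per_key = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        senses_per_key[key].add(canonical(row[rhs]))
    return all(len(senses) == 1 for senses in senses_per_key.values())

rows = [
    {"patient": "p1", "symptom": "headache", "drug": "Advil"},
    {"patient": "p2", "symptom": "headache", "drug": "ibuprofen"},
]
# A plain FD symptom -> drug fails ("Advil" != "ibuprofen"), but the OFD holds
# because both values share the sense "ibuprofen".
print(holds_ofd(rows, ["symptom"], "drug"))  # True
```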

    Rapidash: Efficient Constraint Discovery via Rapid Verification

    Denial Constraints (DCs) are a well-established formalism that captures a wide range of commonly encountered integrity constraints, including candidate keys, functional dependencies, and ordering constraints, among others. Given their significance, there has been considerable research interest in achieving fast verification and discovery of exact DCs within the database community. Despite significant advancements in the field, prior work exhibits notable limitations when confronted with large-scale datasets. The current state-of-the-art exact DC verification algorithm has quadratic (worst-case) time complexity in the number of rows in the dataset. In the context of DC discovery, existing methodologies rely on a two-step algorithm that commences with an expensive data-structure-building phase, often requiring hours to complete even for datasets containing only a few million rows. Consequently, users are left without any insights into the DCs that hold on their dataset until this lengthy building phase concludes. In this paper, we introduce Rapidash, a comprehensive framework for DC verification and discovery. Our work makes a dual contribution. First, we establish a connection between orthogonal range search and DC verification. We introduce a novel exact DC verification algorithm with near-linear time complexity, representing a theoretical improvement over prior work. Second, we propose an anytime DC discovery algorithm that leverages our novel verification algorithm to gradually provide DCs to users, eliminating the need for the time-intensive building phase observed in prior work. To validate the effectiveness of our algorithms, we conduct extensive evaluations on four large-scale production datasets. Our results reveal that our DC verification algorithm is up to 40 times faster than state-of-the-art approaches. Comment: comments and suggestions are welcome
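    The sketch below illustrates the flavour of this improvement for one fixed denial constraint: instead of comparing all pairs of rows (quadratic), sorting each group and scanning with a running aggregate finds a violating pair in near-linear time. The constraint, column names, and code are illustrative assumptions, not Rapidash's actual range-search-based algorithm.

```python
# Sketch: verify the DC "no pair s, t with s.A = t.A, s.B < t.B and s.C > t.C".
from collections import defaultdict

def violates_dc(rows):
    """Return True if some pair of rows in the same A-group violates the DC."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["A"]].append(r)
    for group in groups.values():
        group.sort(key=lambda r: r["B"])
        max_c = float("-inf")  # max C among rows with strictly smaller B
        i, n = 0, len(group)
        while i < n:
            # Check every row sharing the current B value against earlier rows only.
            j = i
            while j < n and group[j]["B"] == group[i]["B"]:
                if max_c > group[j]["C"]:
                    return True
                j += 1
            # Only now do these C values become visible to later (larger-B) rows.
            for k in range(i, j):
                max_c = max(max_c, group[k]["C"])
            i = j
    return False

rows = [
    {"A": "g1", "B": 10, "C": 5},
    {"A": "g1", "B": 20, "C": 3},  # with the first row: larger B, smaller C -> violation
    {"A": "g2", "B": 10, "C": 1},
]
print(violates_dc(rows))  # True
```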

    Against _Against Intellectual Property_: a Short Refutation of Meme Communism

    This essay is intended to be a refutation of the main thesis in Against Intellectual Property, Kinsella 2008 (hereafter, K8). Points of agreement, relatively trivial disagreement, and irrelevant issues will largely be ignored, as will much repetition of errors in K8. Otherwise, the procedure is to go through K8 quoting various significantly erroneous parts as they arise and explaining the errors involved. It will not be necessary to respond at the same length as K8 itself

    PatchIndex: exploiting approximate constraints in distributed databases

    Cloud data warehouse systems lower the barrier to accessing data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets due to a small set of values violating the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for the automatic discovery of PatchIndex candidate columns and demonstrate the performance benefits of PatchIndexes in our evaluation.
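    A minimal sketch of the underlying idea of an approximate constraint with an exception set follows, assuming an approximate uniqueness constraint, a hypothetical exception threshold, and illustrative helper names rather than the actual PatchIndex implementation.

```python
# Sketch: accept an "almost unique" column and record the violating rows as
# exceptions (a patch set) instead of rejecting the constraint outright.
from collections import Counter

def approximate_unique(values, max_exception_ratio=0.01):
    """Return (holds, exception_positions): the constraint is accepted if the
    rows violating uniqueness are at most `max_exception_ratio` of all rows."""
    counts = Counter(values)
    exceptions = [i for i, v in enumerate(values) if counts[v] > 1]
    holds = len(exceptions) <= max_exception_ratio * len(values)
    return holds, exceptions

values = ["a", "b", "c", "d", "d"] + [f"id{i}" for i in range(200)]
holds, patch = approximate_unique(values)
print(holds, patch)  # True, plus the positions of the two duplicate 'd' rows
```

    A query can then use the (near-)unique column as if the constraint held, as long as it consults the small patch set for the recorded exceptions.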

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
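    As a concrete illustration of the simpler single-column tasks mentioned above (null counts, distinct values, type inference, frequent values), a minimal sketch follows; the column representation and the type-inference rule are illustrative assumptions, not taken from any particular profiling tool.

```python
# Sketch: basic single-column profiling statistics.
from collections import Counter

def profile_column(values):
    """Compute simple profiling metadata for one column given as a list of values."""
    non_null = [v for v in values if v is not None and v != ""]

    def looks_numeric(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "inferred_type": "numeric" if non_null and all(looks_numeric(v) for v in non_null) else "text",
        "most_frequent": Counter(non_null).most_common(3),
    }

print(profile_column(["1", "2", None, "2", "3"]))
# e.g. {'rows': 5, 'nulls': 1, 'distinct': 3, 'inferred_type': 'numeric', ...}
```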

    Stuck in Traffic (SiT) Attacks: A Framework for Identifying Stealthy Attacks that Cause Traffic Congestion

    Recent advances in wireless technologies have enabled many new applications in Intelligent Transportation Systems (ITS), such as collision avoidance, cooperative driving, congestion avoidance, and traffic optimization. Due to the vulnerability of wireless communication to interference and intentional jamming, ITS face new challenges in ensuring the reliability and safety of the overall system. In this paper, we expose a class of stealthy attacks -- Stuck in Traffic (SiT) attacks -- that aim to cause congestion by exploiting how drivers make decisions based on smart traffic signs. An attacker mounting a SiT attack solves a Markov Decision Process problem to find optimal or suboptimal attack policies, interfering with a well-chosen subset of signals selected based on the state of the system. We apply Approximate Policy Iteration (API) algorithms to derive potent attack policies. We evaluate their performance on a number of systems and compare them to other attack policies, including random, myopic, and DoS attack policies. The generated policies, albeit suboptimal, are shown to significantly outperform the alternatives, as they maximize the attacker's expected cumulative reward.
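    As a rough illustration of the kind of optimization such an attacker solves, the sketch below runs exact policy iteration on a tiny randomly generated MDP; the paper applies Approximate Policy Iteration to a traffic model, so the states, actions, rewards, and transitions here are purely illustrative assumptions.

```python
# Sketch: exact policy iteration on a small abstract MDP.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(0, 1, size=(n_states, n_actions))                 # expected reward for (s, a)

policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the action-value function.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("policy:", policy, "state values:", np.round(V, 3))
```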