
    On Integrity Constraints for a Waste Management Information System

    There is a waste problem in nearly every country. A model of the waste-generating system and an efficient waste management information system are the first steps towards controlling this problem. Some countries have already enacted laws that require communities and enterprises to report annually the amounts of waste they produce; the German federal state of Lower Saxony, for example, enacted such a law in 1992. This YSSP project presents a case study on the development of a waste management information system for this state. The quality of the system depends essentially on the consistency of the underlying database, so the system is described from the point of view of a database designer, with particular emphasis on the design of the data structures and the support of integrity constraints in the underlying database system. The data structures are modelled using an extended entity-relationship model and implemented with a relational database management system. In contrast to the traditional approach of enforcing integrity constraints in the application program, we define triggers, implemented in the database system itself, to enforce the main consistency rules. The final chapter draws conclusions about further steps.
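    A minimal sketch of the trigger-based approach, using SQLite from Python; the schema and the consistency rule (reported amounts must be non-negative) are illustrative assumptions, not the report's actual data model.

        # Sketch: enforce a consistency rule with a trigger in the database itself,
        # rather than in the application program. Schema and rule are hypothetical.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE enterprise (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
            CREATE TABLE waste_report (
                enterprise_id INTEGER REFERENCES enterprise(id),
                year          INTEGER NOT NULL,
                amount_tonnes REAL    NOT NULL
            );
            CREATE TRIGGER check_amount
            BEFORE INSERT ON waste_report
            WHEN NEW.amount_tonnes < 0
            BEGIN
                SELECT RAISE(ABORT, 'amount_tonnes must be non-negative');
            END;
        """)
        conn.execute("INSERT INTO enterprise VALUES (1, 'Example GmbH')")
        conn.execute("INSERT INTO waste_report VALUES (1, 1992, 12.5)")      # accepted
        try:
            conn.execute("INSERT INTO waste_report VALUES (1, 1992, -3.0)")  # rejected
        except sqlite3.IntegrityError as exc:
            print("rejected by trigger:", exc)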

    Efficient Discovery of Ontology Functional Dependencies

    Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors require user input to resolve because domain expertise defines specific terminology and relationships. In pharmaceuticals, for example, 'Advil' is-a brand name for 'ibuprofen', a relationship that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in data cleaning solutions to model syntactic equivalence, they cannot model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms and a linear inference procedure. We then develop effective algorithms for discovering OFDs and a set of optimizations that efficiently prune the search space. Our experimental evaluation using real data shows the scalability and accuracy of our algorithms.
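    As an informal illustration of what an OFD asserts (not the paper's discovery algorithm), the sketch below checks whether a candidate dependency holds when equality on the right-hand side is relaxed to equivalence in a small hand-written synonym ontology; the ontology, column names, and records are hypothetical.

        # Toy check of an Ontology FD  X -> Y: tuples that agree on X must have
        # Y-values that are ontologically equivalent, not necessarily string-equal.
        # The ontology and the records below are invented examples.
        from collections import defaultdict

        SYNONYMS = {                      # term -> canonical ontology concept
            "Advil": "ibuprofen",
            "Motrin": "ibuprofen",
            "ibuprofen": "ibuprofen",
            "Tylenol": "acetaminophen",
            "acetaminophen": "acetaminophen",
        }

        def canonical(term):
            return SYNONYMS.get(term, term)

        def ofd_holds(records, lhs, rhs):
            """True if lhs -> rhs holds up to ontology equivalence."""
            seen = defaultdict(set)
            for rec in records:
                seen[rec[lhs]].add(canonical(rec[rhs]))
            return all(len(concepts) == 1 for concepts in seen.values())

        rows = [
            {"prescription": "P1", "drug": "Advil"},
            {"prescription": "P1", "drug": "ibuprofen"},   # same concept, not dirty
            {"prescription": "P2", "drug": "Tylenol"},
        ]
        print(ofd_holds(rows, "prescription", "drug"))     # True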

    Data quality: Some comments on the NASA software defect datasets

    Background: Self-evidently, empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and upon using the same rather than merely similar versions of datasets. In recent years there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories, and the publicly available NASA datasets have been used extensively in this research. Objective: This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method: We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results: We find important differences between the two versions of the datasets, implausible values in one dataset, and generally insufficient detail documented on dataset preprocessing. Conclusions: It is recommended that researchers 1) indicate the provenance of the datasets they use, 2) report any preprocessing in sufficient detail to enable meaningful replication, and 3) invest effort in understanding the data prior to applying machine learners.
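    In the spirit of recommendation 3, a small sanity-check sketch of the kind of data inspection the note argues for; the column names, thresholds, and toy values are illustrative assumptions, not the actual NASA MDP attributes.

        # Illustrative pre-modelling checks: flag implausible values and duplicate
        # rows before training any learner. Column names are hypothetical.
        import pandas as pd

        def audit(df):
            issues = []
            if (df["loc"] <= 0).any():
                issues.append("modules with non-positive lines of code")
            if (df["cyclomatic_complexity"] > df["loc"]).any():
                issues.append("cyclomatic complexity exceeds LOC (implausible)")
            duplicates = int(df.duplicated().sum())
            if duplicates:
                issues.append(f"{duplicates} duplicated rows")
            return issues

        df = pd.DataFrame({
            "loc": [120, 0, 45],
            "cyclomatic_complexity": [10, 3, 60],
            "defective": [1, 0, 0],
        })
        for problem in audit(df):
            print("check failed:", problem)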

    Constraints for Semistructured Data and XML

    Integrity constraints play a fundamental role in database design. We review initial work on expressing integrity constraints for semistructured data and XML.

    Verifying UML/OCL operation contracts

    In current model-driven development approaches, software models are the primary artifacts of the development process; assessing their correctness is therefore a key issue in ensuring the quality of the final application. Research on model consistency has focused mostly on the static aspects of models. This paper instead addresses the verification of their dynamic aspects, expressed as a set of operations defined by means of pre/postcondition contracts. It presents an automatic method based on Constraint Programming to verify UML models extended with OCL constraints and operation contracts. In our approach, both static and dynamic aspects are translated into a Constraint Satisfaction Problem; compliance of the operations with several correctness properties, such as operation executability and determinism, is then formally verified.
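    A much-simplified sketch of the executability check: encode an operation's contract over a tiny integer domain and search for a state transition that satisfies the precondition, the postcondition, and the class invariant. The brute-force search stands in for a real constraint solver, and the withdraw-style contract is an invented example, not one from the paper.

        # Toy executability check: an operation is executable if some pre-state
        # satisfying its precondition admits a post-state satisfying its
        # postcondition, with the invariant holding in both states.
        # Brute force over a small domain stands in for a CP solver.
        from itertools import product

        DOMAIN = range(0, 11)      # hypothetical account balances 0..10
        AMOUNT = 3                 # amount withdrawn by the operation

        def invariant(balance):
            return balance >= 0

        def pre(balance):
            return balance >= AMOUNT

        def post(balance, balance_post):
            return balance_post == balance - AMOUNT

        def executable():
            for b, b_post in product(DOMAIN, DOMAIN):
                if invariant(b) and pre(b) and post(b, b_post) and invariant(b_post):
                    return b, b_post
            return None

        print("executability witness (pre, post):", executable())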

    Transformation Techniques for OCL Constraints

    Constraints play a key role in the definition of conceptual schemas. In the UML, constraints are usually specified by means of invariants written in the OCL. However, owing to the high expressiveness of the OCL, the designer has several syntactic alternatives for expressing each constraint. The techniques presented in this paper assist the designer during the definition of constraints by generating equivalent alternatives to the ones initially defined. Moreover, in the context of the MDA, transformations between these alternatives are required as part of the PIM-to-PIM, PIM-to-PSM, or PIM-to-code transformations of the original conceptual schema.
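    To give a flavour of such equivalences (not the paper's actual transformation rules), the sketch below rewrites one common OCL pattern into an equivalent alternative with a simple textual rule; a real tool would operate on the OCL abstract syntax tree, and the invariant is an invented example.

        # Toy syntactic transformation between equivalent OCL alternatives:
        # "coll->size() > 0"  <=>  "coll->notEmpty()". Regex-based for brevity;
        # the rule set and example invariant are illustrative only.
        import re

        RULES = [
            (re.compile(r"->size\(\)\s*>\s*0"), "->notEmpty()"),
            (re.compile(r"->size\(\)\s*=\s*0"), "->isEmpty()"),
        ]

        def rewrite(invariant):
            for pattern, replacement in RULES:
                invariant = pattern.sub(replacement, invariant)
            return invariant

        inv = "context Department inv: self.employees->size() > 0"
        print(rewrite(inv))   # context Department inv: self.employees->notEmpty()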

    Completeness and Consistency Analysis for Evolving Knowledge Bases

    Assessing the quality of an evolving knowledge base is a challenging task, as it often requires identifying appropriate quality assessment procedures. Since the data are often derived from autonomous and increasingly large data sources, it is impractical to curate them manually and challenging to assess their quality continuously and automatically. In this paper, we explore two main areas of quality assessment for evolving knowledge bases: (i) identification of completeness issues using knowledge base evolution analysis, and (ii) identification of consistency issues based on integrity constraints, such as minimum and maximum cardinality and range constraints. For completeness analysis, we use data profiling information from consecutive knowledge base releases to estimate completeness measures that allow predicting quality issues. We then perform consistency checks to validate the results of the completeness analysis using integrity constraints and learning models. The approach has been tested both quantitatively and qualitatively on subsets of the DBpedia and 3cixty knowledge bases, and its performance is evaluated using precision, recall, and F1 score. From the completeness analysis, we observe 94% precision for the English DBpedia KB and 95% precision for the 3cixty Nice KB. We also assessed the performance of our consistency analysis using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraints. The best-performing model in our experimental setup is the Random Forest, reaching an F1 score greater than 90% for minimum and maximum cardinality and 84% for range constraints.
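    A simplified sketch of the two checks on toy data: a completeness signal from property counts in consecutive releases, and a minimum/maximum-cardinality consistency check. The predicate names, counts, and bounds are illustrative, not values from the DBpedia or 3cixty experiments.

        # (i) Completeness: a property whose instance count drops between
        #     consecutive releases is flagged as a possible completeness issue.
        # (ii) Consistency: entities violating a min/max cardinality bound are
        #      flagged. All names and numbers below are invented.

        release_n  = {"dbo:birthPlace": 1200, "dbo:spouse": 300}
        release_n1 = {"dbo:birthPlace": 900,  "dbo:spouse": 310}   # next release

        def completeness_issues(prev, curr):
            return [p for p in prev if curr.get(p, 0) < prev[p]]

        def cardinality_issues(entity_values, min_card=1, max_card=1):
            return [e for e, vals in entity_values.items()
                    if not (min_card <= len(vals) <= max_card)]

        print("possible completeness loss:", completeness_issues(release_n, release_n1))

        birth_places = {"Alice": ["Berlin"], "Bob": [], "Carol": ["Nice", "Paris"]}
        print("cardinality violations:", cardinality_issues(birth_places))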

    A Simple Proportional Conflict Redistribution Rule

    We propose a first alternative to the WAO (Weighted Average Operator) combination rule recently proposed by Josang, Daniel, and Vannoorenberghe, called the Proportional Conflict Redistribution rule (denoted PCR1). PCR1 and WAO are particular cases of the WO (Weighted Operator), because the conflicting mass is redistributed with respect to some weighting factors. In this first PCR rule, the proportionalization is done for each non-empty set with respect to the non-zero sum of its corresponding mass-matrix column, instead of its mass-column average as in WAO; the results are nevertheless the same, as Ph. Smets has pointed out. We also extend WAO (which here gives no solution) to the degenerate case in which the column sums of all non-empty sets are zero: the conflicting mass is then transferred to the non-empty disjunctive form of all non-empty sets taken together, and if this disjunctive form happens to be empty, one considers an open world (i.e., the frame of discernment might contain new hypotheses) and all conflicting mass is transferred to the empty set. In addition to WAO, we propose a general formula for PCR1 (which coincides with WAO in the non-degenerate cases).
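    A small numerical sketch of PCR1 as the abstract describes it: combine two basic belief assignments conjunctively, then redistribute the total conflicting mass to each non-empty set in proportion to the sum of its mass column. The frame {A, B} and the input masses are invented example values.

        # PCR1 for two sources, following the description above. Focal elements
        # are frozensets; the masses below are illustrative, not from the paper.
        from itertools import product

        m1 = {frozenset("A"): 0.6, frozenset("B"): 0.3, frozenset("AB"): 0.1}
        m2 = {frozenset("A"): 0.2, frozenset("B"): 0.7, frozenset("AB"): 0.1}

        conj, conflict = {}, 0.0
        for (x, mx), (y, my) in product(m1.items(), m2.items()):
            inter = x & y
            if inter:
                conj[inter] = conj.get(inter, 0.0) + mx * my
            else:
                conflict += mx * my                      # total conflicting mass

        columns = {s: m1.get(s, 0.0) + m2.get(s, 0.0)    # mass-column sums
                   for s in set(m1) | set(m2)}
        total = sum(columns.values())                    # = 2 for two normalized bbas

        pcr1 = {s: conj.get(s, 0.0) + columns[s] / total * conflict for s in columns}
        for s in sorted(pcr1, key=lambda f: "".join(sorted(f))):
            print("".join(sorted(s)), round(pcr1[s], 4))
        print("sum =", round(sum(pcr1.values()), 4))     # masses still sum to 1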

    HoloDetect: Few-Shot Learning for Error Detection

    We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model that learns rich representations capturing the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches, and show that data augmentation yields an average improvement of 20 F1 points while requiring access to 3x fewer labeled examples than other ML approaches.
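    The augmentation idea can be sketched as follows (a toy stand-in, not the paper's learned policies): take a few clean seed records, apply simple dataset-specific error transformations, and use the resulting labeled values to train an error detector. The transformations and the seed records here are invented.

        # Toy augmentation loop in the spirit of the framework: generate noisy
        # training examples from a handful of clean seed records via simple
        # error transformations, then label values clean (0) or dirty (1).
        import random

        random.seed(0)
        seed_records = [("Chicago", "IL"), ("Boston", "MA"), ("Austin", "TX")]

        def typo(value):                   # swap two adjacent characters
            if len(value) < 2:
                return value
            i = random.randrange(len(value) - 1)
            return value[:i] + value[i + 1] + value[i] + value[i + 2:]

        def missing_value(value):          # replace with an out-of-domain token
            return "N/A"

        TRANSFORMS = [typo, missing_value]

        training = []                      # (value, label) pairs for a detector
        for city, state in seed_records:
            training.append((city, 0))
            training.append((state, 0))
            transform = random.choice(TRANSFORMS)
            training.append((transform(city), 1))   # augmented dirty example

        print(training)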