On Integrity Constraints for a Waste Management Information System
There is a waste problem in nearly every country. A model of the waste-generating system and an efficient waste management information system are the first steps towards controlling this problem. Some countries have already enacted laws that require communities and enterprises to report annually the amounts of waste they produce; the German federal state of Lower Saxony, for example, enacted such a law in 1992. This YSSP project deals with a case study on the development of a waste management information system for this state. The quality of the system depends essentially on the consistency of the underlying database, so the problem is approached from the point of view of a database designer. Particular emphasis is placed on the design of the data structures and on the support of integrity constraints in the underlying database system. The data structures are modelled using an extended entity-relationship model and implemented with a relational database management system. In contrast to the traditional approach of enforcing integrity constraints in the application program, we define triggers implemented in the database system itself to enforce the main consistency rules. The final chapter draws some conclusions about further steps.
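As a hedged illustration of the general idea (not the original Lower Saxony system: the table, column names, and the use of SQLite are assumptions), the following Python sketch shows what it means to move a consistency rule out of the application and into a database-level trigger:

```python
# Minimal sketch: a consistency rule enforced by a trigger inside the database
# itself, so every application writing to the table is subject to it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE waste_report (
    producer_id INTEGER NOT NULL,
    year        INTEGER NOT NULL,
    amount_kg   REAL    NOT NULL
);

-- Reject negative waste amounts at insert time.
CREATE TRIGGER check_amount_nonnegative
BEFORE INSERT ON waste_report
WHEN NEW.amount_kg < 0
BEGIN
    SELECT RAISE(ABORT, 'amount_kg must be non-negative');
END;
""")

conn.execute("INSERT INTO waste_report VALUES (1, 1992, 1250.0)")      # accepted
try:
    conn.execute("INSERT INTO waste_report VALUES (2, 1992, -5.0)")    # rejected
except sqlite3.DatabaseError as e:
    print("rejected by trigger:", e)
```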
Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing complexity and size of modern datasets. Constraint-based data cleaning techniques rely on integrity constraints as a benchmark to identify and correct errors. Data values that do not satisfy the given set of constraints are flagged as dirty, and data updates are made to re-align the data and the constraints. However, many errors require user input to resolve, because domain expertise defines specific terminology and relationships. For example, in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen', a relationship that can be captured in a pharmaceutical ontology. While functional dependencies (FDs) have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they cannot model broader relationships (e.g., is-a) defined by an ontology. In this paper, we take a first step towards extending the set of data quality constraints used in data cleaning by defining and discovering Ontology Functional Dependencies (OFDs). We lay out theoretical and practical foundations for OFDs, including a set of sound and complete axioms and a linear inference procedure. We then develop effective algorithms for discovering OFDs, together with a set of optimizations that efficiently prune the search space. Our experimental evaluation on real data shows the scalability and accuracy of our algorithms.
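To make the notion concrete, here is a small hedged sketch (the ontology, attribute names, and rows are invented for illustration, and this is not the paper's discovery algorithm): a dependency check that treats right-hand-side values as equivalent when the ontology maps them to the same concept, where a plain FD check would report a violation.

```python
# A classic FD check fails on synonyms, while an ontology-aware (OFD-style)
# check treats 'Advil' and 'ibuprofen' as the same concept.
from collections import defaultdict

# Toy "ontology": each value mapped to a canonical concept via synonym/is-a.
ONTOLOGY = {"Advil": "ibuprofen", "ibuprofen": "ibuprofen", "Tylenol": "acetaminophen"}

rows = [
    {"prescription": "P1", "drug": "Advil"},
    {"prescription": "P1", "drug": "ibuprofen"},   # same concept, different name
    {"prescription": "P2", "drug": "Tylenol"},
]

def holds(rows, lhs, rhs, normalize=lambda v: v):
    """Check whether lhs -> rhs holds when rhs values are compared after normalization."""
    seen = defaultdict(set)
    for r in rows:
        seen[r[lhs]].add(normalize(r[rhs]))
    return all(len(vals) == 1 for vals in seen.values())

print(holds(rows, "prescription", "drug"))                                # False: plain FD violated
print(holds(rows, "prescription", "drug", lambda v: ONTOLOGY.get(v, v)))  # True: OFD-style check
```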
Data quality: Some comments on the NASA software defect datasets
Background: Self-evidently, empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and upon using the same, rather than similar, versions of datasets. In recent years, there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories; the publicly available NASA datasets have been used extensively as part of this research. Objective: This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method: We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results: We find important differences between the two versions of the datasets, implausible values in one dataset, and generally insufficient detail documented on dataset preprocessing. Conclusions: It is recommended that researchers (1) indicate the provenance of the datasets they use, (2) report any preprocessing in sufficient detail to enable meaningful replication, and (3) invest effort in understanding the data prior to applying machine learners.
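A hedged sketch of the kind of pre-modelling sanity check the note argues for (the records and metric names are hypothetical, and this is not the authors' procedure): scan a defect dataset for implausible values and duplicates before training any learner.

```python
# Invented module records; real NASA MDP files have many more attributes.
modules = [
    {"id": "m1", "loc": 120, "cyclomatic": 14, "defective": 1},
    {"id": "m2", "loc": 0,   "cyclomatic": 3,  "defective": 0},   # implausible: 0 LOC but complexity > 1
    {"id": "m3", "loc": 45,  "cyclomatic": -2, "defective": 0},   # implausible: negative metric
    {"id": "m3", "loc": 45,  "cyclomatic": -2, "defective": 0},   # duplicate row
]

def sanity_report(modules):
    issues, seen = [], set()
    for m in modules:
        key = tuple(sorted(m.items()))
        if key in seen:
            issues.append(f"{m['id']}: duplicate record")
        seen.add(key)
        if m["cyclomatic"] < 1:
            issues.append(f"{m['id']}: implausible cyclomatic complexity {m['cyclomatic']}")
        if m["loc"] == 0 and m["cyclomatic"] > 1:
            issues.append(f"{m['id']}: zero LOC but non-trivial complexity")
    return issues

for issue in sanity_report(modules):
    print(issue)
```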
Constraints for Semistructured Data and XML
Integrity constraints play a fundamental role in database design. We review initial work on the expression of integrity constraints for semistructured data and XML.
Verifying UML/OCL operation contracts
In current model-driven development approaches, software models are the primary artifacts of the development process; assessing their correctness is therefore a key issue in ensuring the quality of the final application. Research on model consistency has focused mostly on the models' static aspects. This paper instead addresses the verification of their dynamic aspects, expressed as a set of operations defined by means of pre/postcondition contracts. It presents an automatic method based on Constraint Programming to verify UML models extended with OCL constraints and operation contracts. In our approach, both static and dynamic aspects are translated into a Constraint Satisfaction Problem, and compliance of the operations with respect to several correctness properties, such as operation executability or determinism, is then formally verified.
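As a hedged illustration only (this is not the paper's tool, and the class, invariant, and contract are invented), the executability question can be posed as constraint satisfaction: does at least one pre/post state pair satisfy the invariant and the operation's contract? The sketch below uses the z3 solver for an Account class with invariant balance >= 0 and an operation withdraw(amount) whose postcondition is balance' = balance - amount.

```python
from z3 import Ints, Solver, sat

balance, balance_post, amount = Ints("balance balance_post amount")

s = Solver()
s.add(balance >= 0)                       # invariant in the pre-state
s.add(amount > 0)                         # precondition of withdraw
s.add(balance_post == balance - amount)   # postcondition (contract)
s.add(balance_post >= 0)                  # invariant must also hold in the post-state

# Executable: some pre/post state pair satisfies all constraints.
print("withdraw is executable:", s.check() == sat)
if s.check() == sat:
    print(s.model())
```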
Transformation Techniques for OCL Constraints
Constraints play a key role in the definition of conceptual schemas. In the UML, constraints are usually specified by means of invariants written in the OCL. However, due to the high expressiveness of the OCL, the designer has several syntactic alternatives for expressing each constraint. The techniques presented in this paper assist the designer during the definition of constraints by generating equivalent alternatives to the initially defined ones. Moreover, in the context of the MDA, transformations between these alternatives are required as part of the PIM-to-PIM, PIM-to-PSM, or PIM-to-code transformations of the original conceptual schema.
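A hedged mini-example of what "generating an equivalent alternative" can look like (this is not the paper's transformation catalogue; the expression encoding is invented): rewriting one well-known OCL equivalence, coll->forAll(x | not P) into not coll->exists(x | P), over a tiny expression tree.

```python
def rewrite(expr):
    """Recursively replace forAll-over-negation with negated exists."""
    if not isinstance(expr, tuple):
        return expr
    head, *args = map(rewrite, expr)
    # ('forAll', var, ('not', body))  ->  ('not', ('exists', var, body))
    if head == "forAll" and isinstance(args[1], tuple) and args[1][0] == "not":
        return ("not", ("exists", args[0], args[1][1]))
    return (head, *args)

# self.employees->forAll(e | not e.retired)
invariant = ("forAll", "e", ("not", ("attr", "e", "retired")))
print(rewrite(invariant))   # ('not', ('exists', 'e', ('attr', 'e', 'retired')))
```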
Completeness and Consistency Analysis for Evolving Knowledge Bases
Assessing the quality of an evolving knowledge base is a challenging task, as it often requires identifying correct quality assessment procedures. Since data are often derived from autonomous and increasingly large data sources, it is impractical to curate the data manually, and challenging to assess their quality continuously and automatically. In this paper, we explore two main areas of quality assessment related to evolving knowledge bases: (i) identification of completeness issues using knowledge base evolution analysis, and (ii) identification of consistency issues based on integrity constraints, such as minimum and maximum cardinality and range constraints. For the completeness analysis, we use data profiling information from consecutive knowledge base releases to estimate completeness measures that allow quality issues to be predicted. We then perform consistency checks to validate the results of the completeness analysis using integrity constraints and learning models. The approach has been tested both quantitatively and qualitatively on a subset of datasets from the DBpedia and 3cixty knowledge bases, with performance evaluated using precision, recall, and F1 score. In the completeness analysis, we observe a precision of 94% for the English DBpedia KB and 95% for the 3cixty Nice KB. We also assessed the performance of our consistency analysis using five learning models over three sub-tasks, namely minimum cardinality, maximum cardinality, and range constraints. The best performing model in our experimental setup is the Random Forest, reaching an F1 score greater than 90% for minimum and maximum cardinality and 84% for range constraints.
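A hedged sketch of the two ingredients (the property names and counts are invented, and this is not the paper's pipeline): comparing property counts between two releases to flag potential completeness drops, followed by a minimum-cardinality consistency check on the newer release.

```python
release_2023 = {"dbo:birthDate": 1_050_000, "dbo:deathDate": 310_000}
release_2024 = {"dbo:birthDate": 1_080_000, "dbo:deathDate": 120_000}  # suspicious drop

def completeness_issues(old, new, drop_threshold=0.10):
    """Flag properties whose instance count fell by more than drop_threshold."""
    issues = []
    for prop, old_count in old.items():
        new_count = new.get(prop, 0)
        if old_count and (old_count - new_count) / old_count > drop_threshold:
            issues.append((prop, old_count, new_count))
    return issues

print(completeness_issues(release_2023, release_2024))
# [('dbo:deathDate', 310000, 120000)]

# Minimum-cardinality check: every person entity should have at least one birth date.
entities = {"e1": {"dbo:birthDate": ["1920-01-01"]}, "e2": {}}
violations = [e for e, props in entities.items() if len(props.get("dbo:birthDate", [])) < 1]
print("min-cardinality violations:", violations)   # ['e2']
```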
A Simple Proportional Conflict Redistribution Rule
We propose a first alternative rule of combination to the Weighted Average Operator (WAO) recently proposed by Josang, Daniel and Vannoorenberghe, called the Proportional Conflict Redistribution rule (denoted PCR1). PCR1 and WAO are particular cases of the Weighted Operator (WO), because the conflicting mass is redistributed with respect to some weighting factors. In this first PCR rule, the proportionalization is done for each non-empty set with respect to the non-zero sum of its corresponding mass matrix column, instead of its mass column average as in WAO; the results are nevertheless the same, as Ph. Smets has pointed out. We also extend WAO (which gives no solution here) to the degenerate case in which all column sums of all non-empty sets are zero: the conflicting mass is then transferred to the non-empty disjunctive form of all non-empty sets together; if this disjunctive form happens to be empty, one considers an open world (i.e., the frame of discernment might contain new hypotheses) and all conflicting mass is transferred to the empty set. In addition to WAO, we propose a general formula for PCR1 (WAO for non-degenerate cases).
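A hedged numerical sketch of the PCR1 idea for two sources whose focal elements are the singletons of a frame {A, B} (a simplified reading of the rule; the paper gives the general formula): combine conjunctively, then redistribute the conflicting mass proportionally to each set's column sum m1(X) + m2(X).

```python
from itertools import product

def conjunctive(m1, m2):
    """Conjunctive combination on singletons: distinct singletons intersect to the empty set."""
    combined, conflict = {}, 0.0
    for (x, px), (y, py) in product(m1.items(), m2.items()):
        if x == y:
            combined[x] = combined.get(x, 0.0) + px * py
        else:
            conflict += px * py
    return combined, conflict

def pcr1(m1, m2):
    combined, conflict = conjunctive(m1, m2)
    # Redistribute the conflict proportionally to each set's column sum
    # (m1(X) + m2(X)), rather than to the column average as in WAO.
    col_sum = {x: m1.get(x, 0.0) + m2.get(x, 0.0) for x in set(m1) | set(m2)}
    total = sum(col_sum.values())
    return {x: combined.get(x, 0.0) + conflict * col_sum[x] / total for x in col_sum}

m1 = {"A": 0.6, "B": 0.4}
m2 = {"A": 0.2, "B": 0.8}
print(pcr1(m1, m2))   # {'A': 0.344, 'B': 0.656}; masses still sum to 1
```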
HoloDetect: Few-Shot Learning for Error Detection
We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while requiring access to 3x fewer labeled examples than other ML approaches.
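A hedged toy illustration of the augmentation idea only (HoloDetect learns its transformation policies from the data; here a few are hard-coded, and the seed values are invented): starting from a small set of clean records, generate labeled (value, is_error) training examples by applying dataset-specific corruptions.

```python
import random

random.seed(0)

def swap_adjacent(s):          # typo: transpose two adjacent characters
    if len(s) < 2:
        return s
    i = random.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]

def drop_char(s):              # typo: delete one character
    i = random.randrange(len(s))
    return s[:i] + s[i + 1:]

TRANSFORMATIONS = [swap_adjacent, drop_char, lambda s: s.upper(), lambda s: "N/A"]

clean_seed = ["Chicago", "Boston", "Seattle"]

def augment(seed, n_per_value=3):
    examples = [(v, 0) for v in seed]              # clean -> label 0
    for v in seed:
        for _ in range(n_per_value):
            t = random.choice(TRANSFORMATIONS)
            examples.append((t(v), 1))             # corrupted -> label 1
    return examples

for value, is_error in augment(clean_seed):
    print(is_error, value)
```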