164 research outputs found

    Datalog± Ontology Consolidation

    Get PDF
    Knowledge bases in the form of ontologies are receiving increasing attention as they allow to clearly represent both the available knowledge, which includes the knowledge in itself and the constraints imposed to it by the domain or the users. In particular, Datalog ± ontologies are attractive because of their property of decidability and the possibility of dealing with the massive amounts of data in real world environments; however, as it is the case with many other ontological languages, their application in collaborative environments often lead to inconsistency related issues. In this paper we introduce the notion of incoherence regarding Datalog± ontologies, in terms of satisfiability of sets of constraints, and show how under specific conditions incoherence leads to inconsistent Datalog ± ontologies. The main contribution of this work is a novel approach to restore both consistency and coherence in Datalog± ontologies. The proposed approach is based on kernel contraction and restoration is performed by the application of incision functions that select formulas to delete. Nevertheless, instead of working over minimal incoherent/inconsistent sets encountered in the ontologies, our operators produce incisions over non-minimal structures called clusters. We present a construction for consolidation operators, along with the properties expected to be satisfied by them. Finally, we establish the relation between the construction and the properties by means of a representation theorem. Although this proposal is presented for Datalog± ontologies consolidation, these operators can be applied to other types of ontological languages, such as Description Logics, making them apt to be used in collaborative environments like the Semantic Web.Fil: Deagustini, Cristhian Ariel David. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Martinez, Maria Vanina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Falappa, Marcelo Alejandro. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Simari, Guillermo Ricardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentin

    Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints

    Full text link
    Matching dependencies (MDs) were introduced to specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the "pure case" of MDs, any value from the underlying data domain can be used for the value in common that does the matching. We investigate the semantics and properties of data cleaning through the enforcement of matching dependencies for the pure case. We characterize the intended clean instances and also the "clean answers" to queries as those that are invariant under the cleaning process. The complexity of computing clean instances and clean answers to queries is investigated. Tractable and intractable cases depending on the MDs and queries are identified. Finally, we establish connections with database "repairs" under integrity constraints.Comment: 13 pages, double column, 2 figure

    Data quality evaluation through data quality rules and data provenance.

    Get PDF
    The application and exploitation of large amounts of data play an ever-increasing role in today’s research, government, and economy. Data understanding and decision making heavily rely on high quality data; therefore, in many different contexts, it is important to assess the quality of a dataset in order to determine if it is suitable to be used for a specific purpose. Moreover, as the access to and the exchange of datasets have become easier and more frequent, and as scientists increasingly use the World Wide Web to share scientific data, there is a growing need to know the provenance of a dataset (i.e., information about the processes and data sources that lead to its creation) in order to evaluate its trustworthiness. In this work, data quality rules and data provenance are used to evaluate the quality of datasets. Concerning the first topic, the applied solution consists in the identification of types of data constraints that can be useful as data quality rules and in the development of a software tool to evaluate a dataset on the basis of a set of rules expressed in the XML markup language. We selected some of the data constraints and dependencies already considered in the data quality field, but we also used order dependencies and existence constraints as quality rules. In addition, we developed some algorithms to discover the types of dependencies used in the tool. To deal with the provenance of data, the Open Provenance Model (OPM) was adopted, an experimental query language for querying OPM graphs stored in a relational database was implemented, and an approach to design OPM graphs was proposed

    Modeling and Querying Uncertainty in Data Cleaning

    Get PDF
    Data quality problems such as duplicate records, missing values, and violation of integrity constrains frequently appear in real world applications. Such problems cost enterprises billions of dollars annually, and might have unpredictable consequences in mission-critical tasks. The process of data cleaning refers to detecting and correcting errors in data in order to improve the data quality. Numerous efforts have been taken towards improving the effectiveness and the efficiency of the data cleaning. A major challenge in the data cleaning process is the inherent uncertainty about the cleaning decisions that should be taken by the cleaning algorithms (e.g., deciding whether two records are duplicates or not). Existing data cleaning systems deal with the uncertainty in data cleaning decisions by selecting one alternative, based on some heuristics, while discarding (i.e., destroying) all other alternatives, which results in a false sense of certainty. Furthermore, because of the complex dependencies among cleaning decisions, it is difficult to reverse the process of destroying some alternatives (e.g., when new external information becomes available). In most cases, restarting the data cleaning from scratch is inevitable whenever we need to incorporate new evidence. To address the uncertainty in the data cleaning process, we propose a new approach, called probabilistic data cleaning, that views data cleaning as a random process whose possible outcomes are possible clean instances (i.e., repairs). Our approach generates multiple possible clean instances to avoid the destructive aspect of current cleaning systems. In this dissertation, we apply this approach in the context of two prominent data cleaning problems: duplicate elimination, and repairing violations of functional dependencies (FDs). First, we propose a probabilistic cleaning approach for the problem of duplicate elimination. We define a space of possible repairs that can be efficiently generated. To achieve this goal, we concentrate on a family of duplicate detection approaches that are based on parameterized hierarchical clustering algorithms. We propose a novel probabilistic data model that compactly encodes the defined space of possible repairs. We show how to efficiently answer relational queries using the set of possible repairs. We also define new types of queries that reason about the uncertainty in the duplicate elimination process. Second, in the context of repairing violations of FDs, we propose a novel data cleaning approach that allows sampling from a space of possible repairs. Initially, we contrast the existing definitions of possible repairs, and we propose a new definition of possible repairs that can be sampled efficiently. We present an algorithm that randomly samples from this space, and we present multiple optimizations to improve the performance of the sampling algorithm. Third, we show how to apply our probabilistic data cleaning approach in scenarios where both data and FDs are unclean (e.g., due to data evolution or inaccurate understanding of the data semantics). We propose a framework that simultaneously modifies the data and the FDs while satisfying multiple objectives, such as consistency of the resulting data with respect to the resulting FDs, (approximate) minimality of changes of data and FDs, and leveraging the trade-off between trusting the data and trusting the FDs. In presence of uncertainty in the relative trust in data versus FDs, we show how to extend our cleaning algorithm to efficiently generate multiple possible repairs, each of which corresponds to a different level of relative trust

    Personalizable Knowledge Integration

    Get PDF
    Large repositories of data are used daily as knowledge bases (KBs) feeding computer systems that support decision making processes, such as in medical or financial applications. Unfortunately, the larger a KB is, the harder it is to ensure its consistency and completeness. The problem of handling KBs of this kind has been studied in the AI and databases communities, but most approaches focus on computing answers locally to the KB, assuming there is some single, epistemically correct solution. It is important to recognize that for some applications, as part of the decision making process, users consider far more knowledge than that which is contained in the knowledge base, and that sometimes inconsistent data may help in directing reasoning; for instance, inconsistency in taxpayer records can serve as evidence of a possible fraud. Thus, the handling of this type of data needs to be context-sensitive, creating a synergy with the user in order to build useful, flexible data management systems. Inconsistent and incomplete information is ubiquitous and presents a substantial problem when trying to reason about the data: how can we derive an adequate model of the world, from the point of view of a given user, from a KB that may be inconsistent or incomplete? In this thesis we argue that in many cases users need to bring their application-specific knowledge to bear in order to inform the data management process. Therefore, we provide different approaches to handle, in a personalized fashion, some of the most common issues that arise in knowledge management. Specifically, we focus on (1) inconsistency management in relational databases, general knowledge bases, and a special kind of knowledge base designed for news reports; (2) management of incomplete information in the form of different types of null values; and (3) answering queries in the presence of uncertain schema matchings. We allow users to define policies to manage both inconsistent and incomplete information in their application in a way that takes both the user's knowledge of his problem, and his attitude to error/risk, into account. Using the frameworks and tools proposed here, users can specify when and how they want to manage/solve the issues that arise due to inconsistency and incompleteness in their data, in the way that best suits their needs

    The Fourth International VLDB Workshop on Management of Uncertain Data

    Get PDF
    corecore