8 research outputs found
Federated knowledge base debugging in DL-LiteA
Due to the continuously growing amount of data, the federation of different and distributed data sources has gained increasing attention. To tackle the challenge of federating heterogeneous sources, a variety of approaches has been proposed. Especially in the context of the Semantic Web, Description Logics are among the preferred methods for modelling federated knowledge based on a well-defined syntax and semantics. However, the more data are available from heterogeneous sources, the higher the risk of inconsistency – a serious obstacle to performing reasoning tasks and query answering over a federated knowledge base. For a single knowledge base, the process of knowledge base debugging, comprising the identification and resolution of conflicting statements, has been widely studied, whereas federated settings integrating a network of loosely coupled data sources (such as LOD sources) have mostly been neglected.
In this thesis, we tackle the challenging problem of debugging federated knowledge bases, focusing on a lightweight Description Logic language, called DL-LiteA, that is aimed at applications requiring efficient and scalable reasoning. After introducing formal foundations such as Description Logics and Semantic Web technologies, we clarify the motivating context of this work and discuss the general problem of information integration based on Description Logics.
The main part of this thesis is subdivided into three parts. First, we discuss the specific characteristics of federated knowledge bases and provide an appropriate approach for detecting and explaining contradictory statements in a federated DL-LiteA knowledge base. Second, we study the representation of the identified conflicts and their relationships as a conflict graph and propose an approach for repair generation based on majority voting and statistical evidence. Third, to provide an alternative way of handling inconsistency in federated DL-LiteA knowledge bases, we propose an automated approach for assessing adequate trust values (i.e., probabilities) at different levels of granularity by leveraging probabilistic inference over a graphical model.
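To make the repair-generation idea concrete, the following is a minimal, hypothetical sketch of majority voting over a conflict graph. It assumes binary conflicts between assertions and a per-assertion count of supporting sources; the names and data shapes are illustrative, not the thesis's actual data model.

```python
# Hypothetical sketch: resolve each pairwise conflict by removing the
# assertion backed by fewer sources (majority voting).

def majority_vote_repair(conflicts, support):
    """conflicts: list of (assertion_a, assertion_b) pairs that contradict.
    support: dict mapping each assertion to its number of supporting sources.
    Returns the set of assertions to remove so that no conflict remains."""
    to_remove = set()
    for a, b in conflicts:
        if a in to_remove or b in to_remove:
            continue  # this conflict is already resolved by an earlier removal
        # Remove the assertion with weaker support; ties drop the second one.
        loser = a if support[a] < support[b] else b
        to_remove.add(loser)
    return to_remove

conflicts = [("x_type_Person", "x_type_Place"), ("x_type_Place", "x_bornIn_Berlin")]
support = {"x_type_Person": 3, "x_type_Place": 1, "x_bornIn_Berlin": 2}
print(sorted(majority_vote_repair(conflicts, support)))  # → ['x_type_Place']
```

Note how removing the weakly supported assertion resolves both conflicts at once, which is why representing conflicts as a graph (rather than resolving them independently) pays off.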
In the last part of this thesis, we evaluate the previously developed algorithms against a set of large distributed LOD sources. The discussion of the experimental results shows that the proposed approaches are effective, efficient, and scalable with respect to real-world scenarios. Moreover, owing to the exploitation of the federated structure in our algorithms, it becomes apparent that the number of identified wrong statements, the quality of the generated repair, and the fineness of the assessed trust values all profit from an increasing number of integrated sources.
Correcting Knowledge Base Assertions
The usefulness and usability of knowledge bases (KBs) are often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB.
An assertion and alignment correction framework for large scale knowledge bases
Various knowledge bases (KBs) have been constructed via information extraction from encyclopedias, text and tables, as well as alignment of multiple sources. Their usefulness and usability are often limited by quality issues. One common issue is the presence of erroneous assertions and alignments, often caused by lexical or semantic confusion. We study the problem of correcting such assertions and alignments, and present a general correction framework which combines lexical matching, context-aware sub-KB extraction, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated with one set of literal assertions from DBpedia, one set of entity assertions from an enterprise medical KB, and one set of mapping assertions from a music KB constructed by integrating Wikidata, Discogs and MusicBrainz. It has achieved promising results, with a correction rate (i.e., the ratio of the target assertions/alignments that are corrected with right substitutes) of 70.1%, 60.9% and 71.8%, respectively.
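The correction rate defined in the abstract can be sketched as a simple evaluation helper. The function and identifiers below are hypothetical illustrations of that definition, not the paper's code; a target with no proposed substitute simply counts as uncorrected.

```python
# Hypothetical helper computing the "correction rate": the ratio of target
# assertions that are corrected with the right substitute.

def correction_rate(targets, proposed, gold):
    """targets: ids of erroneous assertions; proposed/gold: id -> substitute."""
    right = sum(1 for t in targets if t in proposed and proposed[t] == gold[t])
    return right / len(targets)

targets = ["a1", "a2", "a3"]
proposed = {"a1": "dbr:London", "a2": "dbr:Paris"}  # no substitute found for a3
gold = {"a1": "dbr:London", "a2": "dbr:Berlin", "a3": "dbr:Rome"}
print(round(correction_rate(targets, proposed, gold), 3))  # → 0.333
```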
Pseudo-contractions as Gentle Repairs
Updating a knowledge base to remove an unwanted consequence is a challenging task. Some of the original sentences must be either deleted or weakened in such a way that the sentence to be removed is no longer entailed by the resulting set. On the other hand, it is desirable that the existing knowledge be preserved as much as possible, minimising the loss of information. Several approaches to this problem can be found in the literature. In particular, when the knowledge is represented by an ontology, two different families of frameworks have been developed over the past decades, with numerous ideas in common but with little interaction between the communities: applications of AGM-like Belief Change and justification-based Ontology Repair. In this paper, we investigate the relationship between pseudo-contraction operations and gentle repairs. Both aim to avoid the complete deletion of sentences when replacing them with weaker versions is enough to prevent the entailment of the unwanted formula. We show the correspondence between concepts on both sides and investigate under which conditions they are equivalent. Furthermore, we propose a unified notation for the two approaches, which might contribute to the integration of the two areas.
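The weakening-instead-of-deleting idea can be illustrated with a toy propositional example. The sketch below is hypothetical and far simpler than either framework in the paper: it uses forward chaining over Horn rules to show how replacing a rule with a weaker one blocks an unwanted entailment while preserving more knowledge than outright deletion.

```python
# Toy illustration of a "gentle repair": weaken a rule rather than delete it.

def entails(facts, rules, goal):
    """Forward chaining over Horn rules; rules are (body_set, head) pairs."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return goal in derived

facts = {"Penguin"}
rules = [({"Penguin"}, "Bird"), ({"Bird"}, "Flies")]
assert entails(facts, rules, "Flies")  # unwanted consequence

# Classical repair: delete ({"Bird"}, "Flies") entirely -- no bird ever flies.
# Gentle repair: weaken it to "non-penguin birds fly".
weakened = [({"Penguin"}, "Bird"), ({"Bird", "NonPenguin"}, "Flies")]
assert not entails(facts, weakened, "Flies")        # entailment is blocked
assert entails({"Bird", "NonPenguin"}, weakened, "Flies")  # knowledge preserved
print("gentle repair blocks the unwanted entailment")
```

The last assertion is the point of the construction: under deletion, flying birds would be lost entirely, whereas the weakened rule still derives them.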
Automated Reasoning
This volume, LNAI 13385, constitutes the refereed proceedings of the 11th International Joint Conference on Automated Reasoning, IJCAR 2022, held in Haifa, Israel, in August 2022. The 32 full research papers and 9 short papers presented together with two invited talks were carefully reviewed and selected from 85 submissions. The papers focus on the following topics: Satisfiability, SMT Solving, Arithmetic; Calculi and Orderings; Knowledge Representation and Justification; Choices, Invariance, Substitutions and Formalization; Modal Logics; Proof Systems and Proof Search; Evolution, Termination and Decision Problems. This is an open access book.
Scalable Quality Assessment of Linked Data
In a world where the information economy is booming, poor data quality can lead to adverse consequences, including social and economic problems such as a decrease in revenue. Furthermore, data-driven industries are not just relying on their own (proprietary) data silos, but are also continuously aggregating data from different sources. This aggregation could then be re-distributed back to “data lakes”. However, this data (including Linked Data) is not necessarily checked for its quality prior to its use. Large volumes of data are being exchanged in a standard and interoperable format between organisations and published as Linked Data to facilitate their re-use. Some organisations, such as government institutions, take a step further and open their data. The Linked Open Data Cloud is a witness to this. However, similar to data in data lakes, it is challenging to determine the quality of this heterogeneous data, and subsequently to make this information explicit to data consumers. Despite the availability of a number of tools and frameworks to assess Linked Data quality, current solutions do not provide a holistic approach that both enables the assessment of datasets and provides consumers with quality results that can then be used to find, compare and rank datasets’ fitness for use. In this thesis we investigate methods to assess the quality of (possibly large) linked datasets with the intent that data consumers can then use the assessment results to find datasets that are fit for use, that is, finding the right dataset for the task at hand.
Moreover, the benefits of quality assessment are two-fold: (1) data consumers do not need to blindly rely on subjective measures to choose a dataset, but can base their choice on multiple factors such as the intrinsic structure of the dataset, thereby fostering trust and reputation between publishers and consumers on more objective foundations; and (2) data publishers can be encouraged to improve their datasets so that they can be re-used more. Furthermore, our approach scales to large datasets. In this regard, we also look into improving the efficiency of quality metrics using various approximation techniques. The trade-off is that consumers will not get the exact quality value, but a very close estimate, which nevertheless provides the required guidance towards fitness for use. The central point of this thesis is not data quality improvement; nonetheless, we still need to understand what data quality means to the consumers who are searching for potential datasets. This thesis looks into the challenges faced in detecting quality problems in linked datasets, and into presenting quality results in a standardised, machine-readable and interoperable format that agents can make sense of in order to help human consumers identify datasets fit for use. Our proposed approach is consumer-centric in that it looks into (1) making the assessment of quality as easy as possible, that is, allowing stakeholders, possibly non-experts, to identify and easily define quality metrics and to initiate the assessment; and (2) making results (quality metadata and quality reports) easy for stakeholders to understand, or at least interoperable with other systems to facilitate a possible data quality pipeline. Finally, our framework is used to assess the quality of a number of heterogeneous (large) linked datasets, where each assessment returns a quality metadata graph that can be consumed by agents as Linked Data.
In turn, these agents can intelligently interpret a dataset’s quality with regard to multiple dimensions and observations, and thus provide further insight to consumers regarding its fitness for use.
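The approximation idea mentioned above, trading an exact quality value for a close estimate, can be sketched with uniform sampling. The function, dataset, and threshold below are hypothetical illustrations, not the thesis's actual metrics or framework.

```python
import random

def approx_metric(items, is_ok, sample_size, seed=0):
    """Estimate the fraction of items satisfying is_ok by uniform sampling,
    instead of scanning the full (possibly huge) dataset."""
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    sample = [items[rng.randrange(len(items))] for _ in range(sample_size)]
    return sum(1 for x in sample if is_ok(x)) / sample_size

# Synthetic dataset: 90% of "triples" pass the quality check.
data = ["ok"] * 9000 + ["bad"] * 1000
est = approx_metric(data, lambda t: t == "ok", sample_size=500)
print(round(est, 2))  # close to the true value 0.9
```

With 500 samples the standard error is about 0.013 here, so the estimate is typically within a couple of percentage points of the exact value at a fraction of the scanning cost, which is exactly the trade-off described above.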
A framework for detecting and repairing errors in DL-LiteA knowledge bases
Several logical formalisms have been proposed in the literature for expressing structural and semantic integrity constraints of Linked Open Data (LOD). Still, the data quality of the datasets published in the LOD Cloud needs to be improved, as published linked data often violate such constraints. This lack of consistency may jeopardise the value of applications consuming linked data in an automatic way. A major challenge in this respect is to provide the curators of linked data knowledge bases (KBs) with tools that will help them in detecting violations of integrity constraints and in resolving them, in order to render the knowledge base valid and improve its data quality.
In this work, we propose a novel, fully automatic framework for detecting violations of integrity constraints (diagnosis) in KBs, by executing the appropriate queries over the data, as well as for resolving those violations (repair), by removing invalid data from the KB. Our approach takes into consideration both explicit and inferred ontology knowledge, by relying on the ontology language DL-LiteA for the expression of several useful types of logical constraints and for the detection of data that are inconsistent with those constraints, while maintaining good computational properties.
The framework proposed in this work is modular, allowing each component to be implemented independently of the other components. This way, we are able to implement our framework using off-the-shelf, state-of-the-art tools for several features, such as reasoning and query execution.
We have implemented and evaluated our framework, showing that it is scalable to large datasets and to the numbers of invalidities exhibited in reality by reference linked datasets, such as DBpedia. The evaluation also shows that our framework can be used over already deployed knowledge bases, without any further reconfiguration.
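The diagnosis step, detecting constraint violations by querying the data, can be sketched for one common DL-LiteA constraint type. The snippet below is a hypothetical in-memory illustration of checking a class-disjointness axiom over a set of triples; in the actual framework such checks would be delegated to a query engine, and the entities shown are only examples.

```python
# Hypothetical sketch: a disjointness axiom A ⊑ ¬B is violated by any
# individual asserted (or inferred) to be an instance of both A and B.

def disjointness_violations(triples, class_a, class_b):
    """triples: iterable of (subject, predicate, object) tuples.
    Returns the individuals typed as both class_a and class_b."""
    def members(c):
        return {s for s, p, o in triples if p == "rdf:type" and o == c}
    return members(class_a) & members(class_b)

kb = [
    ("dbr:Paris", "rdf:type", "dbo:Place"),
    ("dbr:Paris", "rdf:type", "dbo:Person"),  # the inconsistent assertion
    ("dbr:Curie", "rdf:type", "dbo:Person"),
]
print(disjointness_violations(kb, "dbo:Person", "dbo:Place"))  # → {'dbr:Paris'}
```

A repair in the sense of the abstract would then remove one of the two offending type assertions for each returned individual.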