78 research outputs found
Correcting Knowledge Base Assertions
The usefulness and usability of knowledge bases (KBs) is often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB
Engineering Agile Big-Data Systems
To be effective, data-intensive systems require extensive ongoing customisation to reflect changing user requirements, organisational policies, and the structure and interpretation of the data they hold. Manual customisation is expensive, time-consuming, and error-prone. In large complex systems, the value of the data can be such that exhaustive testing is necessary before any new feature can be added to the existing design. In most cases, the precise details of requirements, policies and data will change during the lifetime of the system, forcing a choice between expensive modification and continued operation with an inefficient design.Engineering Agile Big-Data Systems outlines an approach to dealing with these problems in software and data engineering, describing a methodology for aligning these processes throughout product lifecycles. It discusses tools which can be used to achieve these goals, and, in a number of case studies, shows how the tools and methodology have been used to improve a variety of academic and business systems
Engineering Agile Big-Data Systems
To be effective, data-intensive systems require extensive ongoing customisation to reflect changing user requirements, organisational policies, and the structure and interpretation of the data they hold. Manual customisation is expensive, time-consuming, and error-prone. In large complex systems, the value of the data can be such that exhaustive testing is necessary before any new feature can be added to the existing design. In most cases, the precise details of requirements, policies and data will change during the lifetime of the system, forcing a choice between expensive modification and continued operation with an inefficient design.Engineering Agile Big-Data Systems outlines an approach to dealing with these problems in software and data engineering, describing a methodology for aligning these processes throughout product lifecycles. It discusses tools which can be used to achieve these goals, and, in a number of case studies, shows how the tools and methodology have been used to improve a variety of academic and business systems
Recommended from our members
An assertion and alignment correction framework for large scale knowledge bases
Various knowledge bases (KBs) have been constructed via information extraction from encyclopedias, text and tables, as well as alignment of multiple sources. Their usefulness and usability is often limited by quality issues. One common issue is the presence of erroneous assertions and alignments, often caused by lexical or semantic confusion. We study the problem of correcting such assertions and alignments, and present a general correction framework which combines lexical matching, contextaware sub-KB extraction, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated with one set of literal assertions from DBpedia, one set of entity assertions from an enterprise medical KB, and one set of mapping assertions from a music KB constructed by integrating Wikidata, Discogs and MusicBrainz. It has achieved promising results, with a correction rate (i.e., the ratio of the target assertions/alignments that are corrected with right substitutes) of 70.1%, 60.9% and 71.8%, respectively
Fusing Automatically Extracted Annotations for the Semantic Web
This research focuses on the problem of semantic data fusion. Although various solutions have been developed in the research communities focusing on databases and formal logic, the choice of an appropriate algorithm is non-trivial because the performance of each algorithm and its optimal configuration parameters depend on the type of data, to which the algorithm is applied. In order to be reusable, the fusion system must be able to select appropriate techniques and use them in combination.
Moreover, because of the varying reliability of data sources and algorithms performing fusion subtasks, uncertainty is an inherent feature of semantically annotated data and has to be taken into account by the fusion system. Finally, the issue of schema heterogeneity can have a negative impact on the fusion performance. To address these issues, we propose KnoFuss: an architecture for Semantic Web data integration based on the principles of problem-solving methods. Algorithms dealing with different fusion subtasks are represented as components of a modular architecture, and their capabilities are described formally. This allows the architecture to select appropriate methods and configure them depending on the processed data. In order to handle uncertainty, we propose a novel algorithm based on the Dempster-Shafer belief propagation. KnoFuss employs this algorithm to reason about uncertain data and method results in order to refine the fused knowledge base. Tests show that these solutions lead to improved fusion performance. Finally, we addressed the problem of data fusion in the presence of schema heterogeneity. We extended the KnoFuss framework to exploit results of automatic schema alignment tools and proposed our own schema matching algorithm aimed at facilitating data fusion in the Linked Data environment. We conducted experiments with this approach and obtained a substantial improvement in performance in comparison with public data repositories
Automatic refinement of large-scale cross-domain knowledge graphs
Knowledge graphs are a way to represent complex structured and unstructured information
integrated into an ontology, with which one can reason about the existing
information to deduce new information or highlight inconsistencies. Knowledge
graphs are divided into the terminology box (TBox), also known as ontology, and
the assertions box (ABox). The former consists of a set of schema axioms defining
classes and properties which describe the data domain. Whereas the ABox consists
of a set of facts describing instances in terms of the TBox vocabulary.
In the recent years, there have been several initiatives for creating large-scale
cross-domain knowledge graphs, both free and commercial, with DBpedia, YAGO,
and Wikidata being amongst the most successful free datasets. Those graphs are
often constructed with the extraction of information from semi-structured knowledge,
such as Wikipedia, or unstructured text from the web using NLP methods. It
is unlikely, in particular when heuristic methods are applied and unreliable sources
are used, that the knowledge graph is fully correct or complete. There is a tradeoff
between completeness and correctness, which is addressed differently in each
knowledge graph’s construction approach.
There is a wide variety of applications for knowledge graphs, e.g. semantic
search and discovery, question answering, recommender systems, expert systems
and personal assistants. The quality of a knowledge graph is crucial for its applications.
In order to further increase the quality of such large-scale knowledge graphs,
various automatic refinement methods have been proposed. Those methods try to
infer and add missing knowledge to the graph, or detect erroneous pieces of information.
In this thesis, we investigate the problem of automatic knowledge graph
refinement and propose methods that address the problem from two directions, automatic
refinement of the TBox and of the ABox.
In Part I we address the ABox refinement problem. We propose a method for
predicting missing type assertions using hierarchical multilabel classifiers and ingoing/
outgoing links as features. We also present an approach to detection of relation
assertion errors which exploits type and path patterns in the graph. Moreover,
we propose an approach to correction of relation errors originating from confusions
between entities. Also in the ABox refinement direction, we propose a knowledge
graph model and process for synthesizing knowledge graphs for benchmarking
ABox completion methods.
In Part II we address the TBox refinement problem. We propose methods for inducing flexible relation constraints from the ABox, which are expressed using
SHACL.We introduce an ILP refinement step which exploits correlations between
numerical attributes and relations in order to the efficiently learn Horn rules with
numerical attributes. Finally, we investigate the introduction of lexical information
from textual corpora into the ILP algorithm in order to improve quality of induced
class expressions
Knowledge base ontological debugging guided by linguistic evidence
Le résumé en français n'a pas été communiqué par l'auteur.When they grow in size, knowledge bases (KBs) tend to include sets of axioms which are intuitively absurd but nonetheless logically consistent. This is particularly true of data expressed in OWL, as part of the Semantic Web framework, which favors the aggregation of set of statements from multiple sources of knowledge, with overlapping signatures.Identifying nonsense is essential if one wants to avoid undesired inferences, but the sparse usage of negation within these datasets generally prevents the detection of such cases on a strict logical basis. And even if the KB is inconsistent, identifying the axioms responsible for the nonsense remains a non trivial task. This thesis investigates the usage of automatically gathered linguistic evidence in order to detect and repair violations of common sense within such datasets. The main intuition consists in exploiting distributional similarity between named individuals of an input KB, in order to identify consequences which are unlikely to hold if the rest of the KB does. Then the repair phase consists in selecting axioms to be preferably discarded (or at least amended) in order to get rid of the nonsense. A second strategy is also presented, which consists in strengthening the input KB with a foundational ontology, in order to obtain an inconsistency, before performing a form of knowledge base debugging/revision which incorporates this linguistic input. This last step may also be applied directly to an inconsistent input KB. These propositions are evaluated with different sets of statements issued from the Linked Open Data cloud, as well as datasets of a higher quality, but which were automatically degraded for the evaluation. The results seem to indicate that distributional evidence may actually constitute a relevant common ground for deciding between conflicting axioms
Knowledge base ontological debugging guided by linguistic evidence
Le résumé en français n'a pas été communiqué par l'auteur.When they grow in size, knowledge bases (KBs) tend to include sets of axioms which are intuitively absurd but nonetheless logically consistent. This is particularly true of data expressed in OWL, as part of the Semantic Web framework, which favors the aggregation of set of statements from multiple sources of knowledge, with overlapping signatures.Identifying nonsense is essential if one wants to avoid undesired inferences, but the sparse usage of negation within these datasets generally prevents the detection of such cases on a strict logical basis. And even if the KB is inconsistent, identifying the axioms responsible for the nonsense remains a non trivial task. This thesis investigates the usage of automatically gathered linguistic evidence in order to detect and repair violations of common sense within such datasets. The main intuition consists in exploiting distributional similarity between named individuals of an input KB, in order to identify consequences which are unlikely to hold if the rest of the KB does. Then the repair phase consists in selecting axioms to be preferably discarded (or at least amended) in order to get rid of the nonsense. A second strategy is also presented, which consists in strengthening the input KB with a foundational ontology, in order to obtain an inconsistency, before performing a form of knowledge base debugging/revision which incorporates this linguistic input. This last step may also be applied directly to an inconsistent input KB. These propositions are evaluated with different sets of statements issued from the Linked Open Data cloud, as well as datasets of a higher quality, but which were automatically degraded for the evaluation. The results seem to indicate that distributional evidence may actually constitute a relevant common ground for deciding between conflicting axioms
- …