13,237 research outputs found
Correcting Knowledge Base Assertions
The usefulness and usability of knowledge bases (KBs) is often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB
Automatic refinement of large-scale cross-domain knowledge graphs
Knowledge graphs are a way to represent complex structured and unstructured information
integrated into an ontology, with which one can reason about the existing
information to deduce new information or highlight inconsistencies. Knowledge
graphs are divided into the terminology box (TBox), also known as ontology, and
the assertions box (ABox). The former consists of a set of schema axioms defining
classes and properties which describe the data domain. Whereas the ABox consists
of a set of facts describing instances in terms of the TBox vocabulary.
In the recent years, there have been several initiatives for creating large-scale
cross-domain knowledge graphs, both free and commercial, with DBpedia, YAGO,
and Wikidata being amongst the most successful free datasets. Those graphs are
often constructed with the extraction of information from semi-structured knowledge,
such as Wikipedia, or unstructured text from the web using NLP methods. It
is unlikely, in particular when heuristic methods are applied and unreliable sources
are used, that the knowledge graph is fully correct or complete. There is a tradeoff
between completeness and correctness, which is addressed differently in each
knowledge graph’s construction approach.
There is a wide variety of applications for knowledge graphs, e.g. semantic
search and discovery, question answering, recommender systems, expert systems
and personal assistants. The quality of a knowledge graph is crucial for its applications.
In order to further increase the quality of such large-scale knowledge graphs,
various automatic refinement methods have been proposed. Those methods try to
infer and add missing knowledge to the graph, or detect erroneous pieces of information.
In this thesis, we investigate the problem of automatic knowledge graph
refinement and propose methods that address the problem from two directions, automatic
refinement of the TBox and of the ABox.
In Part I we address the ABox refinement problem. We propose a method for
predicting missing type assertions using hierarchical multilabel classifiers and ingoing/
outgoing links as features. We also present an approach to detection of relation
assertion errors which exploits type and path patterns in the graph. Moreover,
we propose an approach to correction of relation errors originating from confusions
between entities. Also in the ABox refinement direction, we propose a knowledge
graph model and process for synthesizing knowledge graphs for benchmarking
ABox completion methods.
In Part II we address the TBox refinement problem. We propose methods for inducing flexible relation constraints from the ABox, which are expressed using
SHACL.We introduce an ILP refinement step which exploits correlations between
numerical attributes and relations in order to the efficiently learn Horn rules with
numerical attributes. Finally, we investigate the introduction of lexical information
from textual corpora into the ILP algorithm in order to improve quality of induced
class expressions
Going Deeper with Semantics: Video Activity Interpretation using Semantic Contextualization
A deeper understanding of video activities extends beyond recognition of
underlying concepts such as actions and objects: constructing deep semantic
representations requires reasoning about the semantic relationships among these
concepts, often beyond what is directly observed in the data. To this end, we
propose an energy minimization framework that leverages large-scale commonsense
knowledge bases, such as ConceptNet, to provide contextual cues to establish
semantic relationships among entities directly hypothesized from video signal.
We mathematically express this using the language of Grenander's canonical
pattern generator theory. We show that the use of prior encoded commonsense
knowledge alleviate the need for large annotated training datasets and help
tackle imbalance in training through prior knowledge. Using three different
publicly available datasets - Charades, Microsoft Visual Description Corpus and
Breakfast Actions datasets, we show that the proposed model can generate video
interpretations whose quality is better than those reported by state-of-the-art
approaches, which have substantial training needs. Through extensive
experiments, we show that the use of commonsense knowledge from ConceptNet
allows the proposed approach to handle various challenges such as training data
imbalance, weak features, and complex semantic relationships and visual scenes.Comment: Accepted to WACV 201
Regression-free Synthesis for Concurrency
While fixing concurrency bugs, program repair algorithms may introduce new
concurrency bugs. We present an algorithm that avoids such regressions. The
solution space is given by a set of program transformations we consider in for
repair process. These include reordering of instructions within a thread and
inserting atomic sections. The new algorithm learns a constraint on the space
of candidate solutions, from both positive examples (error-free traces) and
counterexamples (error traces). From each counterexample, the algorithm learns
a constraint necessary to remove the errors. From each positive examples, it
learns a constraint that is necessary in order to prevent the repair from
turning the trace into an error trace. We implemented the algorithm and
evaluated it on simplified Linux device drivers with known bugs.Comment: for source code see https://github.com/thorstent/ConRepai
What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization
Knowledge graphs (KGs) store highly heterogeneous information about the world
in the structure of a graph, and are useful for tasks such as question
answering and reasoning. However, they often contain errors and are missing
information. Vibrant research in KG refinement has worked to resolve these
issues, tailoring techniques to either detect specific types of errors or
complete a KG.
In this work, we introduce a unified solution to KG characterization by
formulating the problem as unsupervised KG summarization with a set of
inductive, soft rules, which describe what is normal in a KG, and thus can be
used to identify what is abnormal, whether it be strange or missing. Unlike
first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns
that describe the expected neighborhood around a (seen or unseen) node, based
on its type, and information in the KG. Stepping away from the traditional
support/confidence-based rule mining techniques, we propose KGist, Knowledge
Graph Inductive SummarizaTion, which learns a summary of inductive rules that
best compress the KG according to the Minimum Description Length principle---a
formulation that we are the first to use in the context of KG rule mining. We
apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as
compression, various types of error detection, and identification of incomplete
information. We show that KGist outperforms task-specific, supervised and
unsupervised baselines in error detection and incompleteness identification,
(identifying the location of up to 93% of missing entities---over 10% more than
baselines), while also being efficient for large knowledge graphs.Comment: 10 pages, plus 2 pages of references. 5 figures. Accepted at The Web
Conference 202
The IBMAP approach for Markov networks structure learning
In this work we consider the problem of learning the structure of Markov
networks from data. We present an approach for tackling this problem called
IBMAP, together with an efficient instantiation of the approach: the IBMAP-HC
algorithm, designed for avoiding important limitations of existing
independence-based algorithms. These algorithms proceed by performing
statistical independence tests on data, trusting completely the outcome of each
test. In practice tests may be incorrect, resulting in potential cascading
errors and the consequent reduction in the quality of the structures learned.
IBMAP contemplates this uncertainty in the outcome of the tests through a
probabilistic maximum-a-posteriori approach. The approach is instantiated in
the IBMAP-HC algorithm, a structure selection strategy that performs a
polynomial heuristic local search in the space of possible structures. We
present an extensive empirical evaluation on synthetic and real data, showing
that our algorithm outperforms significantly the current independence-based
algorithms, in terms of data efficiency and quality of learned structures, with
equivalent computational complexities. We also show the performance of IBMAP-HC
in a real-world application of knowledge discovery: EDAs, which are
evolutionary algorithms that use structure learning on each generation for
modeling the distribution of populations. The experiments show that when
IBMAP-HC is used to learn the structure, EDAs improve the convergence to the
optimum
- …