Transformer-based Subject Entity Detection in Wikipedia Listings
In tasks like question answering or text summarisation, it is essential to
have background knowledge about the relevant entities. The information about
entities - in particular, about long-tail or emerging entities - in publicly
available knowledge graphs like DBpedia or CaLiGraph is far from complete. In
this paper, we present an approach that exploits the semi-structured nature of
listings (like enumerations and tables) to identify the main entities of the
listing items (i.e., of entries and rows). These entities, which we call
subject entities, can be used to increase the coverage of knowledge graphs. Our
approach uses a transformer network to identify subject entities at the
token-level and surpasses an existing approach in terms of performance while
being bound by fewer limitations. Due to a flexible input format, it is
applicable to any kind of listing and is, unlike prior work, not dependent on
entity boundaries as input. We demonstrate our approach by applying it to the
complete Wikipedia corpus and extracting 40 million mentions of subject
entities with an estimated precision of 71% and recall of 77%. The results are
incorporated in the most recent version of CaLiGraph.
Comment: Published at the Deep Learning for Knowledge Graphs workshop (DL4KG) at
the International Semantic Web Conference 2022 (ISWC 2022)
NASTyLinker: NIL-Aware Scalable Transformer-based Entity Linker
Entity Linking (EL) is the task of detecting mentions of entities in text and
disambiguating them to a reference knowledge base. Most prevalent EL approaches
assume that the reference knowledge base is complete. In practice, however, it
is necessary to deal with the case of linking to an entity that is not
contained in the knowledge base (NIL entity). Recent works have shown that,
instead of focusing only on affinities between mentions and entities,
considering inter-mention affinities can be used to represent NIL entities by
producing clusters of mentions. At the same time, inter-mention affinities can
help to substantially improve linking performance for known entities. With
NASTyLinker, we introduce an EL approach that is aware of NIL entities and
produces corresponding mention clusters while maintaining high linking
performance for known entities. The approach clusters mentions and entities
based on dense representations from Transformers and resolves conflicts (if
more than one entity is assigned to a cluster) by computing transitive
mention-entity affinities. We show the effectiveness and scalability of
NASTyLinker on NILK, a dataset that is explicitly constructed to evaluate EL
with respect to NIL entities. Further, we apply the presented approach to an
actual EL task, namely to knowledge graph population by linking entities in
Wikipedia listings, and provide an analysis of the outcome.
Comment: Preprint of a paper in the research track of the 20th Extended
Semantic Web Conference (ESWC'23)
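The clustering and conflict-resolution idea described in this abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the affinity threshold, the use of the best product of edge affinities along a path as the "transitive" affinity, and all names (`build_graph`, `best_path_affinities`, `link_mentions`, the mention and entity identifiers) are assumptions made for illustration, and the affinities are hard-coded rather than computed from Transformer representations.

```python
import heapq
from collections import defaultdict


def build_graph(edges, threshold):
    """Keep only affinities at or above the (assumed) threshold as undirected edges."""
    graph = defaultdict(dict)
    for a, b, affinity in edges:
        if affinity >= threshold:
            graph[a][b] = affinity
            graph[b][a] = affinity
    return graph


def best_path_affinities(graph, entity, entities):
    """Max-product path affinity from one entity to every node it can reach.

    Paths are not allowed to pass through another entity.
    """
    best = {entity: 1.0}
    heap = [(-1.0, entity)]
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best[node]:
            continue  # stale heap entry
        if node in entities and node != entity:
            continue  # do not expand other entities
        for neighbour, affinity in graph[node].items():
            candidate = score * affinity
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return best


def link_mentions(mentions, entities, edges, threshold=0.5):
    """Link each mention to its best entity, or group it into a NIL cluster."""
    graph = build_graph(edges, threshold)
    entities = set(entities)
    scores = {e: best_path_affinities(graph, e, entities) for e in entities}

    links, unlinked = {}, []
    for mention in mentions:
        candidates = [(scores[e][mention], e) for e in entities if mention in scores[e]]
        if candidates:
            links[mention] = max(candidates)[1]  # conflict resolved by highest affinity
        else:
            unlinked.append(mention)

    # Mentions without any reachable entity form NIL clusters (connected components).
    nil_clusters, seen = [], set()
    for mention in unlinked:
        if mention in seen:
            continue
        cluster, stack = set(), [mention]
        while stack:
            node = stack.pop()
            if node not in cluster:
                cluster.add(node)
                stack.extend(graph[node])
        seen |= cluster
        nil_clusters.append(cluster)
    return links, nil_clusters


# Hypothetical toy affinities: (node, node, affinity)
edges = [
    ("m1", "E1", 0.9), ("m1", "m2", 0.8), ("m2", "E2", 0.6),
    ("m3", "m4", 0.7), ("m5", "E2", 0.3),
]
links, nil_clusters = link_mentions(["m1", "m2", "m3", "m4", "m5"], ["E1", "E2"], edges)
print(links)         # m1 and m2 are both linked to E1
print(nil_clusters)  # m3 and m4 share a NIL cluster; m5 is a NIL singleton
```

In this toy input, m2 sits in a cluster with both E1 and E2. It is linked to E1 because the path via m1 (0.8 × 0.9) scores higher than the direct edge to E2 (0.6).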
KGrEaT: A Framework to Evaluate Knowledge Graphs via Downstream Tasks
In recent years, countless research papers have addressed the topics of
knowledge graph creation, extension, or completion in order to create knowledge
graphs that are larger, more correct, or more diverse. This research is
typically motivated by the argument that using such enhanced knowledge
graphs to solve downstream tasks will improve performance. Nonetheless, this
claim is hardly ever evaluated. The predominant evaluation metrics, which aim at
correctness and completeness, are undoubtedly valuable but fail to capture the
complete picture, i.e., how useful the created or enhanced knowledge graph
actually is. Further, the accessibility of such a knowledge graph is rarely
considered (e.g., whether it contains expressive labels, descriptions, and
sufficient context information to link textual mentions to the entities of the
knowledge graph). To better judge how well knowledge graphs perform on actual
tasks, we present KGrEaT - a framework to estimate the quality of knowledge
graphs via actual downstream tasks like classification, clustering, or
recommendation. Instead of comparing different methods of processing knowledge
graphs with respect to a single task, the purpose of KGrEaT is to compare
various knowledge graphs as such by evaluating them on a fixed task setup. The
framework takes a knowledge graph as input, automatically maps it to the
datasets to be evaluated on, and computes performance metrics for the defined
tasks. It is built in a modular way to be easily extendable with additional
tasks and datasets.
Comment: Accepted for the Short Paper track of CIKM'23, October 21-25, 2023,
Birmingham, United Kingdom
Information extraction from co-occurring similar entities
Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of such knowledge graphs. In contrast to existing approaches, we do not focus on relationships within a listing (e.g., between two entities in a table row) but on the relationship between a listing’s subject entities and the context of the listing. To that end, we propose a descriptive rule mining approach that uses distant supervision to derive rules for these relationships based on a listing’s context. Extracted from a suitable data corpus, the rules can be used to extend a knowledge graph with novel entities and assertions. In our experiments, we demonstrate that the approach is able to extract up to 3M novel entities and 30M additional assertions from listings in Wikipedia. We find that the extracted information is of high quality and thus suitable to extend Wikipedia-based knowledge graphs like DBpedia, YAGO, and CaLiGraph. For the case of DBpedia, this would result in an increase of covered entities by roughly 50%.
The CaLiGraph ontology as a challenge for OWL reasoners
CaLiGraph is a large-scale cross-domain knowledge graph generated from Wikipedia by exploiting the category system, list pages, and other list structures in Wikipedia, containing more than 15 million typed entities and around 10 million relation assertions. Unlike knowledge graphs such as DBpedia and YAGO, whose ontologies are comparatively simplistic, CaLiGraph also has a rich ontology, comprising more than 200,000 class restrictions. Those two properties – a large A-box and a rich ontology – make it an interesting challenge for benchmarking reasoners. In this paper, we show that a reasoning task which is particularly relevant for CaLiGraph, i.e., the materialization of owl:hasValue constraints into assertions between individuals and between individuals and literals, is insufficiently supported by available reasoning systems. We provide differently sized benchmark subsets of CaLiGraph, which can be used for performance analysis of reasoning systems.
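To make the reasoning task concrete, the following minimal sketch shows what materializing owl:hasValue constraints means: if a class is restricted to have value v for property p, every individual of that class (or of one of its subclasses) receives the assertion (individual, p, v). The data structures, the function names, and the example classes and individuals are hypothetical and chosen for illustration only. A real reasoner operates on OWL axioms, not on Python dictionaries.

```python
def superclasses(cls, subclass_of):
    """Return cls together with all of its transitive superclasses."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def materialize_has_value(type_assertions, restrictions, subclass_of):
    """Turn hasValue restrictions into explicit (individual, property, value) triples.

    type_assertions: iterable of (individual, class) pairs
    restrictions:    dict mapping a class to a list of (property, value) pairs
    subclass_of:     dict mapping a class to a list of its direct superclasses
    """
    triples = set()
    for individual, cls in type_assertions:
        for ancestor in superclasses(cls, subclass_of):
            for prop, value in restrictions.get(ancestor, ()):
                triples.add((individual, prop, value))
    return triples


# Hypothetical toy ontology and A-box
subclass_of = {"AlbumByTheBeatles": ["Album"]}
restrictions = {
    "AlbumByTheBeatles": [("artist", "The_Beatles")],  # value is an individual
    "Album": [("mediaType", "music")],                 # value is a literal
}
type_assertions = [("Abbey_Road", "AlbumByTheBeatles"), ("Thriller", "Album")]

inferred = materialize_has_value(type_assertions, restrictions, subclass_of)
for triple in sorted(inferred):
    print(triple)
```

The naive loop visits every individual once per applicable restriction. That is trivial for a toy example but gives a sense of why this task becomes demanding with millions of typed entities and hundreds of thousands of restrictions.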
Language-agnostic relation extraction from abstracts in Wikis
Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph.