Property Label Stability in Wikidata
Stability in Wikidata's schema is essential for the reuse of its data. In this paper, we analyze the stability of the data based on changes to property labels in six languages. We find that the schema is overall stable, making it a reliable resource for external usage.
What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization
Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG.
In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node, based on its type and the information in the KG. Stepping away from traditional support/confidence-based rule mining techniques, we propose KGist, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle, a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago) and to tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGist outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying the location of up to 93% of missing entities, over 10% more than baselines), while also being efficient for large knowledge graphs.
Comment: 10 pages, plus 2 pages of references. 5 figures. Accepted at The Web Conference 2020.
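The MDL idea behind KGist can be illustrated with a minimal sketch: a rule is good if encoding it plus the exceptions it leaves unexplained costs fewer bits than listing the facts directly. The rule shape (root type, predicate, neighbor type) and the bit costs below are simplifying assumptions for illustration, not the paper's actual encoding.

```python
import math
from collections import defaultdict

# Toy KG: triples (subject, predicate, object) plus entity types.
triples = {
    ("ann", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "paris"),
}
types = {"ann": "Person", "bob": "Person", "acme": "Company", "paris": "City"}

entities_by_type = defaultdict(set)
for e, t in types.items():
    entities_by_type[t].add(e)

def explains(rule, subject):
    """Does the rule's expected edge exist for this subject?"""
    root_type, pred, neigh_type = rule
    return any(s == subject and p == pred and types.get(o) == neigh_type
               for s, p, o in triples)

def description_length(rule):
    """Bits to encode the rule plus bits to flag each unexplained node."""
    root_type, pred, neigh_type = rule
    n_types = len(set(types.values()))
    n_preds = len({p for _, p, _ in triples})
    model_bits = 2 * math.log2(n_types) + math.log2(n_preds)
    error_bits = sum(math.log2(len(types))
                     for node in entities_by_type[root_type]
                     if not explains(rule, node))
    return model_bits + error_bits

# Rules whose expected pattern holds for most nodes compress the KG better.
print(description_length(("Person", "worksFor", "Company")))  # low cost
print(description_length(("Person", "livesIn", "City")))      # higher cost
```

A rule with a low description length describes what is "normal"; nodes that force exception bits are candidates for strange or missing information.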
Knowledge Base Maintenance Using Constraints (Maintenance des bases de connaissances à l'aide de contraintes)
Knowledge bases are huge collections of primarily encyclopedic facts. They are widely used in entity recognition, structured search, question answering, and other tasks. These knowledge bases have to be curated, and this is a crucial but costly task. In this thesis, we are concerned with curating knowledge bases automatically using constraints. Our first contribution aims at discovering constraints automatically: we improve standard rule mining approaches by using (in-)completeness meta-information, and we show that this information can increase the quality of the learned rules significantly. Our second contribution is the creation of a knowledge base, YAGO 4, where we statically enforce a set of constraints by removing the facts that do not comply with them. Our last contribution is a method to correct constraint violations automatically. Our method uses the edit history of the knowledge base to see how users corrected violations in the past, in order to propose corrections for the present.
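The first contribution, completeness-aware rule mining, can be sketched as follows: a missing head fact counts against a rule only when the subject is known to be complete for that relation. The rule shape and the exact confidence definition below are illustrative assumptions, not the thesis' precise formulas.

```python
triples = {
    ("ann", "citizenOf", "france"),
    ("ann", "livesIn", "paris"),
    ("bob", "livesIn", "paris"),
}
# Meta-information: subjects known to have *all* their citizenOf facts in the KB.
complete_for_citizenship = {"ann"}

def completeness_aware_confidence(rule_head="citizenOf", rule_body="livesIn"):
    """Confidence of livesIn(x, paris) => citizenOf(x, france), counting a
    missing head fact as a counterexample only when x is known to be complete."""
    support, counter = 0, 0
    subjects = {s for s, p, _ in triples if p == rule_body}
    for x in subjects:
        if (x, rule_head, "france") in triples:
            support += 1
        elif x in complete_for_citizenship:
            counter += 1
        # Otherwise the head fact may simply be missing: not held against the rule.
    return support / (support + counter) if support + counter else 0.0

print(completeness_aware_confidence())  # 1.0: bob's missing citizenship is not penalized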
Learning How to Correct a Knowledge Base from the Edit History
The curation of a knowledge base is a crucial but costly task. In this work, we propose to take advantage of the edit history of the knowledge base in order to learn how to correct constraint violations. Our method is based on rule mining, and uses the edits that solved some violations in the past to infer how to solve similar violations in the present. The experimental evaluation of our method on Wikidata shows significant improvements over baselines.
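The following is a minimal sketch of the underlying idea: group past edits by the violation they resolved and recommend the correction users applied most often. The edit representation and the "most frequent past fix" heuristic are simplifying assumptions, not the paper's actual rule-mining algorithm.

```python
from collections import Counter

# Past edits that resolved a "single value" constraint violation on dateOfBirth:
# each entry is (violation_type, action), where the action abstracts what the user did.
edit_history = [
    ("single-value:dateOfBirth", "keep-most-referenced-value"),
    ("single-value:dateOfBirth", "keep-most-referenced-value"),
    ("single-value:dateOfBirth", "keep-latest-value"),
]

def learn_correction_rules(history):
    """Map each violation type to the correction users applied most often."""
    by_violation = {}
    for violation, action in history:
        by_violation.setdefault(violation, Counter())[action] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_violation.items()}

rules = learn_correction_rules(edit_history)
print(rules["single-value:dateOfBirth"])  # -> "keep-most-referenced-value"
```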
YAGO 4: A Reason-able Knowledge Base
YAGO is one of the large knowledge bases in the Linked Open Data cloud. In this resource paper, we present its latest version, YAGO 4, which reconciles the rigorous typing and constraints of schema.org with the rich instance data of Wikidata. The resulting resource contains 2 billion type-consistent triples for 64 million entities, and has a consistent ontology that allows semantic reasoning with OWL 2 description logics.
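The kind of static constraint enforcement described above can be sketched as a filter that keeps only facts whose subject type matches the property's declared domain. The property-domain table and the triples below are illustrative, not YAGO 4's actual schema.org mapping.

```python
domains = {
    "birthPlace": "Person",
    "foundingDate": "Organization",
}
types = {"Q42": "Person", "Q95": "Organization"}

triples = [
    ("Q42", "birthPlace", "Q84"),      # a Person with a birth place: kept
    ("Q95", "birthPlace", "Q62"),      # an Organization with a birth place: dropped
    ("Q95", "foundingDate", "1998"),   # consistent: kept
]

def type_consistent(s, p, o):
    """A fact is kept only if its subject has the type required by the property."""
    required = domains.get(p)
    return required is None or types.get(s) == required

clean = [t for t in triples if type_consistent(*t)]
print(clean)  # the Organization/birthPlace fact is removed
```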
Question Answering Benchmarks for Wikidata
Wikidata is becoming an increasingly important knowledge base whose usage is spreading in the research community. However, most question answering evaluation datasets rely on Freebase or DBpedia. We present two new datasets in order to train and benchmark QA systems over Wikidata. The first is a translation of the popular SimpleQuestions dataset to Wikidata; the second is a dataset created by collecting user feedback.
Thymeflow, A Personal Knowledge Base with Spatio-Temporal Data
The typical Internet user has data spread over several devices and across several online systems. We demonstrate an open-source system for integrating a user's data from different sources into a single knowledge base. Our system integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location history. It is able to detect event periods in the user's location data and align them with calendar events. We will demonstrate how to query the system within and across different dimensions, and how to perform analytics over emails, events, and locations.
A Knowledge Base for Personal Information Management
Internet users have personal data spread over several devices and across several web systems. In this paper, we introduce a novel open-source framework for integrating the data of a user from different sources into a single knowledge base. Our framework integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location history. We show how event periods in the user's location data can be detected and how they can be aligned with events from the calendar. This allows users to query their personal information within and across different dimensions, and to perform analytics over their emails, events, and locations. Our system models data using RDF, extending the schema.org vocabulary and providing a SPARQL interface.
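A minimal sketch of what querying such a personal knowledge base could look like, using rdflib and schema.org terms: the entity URIs and the exact vocabulary extensions are illustrative assumptions, not the framework's actual data model.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
g = Graph()

# A calendar event modeled with schema.org terms.
event = URIRef("http://example.org/me/event/standup")
g.add((event, RDF.type, SCHEMA.Event))
g.add((event, SCHEMA.name, Literal("Team standup")))
g.add((event, SCHEMA.location, Literal("Paris office")))

# Which of my calendar events have a known location?
results = g.query("""
    PREFIX schema: <http://schema.org/>
    SELECT ?name ?place WHERE {
        ?e a schema:Event ;
           schema:name ?name ;
           schema:location ?place .
    }
""")
for name, place in results:
    print(name, "@", place)
```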
Platypus – A Multilingual Question Answering Platform for Wikidata
In this paper we present Platypus, a natural language question answering system. Our objective is to provide the research community with a production-ready multilingual question answering platform that targets Wikidata, the largest general-purpose knowledge base on the Semantic Web. Our platform can answer complex queries in several languages, using hybrid grammatical and template-based techniques.
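The template half of such a hybrid pipeline can be sketched as mapping a question pattern to a parameterized SPARQL query over Wikidata. The single regex template and the property choice (P19, place of birth) below are illustrative only; Platypus's grammatical analysis is far richer.

```python
import re

TEMPLATES = [
    (re.compile(r"where was (?P<person>.+) born\??", re.IGNORECASE),
     """SELECT ?placeLabel WHERE {{
          ?person rdfs:label "{person}"@en ;
                  wdt:P19 ?place .
          SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}"""),
]

def question_to_sparql(question):
    """Return a SPARQL query for the first matching template, if any."""
    for pattern, template in TEMPLATES:
        match = pattern.fullmatch(question.strip())
        if match:
            return template.format(person=match.group("person"))
    return None

print(question_to_sparql("Where was Ada Lovelace born?"))
```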