9 research outputs found

    Vermeidung von Repräsentationsheterogenitäten in realweltlichen Wissensgraphen

    Knowledge graphs are repositories providing factual knowledge about entities. They are a great source of knowledge to support modern AI applications for Web search, question answering, digital assistants, and online shopping. Advances in machine learning techniques and the Web's growth have led to colossal knowledge graphs with billions of facts about hundreds of millions of entities, collected from a large variety of sources. While integrating independent knowledge sources promises rich information, it inherently leads to heterogeneities in representation due to a large variety of different conceptualizations. These heterogeneities threaten the overall utility of real-world knowledge graphs, which, due to their sheer size, can hardly be curated manually anymore; automatic and semi-automatic methods are needed to cope with these vast knowledge repositories. We first address the general topic of representation heterogeneity by surveying the problem throughout various data-intensive fields: databases, ontologies, and knowledge graphs. Different techniques for automatically resolving heterogeneity issues are presented and discussed, and several open problems are identified. Next, we focus on entity heterogeneity. We show that automatic matching techniques may run into quality problems in a multi-knowledge-graph scenario due to incorrect transitive identity links, and we present four techniques that can significantly improve the quality of arbitrary entity matching tools. Concerning relation heterogeneity, we show that synonymous relations in knowledge graphs pose several difficulties in querying. We therefore resolve these heterogeneities with knowledge graph embeddings and with Horn rule mining; both methods detect synonymous relations in knowledge graphs with high quality. Furthermore, we present a novel technique for avoiding heterogeneity issues at query time by using implicit knowledge storage. We show that large neural language models are a valuable source of knowledge that can be queried similarly to knowledge graphs and that already resolve several heterogeneity issues internally.
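
    As a rough illustration of the embedding-based detection of synonymous relations described above, the following minimal sketch compares pre-trained relation vectors by cosine similarity. The relation names, vectors, and threshold are invented for illustration; they are not values from the thesis, whose actual models and scoring may differ.

        # Hedged sketch: flag relation pairs whose pre-trained embeddings are
        # nearly parallel as synonym candidates. Vectors and the threshold are
        # illustrative assumptions, not outputs of the thesis's models.
        import numpy as np

        relation_vectors = {
            "birthPlace":   np.array([0.81, 0.10, 0.55]),
            "placeOfBirth": np.array([0.80, 0.12, 0.56]),
            "spouse":       np.array([0.05, 0.92, 0.30]),
        }

        def cosine(u, v):
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        def synonym_candidates(vectors, threshold=0.99):
            """Return relation pairs whose embedding similarity exceeds the threshold."""
            names = sorted(vectors)
            return [(a, b, round(cosine(vectors[a], vectors[b]), 3))
                    for i, a in enumerate(names) for b in names[i + 1:]
                    if cosine(vectors[a], vectors[b]) >= threshold]

        print(synonym_candidates(relation_vectors))  # [('birthPlace', 'placeOfBirth', 1.0)]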

    Was Suchmaschinen nicht können. Holistische Entitätssuche auf Web Daten

    More than 50% of all Web queries are entity-related: users search either for entities or for entity information. Still, search engines do not accommodate entity-centric search very well. Building on the concept of the semiotic triangle from cognitive psychology, which models entity types in terms of intensions and extensions, we identified three types of queries for retrieving entities: type-based queries (searching for entities of a given type), prototype-based queries (searching for entities having certain properties), and instance-based queries (searching for entities similar to a given entity). For type-based queries we present a method that combines query expansion with a self-supervised vocabulary learning technique built on both structured and unstructured data; our approach achieves a good trade-off between precision and recall. For prototype-based queries we propose ProSWIP, a property-based system for retrieving entities from the Web. Since the number of properties given by the users can be quite small, ProSWIP relies on direct questions and user feedback to expand the set of properties to one that captures the user's intentions correctly. Our experiments show that within a maximum of four questions the system achieves perfect precision of the selected entities. In the case of instance-based queries, the first challenge is to establish a query form that allows for disambiguating user intentions without putting too much cognitive pressure on the user. We propose a minimalistic instance-based query comprising the example entity and the intended entity type. With this query, and building on the concept of family resemblance, we present a practical way of retrieving entities related to the query entity directly from the Web. Our approach can even cope with queries which have proven problematic for benchmark tasks like related entity finding. Entity summarization, providing information about a given entity, is another kind of entity-centric query. Google's Knowledge Graph is the state of the art for this task, but by relying entirely on manually curated knowledge bases, it excludes new and less-known entities. We propose a data-driven approach instead; our experiments on real-world entities show the superiority of our method. We are confident that mastering these four query types enables holistic entity search on Web data for the next generation of search engines.
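
    The family-resemblance idea for instance-based queries can be made concrete with a small ranking sketch: candidates score by how many observed features they share with the example entity, with no single feature being mandatory. All entity names and feature sets below are invented for illustration; a real system would extract them from the Web.

        # Hedged sketch of family-resemblance ranking for instance-based queries.
        # Feature sets are invented; the thesis's actual features come from Web data.

        def resemblance(example_features, candidate_features):
            """Jaccard overlap: shared features relative to all observed features."""
            union = example_features | candidate_features
            return len(example_features & candidate_features) / len(union) if union else 0.0

        example = {"alpine", "freshwater", "glacial", "swimming"}  # query entity's features
        candidates = {
            "LakeA": {"alpine", "freshwater", "boating"},
            "LakeB": {"saltwater", "desert"},
            "LakeC": {"glacial", "freshwater", "swimming"},
        }

        for name, feats in sorted(candidates.items(),
                                  key=lambda kv: resemblance(example, kv[1]),
                                  reverse=True):
            print(name, round(resemblance(example, feats), 2))  # LakeC 0.75, LakeA 0.4, LakeB 0.0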

    Ontology alignment processes and methods

    As web services have evolved, the need to define ontologies has grown. Ontologies make it possible to increase the semantic richness of documents on the Internet and to create documents in such a way that a computer can understand their content. The goal of this work is to study ontology alignment and to understand the overall process involved, focusing in particular on the role of the expert in the alignment process. The theoretical background of the work consists of a literature review on ontology alignment and a study of different ontology alignment tools. The alignment process is examined in practice through a case study covering two different cases. Based on the case study, it can be concluded that the expert's role is significant for ontologies and especially for ontology alignment. Further research on the use and development of ontologies is needed so that ontologies become more widely adopted in web services.

    Enterprise information integration: on discovering links using genetic programming

    Both established and emergent businesses rely heavily on data, chiefly those that wish to become game changers. The currently biggest source of data is the Web, where a large amount of sparse data resides. The Web of Data aims at providing a unified view of these islands of data. To realise this vision, the resources in different data sources that refer to the same real-world entities must be linked, which is the key factor for such a unified view. Link discovery aims at finding link rules that specify whether such links must be established or not. There are currently many proposals in the literature to produce these links, especially ones based on meta-heuristics. Unfortunately, creating proposals based on meta-heuristics is not a trivial task, which has led to a lack of comparison between some well-established proposals. Furthermore, it has been proved that these link rules fall short in cases in which resources that refer to different real-world entities are very similar, or, conversely, resources that refer to the same entity are very dissimilar. In this dissertation, we introduce several proposals to address these gaps in the literature. On the one hand, we introduce Eva4LD, a generic framework to build genetic-programming proposals for link discovery, genetic programming being a kind of meta-heuristic; our framework allows many proposals in the literature to be implemented and their results compared fairly. On the other hand, we introduce Teide, which applies link rules effectively, increasing their precision significantly without significantly dropping their recall. Unfortunately, Teide does not learn link rules, and applying all the provided link rules is computationally expensive. For this reason we introduce Sorbas, which learns what we call contextual link rules.
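
    To give a feel for the genetic-programming view of link discovery mentioned above, the sketch below encodes a link rule as a small boolean tree over attribute similarities, scores it by F1 on labelled pairs, and shows one mutation operator. The rule encoding, data, and thresholds are invented for illustration and do not reproduce Eva4LD's actual design.

        # Hedged sketch: link rules as boolean trees over string similarities,
        # with F1-based fitness and a threshold-mutation operator. All data and
        # encodings are illustrative, not Eva4LD's actual representation.
        import random
        from difflib import SequenceMatcher

        def sim(a, b):
            """Simple string similarity in [0, 1]."""
            return SequenceMatcher(None, a, b).ratio()

        # A rule is ("cmp", attribute, threshold) or ("and"/"or", left, right).
        def apply_rule(rule, pair):
            if rule[0] == "cmp":
                _, attr, thr = rule
                return sim(pair[0][attr], pair[1][attr]) >= thr
            left, right = apply_rule(rule[1], pair), apply_rule(rule[2], pair)
            return (left and right) if rule[0] == "and" else (left or right)

        def fitness(rule, labelled_pairs):
            """F1 score of the rule's link decisions against the labels."""
            tp = fp = fn = 0
            for pair, is_link in labelled_pairs:
                pred = apply_rule(rule, pair)
                tp += pred and is_link
                fp += pred and not is_link
                fn += (not pred) and is_link
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0

        def mutate(rule):
            """One GP mutation operator: randomly perturb a comparison threshold."""
            if rule[0] == "cmp":
                new_thr = min(1.0, max(0.0, rule[2] + random.uniform(-0.1, 0.1)))
                return ("cmp", rule[1], new_thr)
            return (rule[0], mutate(rule[1]), mutate(rule[2]))

        pairs = [(({"name": "IBM Corp."}, {"name": "IBM Corporation"}), True),
                 (({"name": "IBM Corp."}, {"name": "Intel"}), False)]
        rule = ("cmp", "name", 0.6)
        print(fitness(rule, pairs))  # 1.0 on this toy sample
        print(mutate(rule))          # same rule with a slightly shifted threshold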

    Liage de données RDF : évaluation d'approches interlingues

    The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph whose resources are nodes labelled in natural languages. One of the key challenges of linked data is to discover links across RDF data sets: given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing the text information of neighboring nodes; the labels of the neighboring nodes constitute the context of a resource. Once virtual documents are created, they are projected into the same space in order to be compared, which can be achieved by using machine translation or multilingual lexical resources. Once documents are in the same space, similarity measures are applied to find identical resources; the similarity between elements of this space is taken as the similarity between the corresponding RDF resources. We experimentally evaluated different cross-lingual methods for linking RDF data within the proposed framework. In particular, two strategies were explored: applying machine translation and using references to multilingual lexical resources. Overall, the evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods were evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach, which shows that the similarity-based method can be successfully applied to RDF resources independently of their type (named entities or thesaurus concepts). The best experimental results involving just a pair of languages demonstrate the usefulness of such techniques for interlinking RDF resources cross-lingually.
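
    The virtual-document approach lends itself to a compact sketch: each resource becomes a bag of words built from its own label and its neighbours' labels, both sides are projected into one language, and token similarity decides link candidates. Here a toy dictionary stands in for machine translation, and all labels are invented.

        # Hedged sketch of cross-lingual linking via virtual documents. The toy
        # dictionary stands in for machine translation or multilingual lexicons.
        from collections import Counter

        def virtual_document(label, neighbour_labels):
            """Bag of words from a resource's label and its neighbours' labels."""
            tokens = label.lower().split()
            for nl in neighbour_labels:
                tokens += nl.lower().split()
            return Counter(tokens)

        TOY_TRANSLATION = {"fleuve": "river", "eau": "water"}  # stand-in for MT

        def translate(doc):
            return Counter({TOY_TRANSLATION.get(t, t): c for t, c in doc.items()})

        def cosine(d1, d2):
            dot = sum(d1[t] * d2[t] for t in d1)
            n1 = sum(v * v for v in d1.values()) ** 0.5
            n2 = sum(v * v for v in d2.values()) ** 0.5
            return dot / (n1 * n2) if n1 and n2 else 0.0

        fr = translate(virtual_document("fleuve", ["france", "eau"]))
        en = virtual_document("river", ["france", "water"])
        print(cosine(fr, en))  # 1.0: the two resources become owl:sameAs candidates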

    Learning Expressive Linkage Rules for Entity Matching using Genetic Programming

    A central problem in data integration and data cleansing is to identify pairs of entities in data sets that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify how two entities are compared for equivalence. Unfortunately, writing accurate linkage rules by hand is a non-trivial problem that requires detailed knowledge of the involved data sets. Another important issue is the efficient execution of linkage rules. In this thesis, we propose a set of novel methods that cover the complete entity matching workflow, from the generation of linkage rules using genetic programming algorithms to their efficient execution on distributed systems. First, we propose a supervised learning algorithm that is capable of generating linkage rules from a gold standard consisting of a set of entity pairs that have been labeled as duplicates or non-duplicates. We show that the introduced algorithm outperforms previously proposed entity matching approaches, including the state-of-the-art genetic programming approach by de Carvalho et al., and is capable of learning linkage rules that achieve an accuracy similar to the human-written rule for the same problem. In order to also cover use cases for which no gold standard is available, we propose a complementary active learning algorithm that generates a gold standard interactively by asking the user to confirm or decline the equivalence of a small number of entity pairs. In the experimental evaluation, labeling at most 50 link candidates was necessary in order to match the performance achieved by the supervised GenLink algorithm on the entire gold standard. Finally, we propose an efficient execution workflow that can be run on a cluster of multiple machines. The execution workflow employs a novel multidimensional indexing method that allows the efficient execution of learned linkage rules by significantly reducing the number of required comparisons.
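
    The execution side can be illustrated with the standard blocking idea that underlies index-based rule execution: entities are grouped by a cheap key, and the (expensive) linkage rule runs only within matching blocks. This is a generic sketch under invented data, not the thesis's actual multidimensional index.

        # Hedged sketch of blocking for efficient linkage-rule execution: only
        # same-block pairs are compared instead of all n*m pairs. The blocking
        # key and entities are illustrative, not the thesis's actual index.
        from collections import defaultdict

        source = [{"id": "s1", "name": "Berlin Hauptbahnhof"},
                  {"id": "s2", "name": "Munich Airport"}]
        target = [{"id": "t1", "name": "Berlin Central Station"},
                  {"id": "t2", "name": "Hamburg Harbour"}]

        def blocking_key(entity):
            return entity["name"].split()[0].lower()  # cheap key: first name token

        def candidate_pairs(source, target):
            index = defaultdict(list)
            for t in target:
                index[blocking_key(t)].append(t)      # index the target once
            for s in source:
                for t in index.get(blocking_key(s), []):
                    yield s, t                        # rule runs only on these pairs

        for s, t in candidate_pairs(source, target):
            print(s["id"], "vs", t["id"])             # s1 vs t1: 1 of 4 pairs remains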

    Representation and execution of human know-how on the Web

    Structured data has been a major component of web resources since the very beginning of the web. Metadata that was originally mostly meant for display purposes gradually expanded to incorporate the semantic content of a page. Until now, semantic data on the web has mostly focused on factual knowledge, namely trying to capture “what humans know”. This thesis instead focuses on procedural knowledge, or in other words “how humans do things”, and in particular on step-by-step instructions. I will present a semantic framework to capture the meaning of sets of instructions with respect to their potential execution. This framework is based on a logical model which I evaluated in terms of its expressiveness and its compatibility with existing languages. I will show how this type of procedural knowledge can be automatically acquired from human-generated instructions on the web, while at the same time bridging the semantic gap, from unstructured to structured, by mapping these resources into a formal process description language. I will demonstrate how procedural and factual data on the web can be integrated automatically using Linked Data, and how this integration results in an overall richer semantic representation. To validate these claims, I have conducted large-scale knowledge acquisition and integration experiments on two prominent instructional websites and evaluated the results against a human benchmark. Finally, I will demonstrate how existing web technologies allow this data to seamlessly enrich existing web resources and to be used on the web without the need for centralisation. I have explored the potential uses of formalised instructions by implementing and testing concrete prototypes which enable human users to explore know-how and collaborate with machines in novel ways.
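
    As a loose illustration of formalised know-how, the sketch below models a task as ordered steps whose requirements point at Linked Data entities via URIs. The dataclass layout and the example URIs are assumptions for illustration only; the thesis's actual process description language is richer than this.

        # Hedged sketch: step-by-step instructions as structured process data,
        # with steps linked to factual Linked Data entities. The layout and
        # URIs are illustrative, not the thesis's formal language.
        from dataclasses import dataclass, field

        @dataclass
        class Step:
            text: str
            requires: list = field(default_factory=list)  # entity URIs used by the step

        @dataclass
        class Task:
            label: str
            steps: list = field(default_factory=list)

        pancakes = Task("Make pancakes", [
            Step("Whisk flour, milk and eggs into a batter",
                 requires=["http://dbpedia.org/resource/Flour",
                           "http://dbpedia.org/resource/Milk"]),
            Step("Fry ladlefuls of batter until golden"),
        ])

        for number, step in enumerate(pancakes.steps, 1):
            print(number, step.text, step.requires)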