120 research outputs found

    A Review of Relational Machine Learning for Knowledge Graphs

    Get PDF
    Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project.This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF - 1231216

    On Defining SPARQL with Boolean Tensor Algebra

    Full text link
    The Resource Description Framework (RDF) represents information as subject-predicate-object triples. These triples are commonly interpreted as a directed labelled graph. We propose an alternative approach, interpreting the data as a 3-way Boolean tensor. We show how SPARQL queries - the standard queries for RDF - can be expressed as elementary operations in Boolean algebra, giving us a complete re-interpretation of RDF and SPARQL. We show how the Boolean tensor interpretation allows for new optimizations and analyses of the complexity of SPARQL queries. For example, estimating the size of the results for different join queries becomes much simpler

    Exploiting prior knowledge and latent variable representations for the statistical modeling and probabilistic querying of large knowledge graphs

    Get PDF
    Large knowledge graphs increasingly add great value to various applications that require machines to recognize and understand queries and their semantics, as in search or question answering systems. These applications include Google search, Bing search, IBM’s Watson, but also smart mobile assistants as Apple’s Siri, Google Now or Microsoft’s Cortana. Popular knowledge graphs like DBpedia, YAGO or Freebase store a broad range of facts about the world, to a large extent derived from Wikipedia, currently the biggest web encyclopedia. In addition to these freely accessible open knowledge graphs, commercial ones have also evolved including the well-known Google Knowledge Graph or Microsoft’s Satori. Since incompleteness and veracity of knowledge graphs are known problems, the statistical modeling of knowledge graphs has increasingly gained attention in recent years. Some of the leading approaches are based on latent variable models which show both excellent predictive performance and scalability. Latent variable models learn embedding representations of domain entities and relations (representation learning). From these embeddings, priors for every possible fact in the knowledge graph are generated which can be exploited for data cleansing, completion or as prior knowledge to support triple extraction from unstructured textual data as successfully demonstrated by Google’s Knowledge-Vault project. However, large knowledge graphs impose constraints on the complexity of the latent embeddings learned by these models. For graphs with millions of entities and thousands of relation-types, latent variable models are required to exploit low dimensional embeddings for entities and relation-types to be tractable when applied to these graphs. The work described in this thesis extends the application of latent variable models for large knowledge graphs in three important dimensions. First, it is shown how the integration of ontological constraints on the domain and range of relation-types enables latent variable models to exploit latent embeddings of reduced complexity for modeling large knowledge graphs. The integration of this prior knowledge into the models leads to a substantial increase both in predictive performance and scalability with improvements of up to 77% in link-prediction tasks. Since manually designed domain and range constraints can be absent or fuzzy, we also propose and study an alternative approach based on a local closed-world assumption, which derives domain and range constraints from observed data without the need of prior knowledge extracted from the curated schema of the knowledge graph. We show that such an approach also leads to similar significant improvements in modeling quality. Further, we demonstrate that these two types of domain and range constraints are of general value to latent variable models by integrating and evaluating them on the current state of the art of latent variable models represented by RESCAL, Translational Embedding, and the neural network approach used by the recently proposed Google Knowledge Vault system. In the second part of the thesis it is shown that the just mentioned three approaches all perform well, but do not share many commonalities in the way they model knowledge graphs. These differences can be exploited in ensemble solutions which improve the predictive performance even further. The third part of the thesis concerns the efficient querying of the statistically modeled knowledge graphs. This thesis interprets statistically modeled knowledge graphs as probabilistic databases, where the latent variable models define a probability distribution for triples. From this perspective, link-prediction is equivalent to querying ground triples which is a standard functionality of the latent variable models. For more complex querying that involves e.g. joins and projections, the theory on probabilistic databases provides evaluation rules. In this thesis it is shown how the intrinsic features of latent variable models can be combined with the theory of probabilistic databases to realize efficient probabilistic querying of the modeled graphs

    Scalable statistical learning for relation prediction on structured data

    Get PDF
    Relation prediction seeks to predict unknown but potentially true relations by revealing missing relations in available data, by predicting future events based on historical data, and by making predicted relations retrievable by query. The approach developed in this thesis can be used for a wide variety of purposes, including to predict likely new friends on social networks, attractive points of interest for an individual visiting an unfamiliar city, and associations between genes and particular diseases. In recent years, relation prediction has attracted significant interest in both research and application domains, partially due to the increasing volume of published structured data and background knowledge. In the Linked Open Data initiative of the Semantic Web, for instance, entities are uniquely identified such that the published information can be integrated into applications and services, and the rapid increase in the availability of such structured data creates excellent opportunities as well as challenges for relation prediction. This thesis focuses on the prediction of potential relations by exploiting regularities in data using statistical relational learning algorithms and applying these methods to relational knowledge bases, in particular in Linked Open Data in particular. We review representative statistical relational learning approaches, e.g., Inductive Logic Programming and Probabilistic Relational Models. While logic-based reasoning can infer and include new relations via deduction by using ontologies, machine learning can be exploited to predict new relations (with some degree of certainty) via induction, purely based on the data. Because the application of machine learning approaches to relation prediction usually requires handling large datasets, we also discuss the scalability of machine learning as a solution to relation prediction, as well as the significant challenge posed by incomplete relational data (such as social network data, which is often much more extensive for some users than others). The main contribution of this thesis is to develop a learning framework called the Statistical Unit Node Set (SUNS) and to propose a multivariate prediction approach used in the framework. We argue that multivariate prediction approaches are most suitable for dealing with large, sparse data matrices. According to the characteristics and intended application of the data, the approach can be extended in different ways. We discuss and test two extensions of the approach--kernelization and a probabilistic method of handling complex n-ary relationships--in empirical studies based on real-world data sets. Additionally, this thesis contributes to the field of relation prediction by applying the SUNS framework to various domains. We focus on three applications: 1. In social network analysis, we present a combined approach of inductive and deductive reasoning for recommending movies to users. 2. In the life sciences, we address the disease gene prioritization problem. 3. In the recommendation system, we describe and investigate the back-end of a mobile app called BOTTARI, which provides personalized location-based recommendations of restaurants.Die Beziehungsvorhersage strebt an, unbekannte aber potenziell wahre Beziehungen vorherzusagen, indem fehlende Relationen in verfĂŒgbaren Daten aufgedeckt, zukĂŒnftige Ereignisse auf der Grundlage historischer Daten prognostiziert und vorhergesagte Relationen durch Anfragen abrufbar gemacht werden. Der in dieser Arbeit entwickelte Ansatz lĂ€sst sich fĂŒr eine Vielzahl von Zwecken einschließlich der Vorhersage wahrscheinlicher neuer Freunde in sozialen Netzen, der Empfehlung attraktiver SehenswĂŒrdigkeiten fĂŒr Touristen in fremden StĂ€dten und der Priorisierung möglicher Assoziationen zwischen Genen und bestimmten Krankheiten, verwenden. In den letzten Jahren hat die Beziehungsvorhersage sowohl in Forschungs- als auch in Anwendungsbereichen eine enorme Aufmerksamkeit erregt, aufgrund des Zuwachses veröffentlichter strukturierter Daten und von Hintergrundwissen. In der Linked Open Data-Initiative des Semantischen Web werden beispielsweise EntitĂ€ten eindeutig identifiziert, sodass die veröffentlichten Informationen in Anwendungen und Dienste integriert werden können. Diese rapide Erhöhung der VerfĂŒgbarkeit strukturierter Daten bietet hervorragende Gelegenheiten sowie Herausforderungen fĂŒr die Beziehungsvorhersage. Diese Arbeit fokussiert sich auf die Vorhersage potenzieller Beziehungen durch Ausnutzung von RegelmĂ€ĂŸigkeiten in Daten unter der Verwendung statistischer relationaler Lernalgorithmen und durch Einsatz dieser Methoden in relationale Wissensbasen, insbesondere in den Linked Open Daten. Wir geben einen Überblick ĂŒber reprĂ€sentative statistische relationale LernansĂ€tze, z.B. die Induktive Logikprogrammierung und Probabilistische Relationale Modelle. WĂ€hrend das logikbasierte Reasoning neue Beziehungen unter der Nutzung von Ontologien ableiten und diese einbeziehen kann, kann maschinelles Lernen neue Beziehungen (mit gewisser Wahrscheinlichkeit) durch Induktion ausschließlich auf der Basis der vorliegenden Daten vorhersagen. Da die Verarbeitung von massiven Datenmengen in der Regel erforderlich ist, wenn maschinelle Lernmethoden in die Beziehungsvorhersage eingesetzt werden, diskutieren wir auch die Skalierbarkeit des maschinellen Lernens sowie die erhebliche Herausforderung, die sich aus unvollstĂ€ndigen relationalen Daten ergibt (z. B. Daten aus sozialen Netzen, die oft fĂŒr manche Benutzer wesentlich umfangreicher sind als fĂŒr Anderen). Der Hauptbeitrag der vorliegenden Arbeit besteht darin, ein Lernframework namens Statistical Unit Node Set (SUNS) zu entwickeln und einen im Framework angewendeten multivariaten PrĂ€diktionsansatz einzubringen. Wir argumentieren, dass multivariate VorhersageansĂ€tze am besten fĂŒr die Bearbeitung von großen und dĂŒnnbesetzten Datenmatrizen geeignet sind. Je nach den Eigenschaften und der beabsichtigten Anwendung der Daten kann der Ansatz auf verschiedene Weise erweitert werden. In empirischen Studien werden zwei Erweiterungen des Ansatzes--ein kernelisierter Ansatz sowie ein probabilistischer Ansatz zur Behandlung komplexer n-stelliger Beziehungen-- diskutiert und auf realen DatensĂ€tzen untersucht. Ein weiterer Beitrag dieser Arbeit ist die Anwendung des SUNS Frameworks auf verschiedene Bereiche. Wir konzentrieren uns auf drei Anwendungen: 1. In der Analyse sozialer Netze stellen wir einen kombinierten Ansatz von induktivem und deduktivem Reasoning vor, um Benutzern Filme zu empfehlen. 2. In den Biowissenschaften befassen wir uns mit dem Problem der Priorisierung von Krankheitsgenen. 3. In den Empfehlungssystemen beschreiben und untersuchen wir das Backend einer mobilen App "BOTTARI", das personalisierte ortsbezogene Empfehlungen von Restaurants bietet

    Scalable statistical learning for relation prediction on structured data

    Get PDF
    Relation prediction seeks to predict unknown but potentially true relations by revealing missing relations in available data, by predicting future events based on historical data, and by making predicted relations retrievable by query. The approach developed in this thesis can be used for a wide variety of purposes, including to predict likely new friends on social networks, attractive points of interest for an individual visiting an unfamiliar city, and associations between genes and particular diseases. In recent years, relation prediction has attracted significant interest in both research and application domains, partially due to the increasing volume of published structured data and background knowledge. In the Linked Open Data initiative of the Semantic Web, for instance, entities are uniquely identified such that the published information can be integrated into applications and services, and the rapid increase in the availability of such structured data creates excellent opportunities as well as challenges for relation prediction. This thesis focuses on the prediction of potential relations by exploiting regularities in data using statistical relational learning algorithms and applying these methods to relational knowledge bases, in particular in Linked Open Data in particular. We review representative statistical relational learning approaches, e.g., Inductive Logic Programming and Probabilistic Relational Models. While logic-based reasoning can infer and include new relations via deduction by using ontologies, machine learning can be exploited to predict new relations (with some degree of certainty) via induction, purely based on the data. Because the application of machine learning approaches to relation prediction usually requires handling large datasets, we also discuss the scalability of machine learning as a solution to relation prediction, as well as the significant challenge posed by incomplete relational data (such as social network data, which is often much more extensive for some users than others). The main contribution of this thesis is to develop a learning framework called the Statistical Unit Node Set (SUNS) and to propose a multivariate prediction approach used in the framework. We argue that multivariate prediction approaches are most suitable for dealing with large, sparse data matrices. According to the characteristics and intended application of the data, the approach can be extended in different ways. We discuss and test two extensions of the approach--kernelization and a probabilistic method of handling complex n-ary relationships--in empirical studies based on real-world data sets. Additionally, this thesis contributes to the field of relation prediction by applying the SUNS framework to various domains. We focus on three applications: 1. In social network analysis, we present a combined approach of inductive and deductive reasoning for recommending movies to users. 2. In the life sciences, we address the disease gene prioritization problem. 3. In the recommendation system, we describe and investigate the back-end of a mobile app called BOTTARI, which provides personalized location-based recommendations of restaurants.Die Beziehungsvorhersage strebt an, unbekannte aber potenziell wahre Beziehungen vorherzusagen, indem fehlende Relationen in verfĂŒgbaren Daten aufgedeckt, zukĂŒnftige Ereignisse auf der Grundlage historischer Daten prognostiziert und vorhergesagte Relationen durch Anfragen abrufbar gemacht werden. Der in dieser Arbeit entwickelte Ansatz lĂ€sst sich fĂŒr eine Vielzahl von Zwecken einschließlich der Vorhersage wahrscheinlicher neuer Freunde in sozialen Netzen, der Empfehlung attraktiver SehenswĂŒrdigkeiten fĂŒr Touristen in fremden StĂ€dten und der Priorisierung möglicher Assoziationen zwischen Genen und bestimmten Krankheiten, verwenden. In den letzten Jahren hat die Beziehungsvorhersage sowohl in Forschungs- als auch in Anwendungsbereichen eine enorme Aufmerksamkeit erregt, aufgrund des Zuwachses veröffentlichter strukturierter Daten und von Hintergrundwissen. In der Linked Open Data-Initiative des Semantischen Web werden beispielsweise EntitĂ€ten eindeutig identifiziert, sodass die veröffentlichten Informationen in Anwendungen und Dienste integriert werden können. Diese rapide Erhöhung der VerfĂŒgbarkeit strukturierter Daten bietet hervorragende Gelegenheiten sowie Herausforderungen fĂŒr die Beziehungsvorhersage. Diese Arbeit fokussiert sich auf die Vorhersage potenzieller Beziehungen durch Ausnutzung von RegelmĂ€ĂŸigkeiten in Daten unter der Verwendung statistischer relationaler Lernalgorithmen und durch Einsatz dieser Methoden in relationale Wissensbasen, insbesondere in den Linked Open Daten. Wir geben einen Überblick ĂŒber reprĂ€sentative statistische relationale LernansĂ€tze, z.B. die Induktive Logikprogrammierung und Probabilistische Relationale Modelle. WĂ€hrend das logikbasierte Reasoning neue Beziehungen unter der Nutzung von Ontologien ableiten und diese einbeziehen kann, kann maschinelles Lernen neue Beziehungen (mit gewisser Wahrscheinlichkeit) durch Induktion ausschließlich auf der Basis der vorliegenden Daten vorhersagen. Da die Verarbeitung von massiven Datenmengen in der Regel erforderlich ist, wenn maschinelle Lernmethoden in die Beziehungsvorhersage eingesetzt werden, diskutieren wir auch die Skalierbarkeit des maschinellen Lernens sowie die erhebliche Herausforderung, die sich aus unvollstĂ€ndigen relationalen Daten ergibt (z. B. Daten aus sozialen Netzen, die oft fĂŒr manche Benutzer wesentlich umfangreicher sind als fĂŒr Anderen). Der Hauptbeitrag der vorliegenden Arbeit besteht darin, ein Lernframework namens Statistical Unit Node Set (SUNS) zu entwickeln und einen im Framework angewendeten multivariaten PrĂ€diktionsansatz einzubringen. Wir argumentieren, dass multivariate VorhersageansĂ€tze am besten fĂŒr die Bearbeitung von großen und dĂŒnnbesetzten Datenmatrizen geeignet sind. Je nach den Eigenschaften und der beabsichtigten Anwendung der Daten kann der Ansatz auf verschiedene Weise erweitert werden. In empirischen Studien werden zwei Erweiterungen des Ansatzes--ein kernelisierter Ansatz sowie ein probabilistischer Ansatz zur Behandlung komplexer n-stelliger Beziehungen-- diskutiert und auf realen DatensĂ€tzen untersucht. Ein weiterer Beitrag dieser Arbeit ist die Anwendung des SUNS Frameworks auf verschiedene Bereiche. Wir konzentrieren uns auf drei Anwendungen: 1. In der Analyse sozialer Netze stellen wir einen kombinierten Ansatz von induktivem und deduktivem Reasoning vor, um Benutzern Filme zu empfehlen. 2. In den Biowissenschaften befassen wir uns mit dem Problem der Priorisierung von Krankheitsgenen. 3. In den Empfehlungssystemen beschreiben und untersuchen wir das Backend einer mobilen App "BOTTARI", das personalisierte ortsbezogene Empfehlungen von Restaurants bietet

    Link prediction in very large directed graphs: Exploiting hierarchical properties in parallel

    Get PDF
    Link prediction is a link mining task that tries to find new edges within a given graph. Among the targets of link prediction there is large directed graphs, which are frequent structures nowadays. The typical sparsity of large graphs demands of high precision predictions in order to obtain usable results. However, the size of those graphs only permits the execution of scalable algorithms. As a trade-off between those two problems we recently proposed a link prediction algorithm for directed graphs that exploits hierarchical properties. The algorithm can be classified as a local score, which entails scalability. Unlike the rest of local scores, our proposal assumes the existence of an underlying model for the data which allows it to produce predictions with a higher precision. We test the validity of its hierarchical assumptions on two clearly hierarchical data sets, one of them based on RDF. Then we test it on a non-hierarchical data set based on Wikipedia to demonstrate its broad applicability. Given the computational complexity of link prediction in very large graphs we also introduce some general recommendations useful to make of link prediction an efficiently parallelized problem.Peer ReviewedPostprint (published version

    Modeling Relational Data with Graph Convolutional Networks

    Get PDF
    Knowledge graphs enable a wide variety of applications, including question answering and information retrieval. Despite the great effort invested in their creation and maintenance, even the largest (e.g., Yago, DBPedia or Wikidata) remain incomplete. We introduce Relational Graph Convolutional Networks (R-GCNs) and apply them to two standard knowledge base completion tasks: Link prediction (recovery of missing facts, i.e. subject-predicate-object triples) and entity classification (recovery of missing entity attributes). R-GCNs are related to a recent class of neural networks operating on graphs, and are developed specifically to deal with the highly multi-relational data characteristic of realistic knowledge bases. We demonstrate the effectiveness of R-GCNs as a stand-alone model for entity classification. We further show that factorization models for link prediction such as DistMult can be significantly improved by enriching them with an encoder model to accumulate evidence over multiple inference steps in the relational graph, demonstrating a large improvement of 29.8% on FB15k-237 over a decoder-only baseline

    Bias Assessments of Benchmarks for Link Predictions over Knowledge Graphs

    Get PDF
    Link prediction (LP) aims to tackle the challenge of predicting new facts by reasoning over a knowledge graph (KG). Different machine learning architectures have been proposed to solve the task of LP, several of them competing for better performance on a few de-facto benchmarks. The problem of this thesis is the characterization of LP datasets regarding their structural bias properties and their effects on attained performance results. We provide a domain-agnostic framework that assesses the network topology, test leakage bias and sample selection bias in LP datasets. The framework includes SPARQL queries that can be reused in the explorative data analysis of KGs for uncovering unusual patterns. We finally apply our framework to characterize 7 common benchmarks used for assessing the task of LP. In conducted experiments, we use a trained TransE model to show how the two bias types affect prediction results. Our analysis shows problematic patterns in most of the benchmark datasets. Especially critical are the findings regarding the state-of-the-art benchmarks FB15k-237, WN18RR and YAGO3-10
    • 

    corecore