
    A Survey on Graph Kernels

    Graph kernels have become an established and widely used technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years. We describe and categorize graph kernels based on properties inherent to their design, such as the nature of their extracted graph features, their method of computation, and their applicability to problems in practice. In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that, on some datasets, simple baselines become competitive after this transformation. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a data-driven categorization of kernels as a result. Finally, based on our experimental results, we derive a practitioner's guide to kernel-based graph classification.
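    The RBF transformation studied in the survey is straightforward to reproduce. The sketch below is a minimal illustration rather than the survey's exact protocol: it applies a Gaussian RBF to the squared distance induced by a precomputed kernel matrix K, using the standard identity d(i, j)^2 = K[i, i] + K[j, j] - 2 * K[i, j]; the toy matrix is invented for demonstration.

    ```python
    import numpy as np

    def rbf_from_kernel(K, gamma=1.0):
        """Apply a Gaussian RBF to the metric induced by a (graph) kernel.

        Given a positive semi-definite kernel matrix K, the induced squared
        distance is d(i, j)^2 = K[i, i] + K[j, j] - 2 * K[i, j]; the
        transformed kernel is exp(-gamma * d(i, j)^2).
        """
        diag = np.diag(K)
        sq_dist = diag[:, None] + diag[None, :] - 2.0 * K
        sq_dist = np.maximum(sq_dist, 0.0)  # guard against rounding noise
        return np.exp(-gamma * sq_dist)

    # Example: a toy 3x3 kernel matrix (e.g., from a WL subtree kernel).
    K = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 1.5],
                  [1.0, 1.5, 2.0]])
    print(rbf_from_kernel(K, gamma=0.1))
    ```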

    kLog: A Language for Logical and Relational Learning with Kernels

    We introduce kLog, a novel approach to statistical relational learning. Unlike standard approaches, kLog does not represent a probability distribution directly. It is rather a language to perform kernel-based learning on expressive logical and relational representations. kLog allows users to specify learning problems declaratively. It builds on simple but powerful concepts: learning from interpretations, entity/relationship data modeling, logic programming, and deductive databases. Access by the kernel to the rich representation is mediated by a technique we call graphicalization: the relational representation is first transformed into a graph, in particular a grounded entity/relationship diagram. Subsequently, a choice of graph kernel defines the feature space. kLog supports mixed numerical and symbolic data, as well as background knowledge in the form of Prolog or Datalog programs, as in inductive logic programming systems. The kLog framework can be applied to tackle the same range of tasks that has made statistical relational learning so popular, including classification, regression, multitask learning, and collective classification. We also report empirical comparisons showing that kLog can be either more accurate, or much faster at the same level of accuracy, than Tilde and Alchemy. kLog is GPLv3 licensed and is available at http://klog.dinfo.unifi.it along with tutorials.
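    The graphicalization step is the conceptual core here. The following Python sketch illustrates the idea only, not kLog's actual Prolog-based implementation: a toy interpretation (invented entities and relations) is grounded into a graph with one node per entity and one node per relationship tuple, after which any graph kernel could be applied.

    ```python
    import networkx as nx

    # Toy interpretation: entities with attributes, plus relationship tuples.
    entities = {
        "alice": {"type": "person", "age": 34},
        "bob":   {"type": "person", "age": 29},
        "acme":  {"type": "company"},
    }
    relations = [
        ("works_for", "alice", "acme"),
        ("works_for", "bob", "acme"),
        ("knows", "alice", "bob"),
    ]

    def graphicalize(entities, relations):
        """Ground an entity/relationship interpretation into a graph:
        one node per entity, one node per relationship tuple, with edges
        linking each relationship node to the entities it involves."""
        g = nx.Graph()
        for name, attrs in entities.items():
            g.add_node(name, **attrs)
        for i, (rel, *args) in enumerate(relations):
            rnode = f"{rel}_{i}"
            g.add_node(rnode, type=rel)
            for arg in args:
                g.add_edge(rnode, arg)
        return g

    g = graphicalize(entities, relations)
    print(g.nodes(data=True))
    ```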

    Information overload in structured data

    Information overload refers to the difficulty of making decisions caused by too much information. In this dissertation, we address the information overload problem in two separate structured domains, namely graphs and text. Graph kernels have been proposed as an efficient and theoretically sound approach to computing graph similarity. They decompose graphs into certain sub-structures, such as subtrees or subgraphs. However, existing graph kernels suffer from a few drawbacks. First, the dimension of the feature space associated with the kernel often grows exponentially as the complexity of the sub-structures increases. One immediate consequence of this behavior is that small, non-informative sub-structures occur more frequently and cause information overload. Second, as the number of features increases, we encounter sparsity: only a few informative sub-structures will co-occur in multiple graphs. In the first part of this dissertation, we propose to tackle the above problems by exploiting the dependency relationships among sub-structures. First, we propose a novel framework that learns latent representations of sub-structures by leveraging recent advancements in deep learning. Second, we propose a general smoothing framework that takes structural similarity into account, inspired by state-of-the-art smoothing techniques used in natural language processing. Both proposed frameworks are applicable to popular graph kernel families and achieve significant performance improvements over state-of-the-art graph kernels. In the second part of this dissertation, we tackle information overload in text. We first focus on Reddit, a popular social news aggregation website, and design a submodular recommender system that tailors a personalized frontpage to individual users. Second, we propose a novel submodular framework to summarize videos for which both transcripts and comments are available. Third, we demonstrate how to apply filtering techniques to select a small subset of informative features from virtual machine logs in order to predict resource usage.
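    To make the feature-growth and sparsity problem concrete, here is a minimal Weisfeiler-Lehman subtree feature extractor, a toy sketch on an invented labeled path graph, not the dissertation's code: each relabeling iteration compounds neighbor labels, so the number of distinct features grows rapidly and deep features rarely recur across graphs.

    ```python
    from collections import Counter

    def wl_features(adj, labels, iterations=2):
        """Count Weisfeiler-Lehman subtree features of a labeled graph.

        adj: dict node -> list of neighbors; labels: dict node -> label.
        Each iteration relabels a node by its own label plus the sorted
        multiset of its neighbors' labels, so feature strings get longer
        and more specific with every round.
        """
        feats = Counter(labels.values())
        for _ in range(iterations):
            labels = {
                v: labels[v] + "|" + ".".join(sorted(labels[u] for u in adj[v]))
                for v in adj
            }
            feats.update(labels.values())
        return feats

    # Toy graph: a path a-b-c with atom-like labels.
    adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    labels = {"a": "C", "b": "O", "c": "C"}
    print(wl_features(adj, labels))
    ```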

    Taxonomy of datasets in graph learning: a data-driven approach to improve GNN benchmarking

    The core research of this thesis, mostly comprising chapter four, has been accepted to the Learning on Graphs (LoG) 2022 conference for a spotlight presentation as a standalone paper, under the title "Taxonomy of Benchmarks in Graph Representation Learning", and is to be published in the Proceedings of Machine Learning Research (PMLR) series. As a main author of the paper, my contributions cover the problem formulation, the design and implementation of our taxonomy framework and experimental pipeline, the collation of our results, and the writing of the article.
    Deep learning on graphs has attained unprecedented levels of success in recent years thanks to Graph Neural Networks (GNNs), specialized neural network architectures that have unequivocally surpassed prior graph learning approaches. GNNs extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNNs with superior performance according to a collection of graph representation learning benchmarks, current benchmarking procedures are insufficient to provide fair and effective evaluations of GNN models.
    Perhaps the most prevalent and at the same time least understood problem with respect to graph benchmarking is "domain coverage": despite the growing number of available graph datasets, most of them do not provide additional insights and, on the contrary, reinforce potentially harmful biases in GNN model development. This problem stems from a lack of understanding of which aspects of a given model are probed by graph datasets. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a "sensitivity profile" that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in the selection and development of adequate graph benchmarks and in the better-informed evaluation of future GNN methods. Finally, our approach and its implementation in the GTaxoGym package (https://github.com/G-Taxonomy-Workgroup/GTaxoGym) are extensible to multiple graph prediction task types and future datasets.
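    A sensitivity profile of the kind described can be sketched in a few lines. Everything below is hypothetical scaffolding, not the GTaxoGym API: train_and_score stands in for a full GNN training run, graphs are plain dicts with "x" (node features) and "edges", and two example perturbations (dropping node features, rewiring edges) yield the vector of performance deltas that forms the profile.

    ```python
    import random

    def drop_node_features(graph):
        """Perturbation: replace all node features with a constant."""
        g = dict(graph)
        g["x"] = [[1.0] for _ in graph["x"]]
        return g

    def shuffle_edges(graph, seed=0):
        """Perturbation: rewire edges at random, preserving edge count."""
        rng = random.Random(seed)
        n = len(graph["x"])
        g = dict(graph)
        g["edges"] = [(rng.randrange(n), rng.randrange(n))
                      for _ in graph["edges"]]
        return g

    def sensitivity_profile(dataset, train_and_score, perturbations):
        """Score drop per perturbation, relative to the unperturbed data.

        `train_and_score` is a stand-in for a full GNN training run that
        returns test accuracy; the vector of deltas is the dataset's
        sensitivity profile.
        """
        base = train_and_score(dataset)
        return {name: base - train_and_score([fn(g) for g in dataset])
                for name, fn in perturbations.items()}

    # profile = sensitivity_profile(dataset, train_and_score,
    #                               {"no_features": drop_node_features,
    #                                "random_edges": shuffle_edges})
    ```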

    Weisfeiler and Leman go Hyperbolic: Learning Distance Preserving Node Representations

    In recent years, graph neural networks (GNNs) have emerged as a promising tool for solving machine learning problems on graphs. Most GNNs are members of the family of message passing neural networks (MPNNs). There is a close connection between these models and the Weisfeiler-Leman (WL) test of isomorphism, an algorithm that can successfully test isomorphism for a broad class of graphs. Recently, much research has focused on measuring the expressive power of GNNs. For instance, it has been shown that standard MPNNs are at most as powerful as WL in terms of distinguishing non-isomorphic graphs. However, these studies have largely ignored the distances between the representations of nodes and graphs, which are of paramount importance for learning tasks. In this paper, we define a distance function between nodes based on the hierarchy produced by the WL algorithm, and we propose a model that learns representations preserving these distances. Since the emerging hierarchy corresponds to a tree, we capitalize on recent advances in hyperbolic neural networks to learn these representations. We empirically evaluate the proposed model on standard node and graph classification datasets, where it achieves competitive performance with state-of-the-art models.
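    The paper's exact distance definition is not spelled out in the abstract, so the sketch below shows one natural instantiation of the idea, an assumption-laden illustration rather than the authors' formula: run WL color refinement, record each node's color history, and make two nodes closer the longer they stay in the same refinement cell (diverging at iteration i gives distance 2^-i, never diverging gives 0).

    ```python
    def wl_colors(adj, iterations=3):
        """WL refinement: return each node's list of colors per iteration."""
        color = {v: 0 for v in adj}
        history = {v: [0] for v in adj}
        for _ in range(iterations):
            sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
                   for v in adj}
            palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
            color = {v: palette[sig[v]] for v in adj}
            for v in adj:
                history[v].append(color[v])
        return history

    def wl_distance(history, u, v):
        """Distance from the WL hierarchy: nodes that share a refinement
        cell for more iterations are closer; diverging at iteration i
        gives distance 2^-i (0 if they never diverge)."""
        for i, (cu, cv) in enumerate(zip(history[u], history[v])):
            if cu != cv:
                return 2.0 ** (-i)
        return 0.0

    # Toy path graph 0-1-2-3: the two endpoints never diverge under WL.
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    h = wl_colors(adj)
    print(wl_distance(h, 0, 3), wl_distance(h, 0, 1))
    ```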