107 research outputs found

    A Survey on Graph Kernels

    Graph kernels have become an established and widely used technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years. We describe and categorize graph kernels based on properties inherent to their design, such as the nature of their extracted graph features, their method of computation, and their applicability to problems in practice. In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that simple baselines become competitive after this transformation on some datasets. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a data-driven categorization of kernels as a result. Finally, based on our experimental results, we derive a practitioner's guide to kernel-based graph classification.
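
    As a minimal sketch of the transformation studied in the survey: given a precomputed (positive semi-definite) graph kernel matrix K, the squared distance it induces between graphs i and j is K[i,i] + K[j,j] - 2K[i,j], and a Gaussian RBF kernel is then applied on top of that metric. The function name and the gamma parameter below are illustrative choices, not part of the survey.

        import numpy as np

        def rbf_from_graph_kernel(K, gamma=1.0):
            """Apply a Gaussian RBF to the metric induced by a (PSD) graph kernel matrix K.

            d(i, j)^2 = K[i, i] + K[j, j] - 2 * K[i, j] is the squared distance in the
            kernel's feature space; the transformed kernel is exp(-gamma * d^2).
            """
            diag = np.diag(K)
            sq_dist = diag[:, None] + diag[None, :] - 2.0 * K
            sq_dist = np.maximum(sq_dist, 0.0)  # guard against small numerical negatives
            return np.exp(-gamma * sq_dist)

        # Usage (assuming scikit-learn is available): pass the transformed matrix
        # to an SVM with a precomputed kernel.
        # from sklearn.svm import SVC
        # clf = SVC(kernel='precomputed').fit(rbf_from_graph_kernel(K_train), y_train)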

    Propagation Kernels

    We introduce propagation kernels, a general graph-kernel framework for efficiently measuring the similarity of structured data. Propagation kernels are based on monitoring how information spreads through a set of given graphs. They leverage early-stage distributions from propagation schemes such as random walks to capture structural information encoded in node labels, attributes, and edge information. This has two benefits. First, off-the-shelf propagation schemes can be used to naturally construct kernels for many graph types, including labeled, partially labeled, unlabeled, directed, and attributed graphs. Second, by leveraging existing efficient and informative propagation schemes, propagation kernels can be considerably faster than state-of-the-art approaches without sacrificing predictive performance. We also show that if the graphs at hand have a regular structure, for instance when modeling image or video data, one can exploit this regularity to scale the kernel computation to large databases of graphs with thousands of nodes. We support our contributions by exhaustive experiments on a number of real-world graphs from a variety of application domains.
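
    A rough sketch of the propagation-kernel idea under simplifying assumptions: node label distributions are diffused over a row-normalised adjacency matrix, nodes are bucketed by their discretised distributions at each iteration, and two graphs are compared by counting matching buckets. The rounding-based bucketing below stands in for the locality-sensitive hashing of the actual method, and all names and parameters are illustrative.

        import numpy as np
        from collections import Counter

        def propagation_kernel_sketch(graphs, t_max=3, bin_width=0.1):
            """Each graph is a pair (A, P0): adjacency matrix A and an n x c matrix P0
            of initial node label distributions. Distributions are propagated for
            t_max steps; at every step, nodes are bucketed by their rounded
            distribution and graphs are compared via the dot product of bucket counts."""
            states = []
            for A, P0 in graphs:
                T = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalised
                P, counts = P0.astype(float), []
                for _ in range(t_max + 1):
                    buckets = [tuple(np.floor(row / bin_width).astype(int)) for row in P]
                    counts.append(Counter(buckets))
                    P = T @ P
                states.append(counts)

            n = len(graphs)
            K = np.zeros((n, n))
            for i in range(n):
                for j in range(i, n):
                    k = sum(sum(ci[b] * cj[b] for b in ci)  # dot product of bucket counts
                            for ci, cj in zip(states[i], states[j]))
                    K[i, j] = K[j, i] = k
            return K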

    Frequent Subgraph Mining via Sampling with Rigorous Guarantees

    Frequent subgraph mining is a fundamental task in the analysis of collections of graphs that aims at finding all the subgraphs that appear with more than a user-specified frequency in the dataset. While several exact approaches have been proposed to solve the task, it remains computationally challenging on large graph datasets due to the complexity of the subgraph isomorphism problem inherent in the task and the huge number of candidate patterns even for fairly small subgraphs. In this thesis, we study two statistical learning measures of complexity, VC-dimension and Rademacher averages, for subgraphs, and derive efficiently computable bounds for both. We then show how such bounds can be applied to devise efficient sampling-based approaches for rigorously approximating the solutions of the frequent subgraph mining problem, providing sample sizes which are much tighter than what would be obtained by a straightforward application of Chernoff and union bounds. We also show that our bounds can be used for true frequent subgraph mining, which requires identifying subgraphs generated with probability above a given threshold using samples from an unknown generative process. Moreover, we carried out an extensive experimental evaluation of our methods on real datasets, which shows that our bounds lead to efficiently computable and high-quality approximations for both applications.
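
    The following sketch only illustrates the sampling idea, using the straightforward Hoeffding-plus-union-bound sample size that the thesis improves upon; the tighter VC-dimension and Rademacher bounds themselves are not reproduced here. The `contains` predicate (a subgraph-isomorphism test) and the parameter names are assumptions for illustration.

        import math
        import random

        def hoeffding_union_sample_size(num_patterns, eps, delta):
            """Baseline sample size from Hoeffding's inequality plus a union bound over
            num_patterns candidate subgraphs, so every frequency estimate is within eps
            with probability at least 1 - delta."""
            return math.ceil(math.log(2.0 * num_patterns / delta) / (2.0 * eps ** 2))

        def approximate_frequent_patterns(dataset, patterns, contains, theta, eps, delta):
            """Estimate pattern frequencies on a uniform sample of the graph dataset and
            keep every pattern whose estimated frequency is at least theta - eps.
            `contains(graph, pattern)` is an assumed subgraph-isomorphism test."""
            m = min(len(dataset), hoeffding_union_sample_size(len(patterns), eps, delta))
            sample = random.sample(dataset, m)
            freq = {p: sum(contains(g, p) for g in sample) / m for p in patterns}
            return {p: f for p, f in freq.items() if f >= theta - eps}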

    Learning with Graphs using Kernels from Propagated Information

    Traditional machine learning approaches are designed to learn from independent vector-valued data points. The assumption that instances are independent, however, is not always true. On the contrary, there are numerous domains where data points are cross-linked, for example social networks, where persons are linked by friendship relations. These relations among data points make traditional machine learning difficult and often insufficient. Furthermore, data points themselves can have complex structure, for example molecules or proteins constructed from various bindings of different atoms. Networked and structured data are naturally represented by graphs, and for learning we aim to exploit their structure to improve upon non-graph-based methods. However, graphs encountered in real-world applications often come with rich additional information. This naturally implies many challenges for representation and learning: node information is likely to be incomplete, leading to partially labeled graphs; information can be aggregated from multiple sources and can therefore be uncertain; or additional information on nodes and edges can be derived from complex sensor measurements, thus being naturally continuous. Although learning with graphs is an active research area, learning with structured data, which essentially models structural similarities of graphs, mostly assumes fully labeled graphs of reasonable size with discrete and certain node and edge information, whereas learning with networked data, which naturally deals with missing information and huge graphs, mostly assumes homophily and disregards structural similarity. To close these gaps, we present a novel paradigm for learning with graphs that exploits the intermediate results of iterative information propagation schemes on graphs. Originally developed for within-network relational and semi-supervised learning, these propagation schemes have two desirable properties: they capture structural information and they can naturally adapt to the aforementioned issues of real-world graph data. Additionally, information propagation can be efficiently realized by random walks, leading to fast, flexible, and scalable feature and kernel computations. Further, by considering intermediate random walk distributions, we can model structural similarity for learning with structured and networked data. We develop several approaches based on this paradigm. In particular, we introduce propagation kernels for learning on the graph level and coinciding walk kernels and Markov logic sets for learning on the node level. Finally, we present two application domains where kernels from propagated information successfully tackle real-world problems.
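
    A minimal sketch of the underlying idea, assuming a partially labelled graph: unlabelled nodes start from a uniform label distribution, labels are propagated by a random walk, and the intermediate distributions are stacked into per-node features whose comparison is the intuition behind kernels from propagated information. The function and its parameters are illustrative, not the thesis' exact formulation.

        import numpy as np

        def propagated_node_features(A, labels, num_classes, t_max=3):
            """Build node features from intermediate random-walk label distributions on
            a partially labelled graph (labels[i] = -1 marks an unlabelled node, which
            starts from a uniform distribution). Returns an n x (num_classes * (t_max + 1))
            matrix whose rows concatenate each node's distribution after 0..t_max steps."""
            n = A.shape[0]
            P = np.full((n, num_classes), 1.0 / num_classes)          # uniform for unlabelled
            for i, y in enumerate(labels):
                if y >= 0:
                    P[i] = np.eye(num_classes)[y]                     # one-hot for labelled
            T = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)   # random-walk transition
            feats = [P.copy()]
            for _ in range(t_max):
                P = T @ P
                feats.append(P.copy())
            return np.hstack(feats)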

    Foundations of population-based SHM, part II : heterogeneous populations – graphs, networks, and communities

    This paper is the second in a series of three which aims to provide a basis for Population-Based Structural Health Monitoring (PBSHM), a new technology that will allow transfer of diagnostic information across a population of structures, augmenting SHM capability beyond that applicable to individual structures. The new PBSHM can potentially allow knowledge about normal operating conditions, damage states, and even physics-based models to be transferred between structures. The first part in this series considered homogeneous populations of nominally-identical structures. The theory is extended in this paper to heterogeneous populations of disparate structures. In order to achieve this aim, the paper introduces an abstract representation of structures based on Irreducible Element (IE) models, which capture essential structural characteristics and are then converted into Attributed Graphs (AGs). The AGs form a complex network of structure models, on which a metric can be used to assess structural similarity; this similarity is a key measure of whether diagnostic information can be successfully transferred. Once a pairwise similarity metric has been established on the network of structures, similar structures are clustered to form communities. Within these communities, it is assumed that a certain level of knowledge transfer is possible. The transfer itself will be accomplished using machine learning methods which will be discussed in the third part of this series. The ideas introduced in this paper can be used to define precise terminology for PBSHM in both the homogeneous and heterogeneous population cases.
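
    As a purely illustrative sketch of the grouping step, assuming a pairwise structural-similarity matrix has already been computed (for instance from matching the attributed graphs of IE models): structures whose similarity exceeds a threshold are linked, and the resulting connected components are taken as candidate communities for knowledge transfer. The paper's actual similarity metric and clustering procedure are more involved than this.

        import numpy as np

        def similarity_communities(S, tau):
            """Given an n x n pairwise similarity matrix S and a threshold tau, link
            structures with S[i, j] >= tau and return the connected components
            (candidate communities) via a small union-find."""
            n = S.shape[0]
            parent = list(range(n))

            def find(x):                           # union-find with path halving
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x

            for i in range(n):
                for j in range(i + 1, n):
                    if S[i, j] >= tau:
                        parent[find(i)] = find(j)

            groups = {}
            for i in range(n):
                groups.setdefault(find(i), []).append(i)
            return list(groups.values())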

    Efficient Methods for Mining Subgraphs in a Single Large Graph

    Large and complex graphs are often used to model the complex relationships among objects in applications across many fields, such as social networks, maps, computer networks, chemical structures, bioinformatics, computer vision, and web analysis. Frequent subgraph mining (FSM) is a vital problem that has attracted numerous researchers in recent years; among existing methods, MNI-based approaches such as the GraMi algorithm are considered state of the art. FSM plays an important role in various tasks, such as data mining, model analysis, and decision support systems. It is defined as finding all subgraphs whose occurrences in the dataset are greater than or equal to a given frequency threshold. In recent applications, such as social networks, the underlying graphs are very large; algorithms for mining frequent subgraphs from a single large graph have therefore been developing rapidly, but all of them have huge search spaces and still need a lot of time and memory. For frequent subgraph mining, this thesis proposes a method to record the support of mined subgraphs, a sorting strategy to reduce the number of generated subgraphs, a parallel processing approach to reduce the mining time, and early pruning of invalid values in the domain to balance the search space. Our experiments on four real datasets (both directed and undirected graphs) showed that the four proposed algorithms achieved better results with respect to the search space, the running time, and the memory requirements, and thus enhanced performance. In addition, closed frequent subgraph mining was also developed; it has many practical applications and is a fundamental premise for many studies. We propose a closed frequent subgraph mining algorithm based on GraMi to find all closed frequent subgraphs in a single large graph, together with two strategies, namely early determination of closed frequent subgraphs and early pruning of non-closed subgraphs, which are used to improve the performance of the proposed algorithm. All our experiments for closed frequent subgraph mining are performed on five real directed/undirected graph datasets, and the results show that the running time as well as the memory requirements of our algorithm are better than those of the GraMi-based algorithm.
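
    A small sketch of the MNI (minimum image based) support that GraMi-style miners use on a single large graph, assuming the embeddings of a candidate pattern have already been enumerated; the data layout below is an illustrative choice.

        def mni_support(embeddings):
            """`embeddings` is an iterable of mappings {pattern_node: graph_node}, one per
            occurrence of the pattern in the single large graph. The MNI support is the
            smallest number of distinct graph nodes that any single pattern node is
            mapped to, which keeps the support measure anti-monotone."""
            images = {}
            for emb in embeddings:
                for p_node, g_node in emb.items():
                    images.setdefault(p_node, set()).add(g_node)
            return min((len(s) for s in images.values()), default=0)

        # A pattern is frequent when mni_support(embeddings) >= min_support_threshold.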

    Incremental communication patterns in online social groups

    In recent decades, temporal networks have played a key role in modelling, understanding, and analysing the properties of dynamic systems where individuals and events vary in time. Of paramount importance is the representation and analysis of Social Media, in particular Social Networks and Online Communities, through temporal networks, due to their intrinsic dynamism (social ties, online/offline status, users' interactions, etc.). The identification of recurrent patterns in Online Communities, and specifically in Online Social Groups, is an important challenge which can reveal information concerning the structure of the social network, but also patterns of interaction, trending topics, and so on. Different works have already investigated pattern detection in several scenarios, focusing mainly on identifying the occurrences of fixed and well-known motifs (mostly triads) or more flexible subgraphs. In this paper, we present the concept of Incremental Communication Patterns, which lie in between motifs, from which they inherit the meaningfulness of the identified structure, and subgraphs, from which they inherit the possibility of being extended as needed. We formally define Incremental Communication Patterns and exploit them to investigate the interaction patterns occurring in a real dataset consisting of 17 Online Social Groups taken from the list of Facebook groups. The results of our experimental analysis uncover interesting aspects of the interaction patterns occurring in social groups and reveal that Incremental Communication Patterns are able to capture the roles of users within the groups.
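
    A purely illustrative sketch of an incrementally growing interaction pattern, not the paper's formal definition of Incremental Communication Patterns: interaction events are grouped into time windows and, starting from a seed edge, each window adds only the edges that touch nodes already in the pattern. Names and parameters are assumptions.

        from collections import defaultdict

        def incremental_pattern(interactions, window, seed_edge):
            """`interactions` is a list of (u, v, t) events. Events are bucketed into
            windows of length `window`; starting from `seed_edge`, each window adds only
            edges incident to nodes already in the pattern, so the pattern grows step by
            step. Returns, per window, the edges the pattern gained."""
            buckets = defaultdict(set)
            for u, v, t in interactions:
                buckets[int(t // window)].add((u, v))

            nodes = set(seed_edge)
            edges = {tuple(sorted(seed_edge))}
            growth = []
            for w in sorted(buckets):
                added = {tuple(sorted(e)) for e in buckets[w]
                         if e[0] in nodes or e[1] in nodes}
                added -= edges
                edges |= added
                for u, v in added:
                    nodes.update((u, v))
                growth.append((w, sorted(added)))   # what the pattern gained in window w
            return growth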