44 research outputs found

    Hierarchical stochastic graphlet embedding for graph-based pattern recognition

    This is the final version. Available on open access from Springer via the DOI in this record. Despite being very successful within the pattern recognition and machine learning community, graph-based methods are often unusable with many machine learning tools. This is because most standard mathematical operations are not directly applicable in the graph domain. Graph embedding has been proposed as a way to tackle these difficulties: it maps graphs to a vector space and makes standard machine learning techniques applicable to them. However, it is well known that graph embedding techniques usually suffer from the loss of structural information. In this paper, given a graph, we consider its hierarchical structure for mapping it into a vector space. The hierarchical structure is constructed by topologically clustering the graph nodes and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure of the graph is constructed, we consider various configurations of its parts and use stochastic graphlet embedding (SGE) to map them into a vector space. Broadly speaking, SGE produces a distribution of uniformly sampled low to high order graphlets as a way to embed graphs into the vector space. The coarse-to-fine structure of a graph hierarchy and the statistics obtained from the distribution of low to high order stochastic graphlets complement each other and include important structural information with varied contexts. Together, these two techniques substantially mitigate the usual information loss involved in graph embedding techniques, yielding a more robust vector space embedding of graphs. This is corroborated by a detailed experimental evaluation on various benchmark graph datasets, where we outperform the state-of-the-art methods. European Union Horizon 2020; Ministerio de Educación, Cultura y Deporte, Spain; Generalitat de Cataluny
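
    The idea of embedding a graph through a distribution of sampled graphlets can be illustrated with a minimal sketch. It is not the authors' SGE implementation: the random-expansion sampler below is not guaranteed to be uniform, the (node count, edge count) signature is a crude stand-in for a proper graphlet identifier, and the names `sample_graphlet`, `graphlet_histogram` and the toy graph are hypothetical.

```python
import random
from collections import Counter

def sample_graphlet(adj, order, rng):
    """Grow one connected node set with up to `order` nodes by random expansion."""
    start = rng.choice(sorted(adj))
    nodes = {start}
    while len(nodes) < order:
        # Candidate nodes adjacent to the current set but not yet in it.
        frontier = sorted({n for v in nodes for n in adj[v]} - nodes)
        if not frontier:
            break
        nodes.add(rng.choice(frontier))
    return nodes

def graphlet_histogram(adj, max_order=4, samples=200, seed=0):
    """Histogram of sampled graphlets keyed by (node count, edge count)."""
    rng = random.Random(seed)
    hist = Counter()
    for _ in range(samples):
        order = rng.randint(2, max_order)  # mix low and high orders
        nodes = sample_graphlet(adj, order, rng)
        # Each undirected edge inside the node set is seen from both ends.
        edges = sum(1 for v in nodes for n in adj[v] if n in nodes) // 2
        hist[(len(nodes), edges)] += 1
    return hist

# Toy undirected graph (assumption): a 5-cycle with one chord between 0 and 2.
toy_graph = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
histogram = graphlet_histogram(toy_graph)
```

    Normalising the histogram counts by the number of samples gives a fixed-length vector, which is the sense in which a graphlet distribution acts as an embedding.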

    Interpretable Neural Architecture Search via Bayesian Optimisation with Weisfeiler-Lehman Kernels

    Current neural architecture search (NAS) strategies focus only on finding a single good architecture. They offer little insight into why a specific network performs well, or how the architecture should be modified for further improvements. We propose a Bayesian optimisation (BO) approach for NAS that combines the Weisfeiler-Lehman graph kernel with a Gaussian process surrogate. Our method optimises the architecture in a highly data-efficient manner: it is capable of capturing the topological structures of the architectures and is scalable to large graphs, thus making the high-dimensional and graph-like search spaces amenable to BO. More importantly, our method affords interpretability by discovering useful network features and their corresponding impact on the network performance. Indeed, we demonstrate empirically that our surrogate model is capable of identifying useful motifs which can guide the generation of new architectures. We finally show that our method outperforms existing NAS approaches to achieve the state of the art on both closed- and open-domain search spaces. Comment: ICLR 2021. 9 pages, 5 figures, 1 table (23 pages, 14 figures and 3 tables including references and appendices)
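
    A Weisfeiler-Lehman subtree kernel of the kind mentioned above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: it treats architectures as small undirected labelled graphs (an assumption; NAS cells are usually directed), omits the Gaussian process surrogate entirely, and the names `wl_features` and `wl_kernel` are hypothetical.

```python
from collections import Counter

def wl_features(adj, labels, iterations=1):
    """Count Weisfeiler-Lehman subtree labels for one graph.

    adj: dict mapping node -> list of neighbouring nodes.
    labels: dict mapping node -> initial label (e.g. an operation name).
    """
    current = {node: str(label) for node, label in labels.items()}
    counts = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node in adj:
            # New label = own label plus the sorted multiset of neighbour labels.
            neighbour_labels = sorted(current[n] for n in adj[node])
            updated[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = updated
        counts.update(current.values())
    return counts

def wl_kernel(features_a, features_b):
    """Inner product of two WL feature-count vectors."""
    shared = features_a.keys() & features_b.keys()
    return sum(features_a[key] * features_b[key] for key in shared)

# Two toy three-node "architectures" on the same path 0 - 1 - 2 (assumption).
path = {0: [1], 1: [0, 2], 2: [1]}
arch_a = wl_features(path, {0: "conv", 1: "pool", 2: "conv"})
arch_b = wl_features(path, {0: "conv", 1: "conv", 2: "pool"})
similarity = wl_kernel(arch_a, arch_b)
```

    Each feature key is a readable subtree pattern such as `conv(pool)`, which hints at how such a kernel can expose motifs associated with performance.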

    On the pursuit of Graph Embedding Strategies for Individual Mobility Networks

    An Individual Mobility Network (IMN) is a graph representation of the mobility history of an individual that highlights the relevant locations visited (nodes of the graph) and the movements across them (edges), also providing a rich set of annotations of both nodes and edges. Extracting representative features from an IMN has proven to be a valuable task for enabling various learning applications. However, it is also a demanding operation that does not guarantee the inclusion of all important aspects from the human perspective. A vast recent literature on graph embedding goes in a similar direction, yet typically aims at general-purpose methods that might not suit specific contexts. In this paper, we discuss the existing approaches to graph embedding and the specificities of IMNs, with the aim of finding the best-matching solutions. We experiment with representative algorithms and study the results in relation to IMN characteristics. Tests are performed on a large dataset of real vehicle trajectories.

    Unsupervised Structural Embedding Methods for Efficient Collective Network Mining

    How can we align accounts of the same user across social networks? Can we identify the professional role of an email user from their patterns of communication? Can we predict the medical effects of chemical compounds from their atomic network structure? Many problems in graph data mining, including all of the above, are defined on multiple networks. The central element to all of these problems is cross-network comparison, whether at the level of individual nodes or entities in the network or at the level of entire networks themselves. To perform this comparison meaningfully, we must describe the entities in each network expressively in terms of patterns that generalize across the networks. Moreover, because the networks in question are often very large, our techniques must be computationally efficient. In this thesis, we propose scalable unsupervised methods that embed nodes in vector space by mapping nodes with similar structural roles in their respective networks, even if they come from different networks, to similar parts of the embedding space. We perform network alignment by matching nodes across two or more networks based on the similarity of their embeddings, and refine this process by reinforcing the consistency of each node’s alignment with those of its neighbors. By characterizing the distribution of node embeddings in a graph, we develop graph-level feature vectors that are highly effective for graph classification. With principled sparsification and randomized approximation techniques, we make all our methods computationally efficient and able to scale to graphs with millions of nodes or edges. We demonstrate the effectiveness of structural node embeddings on industry-scale applications, and propose an extensive set of embedding evaluation techniques that lay the groundwork for further methodological development and application. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162895/1/mheimann_1.pd

    Exploiting semantic web knowledge graphs in data mining

    Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many information systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real-world entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most real-world domains. In this thesis, we investigate the hypothesis that Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process and in different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks. Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time-consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high-dimensional feature vectors even after dimensionality reduction, and that the reusability of such feature vectors across different datasets is limited. In Part II, we propose an approach that circumvents the shortcomings of the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low-dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower-dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representations of entities are highly relevant for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks. In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs, besides the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. More precisely, we show that Semantic Web graphs can be exploited for analyzing statistics, building recommender systems, entity and document modeling, and taxonomy induction. In addition, we focus on semantic annotations in HTML pages, which are another realization of the Semantic Web vision. Semantic annotations are integrated into the code of HTML pages using markup languages, like Microformats, RDFa, and Microdata. While such data covers various domains and topics, and can be useful for developing various data mining applications, additional steps of cleaning and integrating the data need to be performed. In this thesis, we describe a set of approaches for processing long literals and images extracted from semantic annotations in HTML pages. We showcase the approaches in the e-commerce domain. Such approaches contribute to building and consuming Semantic Web knowledge graphs.

    Complex networks in audit: A data-driven modelling approach

    In this thesis, we introduce data-driven audit methods using a network-based approach. Using data from over 300 companies, we transform transaction data into a network format, providing auditors with a clear overview of a company's financial structure. Chapter 2 details the financial statements network, designed for straightforward interpretation by auditors. This network effectively represents the company's financial structure, aiding in developing universal data-driven audit methods. Chapter 3's analysis reveals that the degree distribution of the financial account nodes is typically heavy-tailed. Moreover, we found only minor variations in network statistics across industries. These findings help establish baseline expectations for network statistics, facilitating risk assessment. Chapter 4 addresses the complexity of these networks, proposing a method to simplify them into a more understandable high-level structure for auditors. Chapter 5 explores a similarity measure to compare financial structures, helping auditors identify deviations in a client's financial network compared to peers or historical data. Deviations could signal increased audit risks. In summary, we pioneer data-driven audit methods using financial statement networks, providing new insights and tools for auditors and paving the way for more efficient and effective audit processes.
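
    Turning transaction data into a network and inspecting the degree distribution of account nodes can be sketched as follows. The abstract does not describe how the thesis builds its financial statements network, so this sketch assumes a simple account-to-account graph; the journal entries, account names and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical journal entries: (debited account, credited account, amount).
entries = [
    ("Cash", "Revenue", 120.0),
    ("Receivables", "Revenue", 80.0),
    ("Cash", "Receivables", 80.0),
    ("Expenses", "Cash", 50.0),
    ("Inventory", "Cash", 30.0),
]

def account_degrees(journal):
    """Number of distinct counterpart accounts for each account."""
    neighbours = {}
    for debit, credit, _amount in journal:
        neighbours.setdefault(debit, set()).add(credit)
        neighbours.setdefault(credit, set()).add(debit)
    return {account: len(linked) for account, linked in neighbours.items()}

def degree_distribution(degrees):
    """Map each degree value to the number of accounts having that degree."""
    return Counter(degrees.values())

degrees = account_degrees(entries)
distribution = degree_distribution(degrees)
```

    In this toy data a single hub account (Cash) touches most others while most accounts have one or two links, which is the qualitative shape a heavy-tailed degree distribution takes.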

    Bayesian optimisation for automated machine learning

    In this thesis, we develop a rich family of efficient and performant Bayesian optimisation (BO) methods to tackle various AutoML tasks. We first introduce a fast information-theoretic BO method, FITBO, that overcomes the computation bottleneck of information-theoretic acquisition functions while maintaining their competitiveness on the noisy optimisation problems frequently encountered in AutoML. We then improve on the idea of local penalisation and develop an asynchronous batch BO solution, PLAyBOOK, to enable more efficient use of parallel computing resources when evaluation runtime varies across configurations. Because many practical AutoML problems involve a mixture of multiple continuous and multiple categorical variables, we propose a new framework, named Continuous and Categorical BO (CoCaBO), to handle such mixed-type input spaces. CoCaBO merges the strengths of multi-armed bandits on categorical inputs and those of BO on continuous spaces, and uses a tailored kernel to permit information sharing across different categorical variables. We also extend CoCaBO by harnessing the concept of a local trust region to achieve competitive performance on high-dimensional optimisation problems with mixed input types. Beyond hyper-parameter tuning, we also investigate the novel use of BO on two important AutoML applications: black-box adversarial attack and neural architecture search. For the former (adversarial attack), we introduce the first BO-based attacks on image and graph classifiers; by actively querying the unknown victim classifier, our BO attacks can successfully find adversarial perturbations with many fewer attempts than competing baselines. They can thus serve as efficient tools for assessing the robustness of models suggested by AutoML. For the latter (neural architecture search), we leverage the Weisfeiler-Lehman graph kernel to empower our BO search strategy, NAS-BOWL, to naturally handle the directed acyclic graph representation of architectures. Besides achieving superior query efficiency, NAS-BOWL also returns interpretable sub-features that help explain the architecture performance, thus marking the first step towards interpretable neural architecture search. Finally, we examine the most computation-intensive step in the AutoML pipeline: generalisation performance evaluation for a new configuration. We propose a cheap yet reliable test performance estimator based on a simple measure of training speed. It consistently outperforms various existing estimators on a wide range of architecture search spaces and can be easily incorporated into different search strategies, including BO, to improve the cost efficiency.