
    Estimating Graphlet Statistics via Lifting

    Exploratory analysis over network data is often limited by the ability to efficiently calculate graph statistics, which can provide a model-free understanding of the macroscopic properties of a network. We introduce a framework for estimating the graphlet count, the number of occurrences of a small subgraph motif (e.g. a wedge or a triangle) in the network. For massive graphs, where accessing the whole graph is not possible, the only viable algorithms are those that make a limited number of vertex neighborhood queries. We introduce a Monte Carlo sampling technique for graphlet counts, called Lifting, which can simultaneously sample all graphlets of size up to k vertices for arbitrary k. This is the first graphlet sampling method that can provably sample every graphlet with positive probability and can sample graphlets of arbitrary size k. We outline variants of lifted graphlet counts, including the ordered, unordered, and shotgun estimators, random walk starts, and parallel vertex starts. We prove that our graphlet count updates are unbiased for the true graphlet count and have a controlled variance for all graphlets. We compare the experimental performance of lifted graphlet counts to state-of-the-art graphlet sampling procedures: Waddling and the pairwise subgraph random walk.
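As a concrete illustration of unbiased Monte Carlo graphlet counting (a simpler classical edge-sampling estimator, not the paper's Lifting procedure), one can estimate the triangle count by sampling edges uniformly: summing common-neighbour counts over all edges counts each triangle exactly three times, so rescaling a sample mean gives an unbiased estimate. A minimal sketch in Python, with illustrative names:

```python
import random

def triangle_estimate(adj, edges, samples=1000, seed=0):
    """Unbiased triangle-count estimate via uniform edge sampling.

    Summing |N(u) & N(v)| over all edges counts each triangle exactly
    three times (once per edge), so m * E[common neighbours] / 3 is an
    unbiased estimate of the triangle count.
    """
    rng = random.Random(seed)
    m = len(edges)
    total = 0
    for _ in range(samples):
        u, v = rng.choice(edges)
        total += len(adj[u] & adj[v])  # common neighbours close a triangle
    return m * (total / samples) / 3.0

# Toy check: the complete graph K4 contains exactly 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
adj = {i: set() for i in range(4)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
print(triangle_estimate(adj, edges))  # -> 4.0 (exact here: every edge sees 2)
```

On K4 every edge's endpoints share exactly two neighbours, so the estimate is exact regardless of the number of samples; on irregular graphs the variance depends on how common-neighbour counts spread across edges.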

    Fast and Perfect Sampling of Subgraphs and Polymer Systems

    We give an efficient perfect sampling algorithm for weighted, connected induced subgraphs (or graphlets) of rooted, bounded-degree graphs. Our algorithm utilizes a vertex-percolation process with a carefully chosen rejection filter and works under a percolation subcriticality condition. We show that this condition is optimal in the sense that the task of (approximately) sampling weighted rooted graphlets becomes impossible in finite expected time for infinite graphs and intractable for finite graphs when the condition does not hold. We apply our sampling algorithm as a subroutine to give near linear-time perfect sampling algorithms for polymer models and weighted non-rooted graphlets in finite graphs, two widely studied yet very different problems. This new perfect sampling algorithm for polymer models gives improved sampling algorithms for spin systems at low temperatures on expander graphs and unbalanced bipartite graphs, among other applications.
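The percolation-plus-rejection skeleton described above can be sketched as follows. This is a hedged illustration of the general idea (grow the root's percolation cluster, then reject unless it has the target size), not the paper's carefully weighted rejection filter, and all names are illustrative:

```python
import random

def rooted_graphlet_sample(adj, root, k, p=0.5, rng=None):
    """Percolation proposal with rejection: keep each non-root vertex
    independently with probability p, take the connected component of
    the root in the induced subgraph, and accept only components of
    size exactly k. (Sketch only: the paper's algorithm uses a
    carefully chosen rejection filter to obtain the right weights.)"""
    rng = rng or random.Random()
    while True:
        kept = {root} | {v for v in adj if v != root and rng.random() < p}
        comp, frontier = {root}, [root]
        while frontier:                      # search inside the kept set
            u = frontier.pop()
            for w in adj[u]:
                if w in kept and w not in comp:
                    comp.add(w)
                    frontier.append(w)
        if len(comp) == k:
            return frozenset(comp)

# Toy usage: on the path 0-1-2-3 rooted at 0, the only connected
# 3-vertex set containing the root is {0, 1, 2}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(rooted_graphlet_sample(path, 0, 3, rng=random.Random(1)))
```

The subcriticality condition in the abstract is what guarantees the rejection loop terminates quickly in expectation; the sketch simply retries until acceptance.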

    Graphlet based network analysis

    The majority of existing works on network analysis study properties related to the global topology of a network. Examples of such properties include the diameter, power-law exponent, and spectra of graph Laplacians. Such works enhance our understanding of real-life networks, or enable us to generate synthetic graphs with real-life graph properties. However, many of the existing problems on networks require the study of local topological structures of a network. Graphlets, which are small induced subgraphs, capture the local topological structure of a network effectively, and in recent years they have become increasingly popular for characterizing large networks. Graphlet-based network analysis can vary based on the types of topological structures considered and the kinds of analysis tasks. For example, one of the most popular and earliest graphlet analyses is based on triples (triangles or paths of length two). Graphlet analysis based on cycles and cliques has also been explored in several recent works. Another, more comprehensive class of graphlet analysis methods works with graphlets of specific sizes—graphlets with three, four or five nodes ({3, 4, 5}-Graphlets) are particularly popular. For all the above analysis tasks, excessive computational cost is a major challenge, which becomes severe when analyzing large networks with millions of vertices. To overcome this challenge, effective methodologies are urgently needed. Furthermore, the existence of efficient methods for graphlet analysis will encourage further work broadening the scope of graphlet analysis. For graphlet counting, we propose edge-iteration-based methods (ExactTC and ExactGC) for efficiently computing triple and graphlet counts. The proposed methods compute local graphlet statistics in the neighborhood of each edge in the network and then aggregate the local statistics to give a global characterization (transitivity, graphlet frequency distribution (GFD), etc.) of the network.
    Scalability of the proposed methods is further improved by iterating over a sampled set of edges and estimating the triangle count (ApproxTC) and graphlet count (Graft) by approximate rescaling of the aggregated statistics. The independence of local feature vector construction for each edge makes the methods embarrassingly parallelizable; we demonstrate this with a parallel edge-iteration method, ParApproxTC, for triangle counting. For graphlet sampling, we propose Markov Chain Monte Carlo (MCMC) sampling based methods for triple and graphlet analysis. The proposed triple analysis methods, Vertex-MCMC and Triple-MCMC, estimate the triangle count and network transitivity. Vertex-MCMC samples triples in two steps: first, the method selects a node (using the MCMC method) with probability proportional to the number of triples of which the node is a center; then it samples uniformly from the triples centered at the selected node. The method Triple-MCMC samples triples by performing an MCMC walk in a triple sample space, which consists of all possible triples in the network. The MCMC method performs triple sampling by walking from one triple to one of its neighboring triples in the triple space. We design the triple space in such a way that two triples are neighbors only if they share exactly two nodes. The proposed triple sampling algorithms Vertex-MCMC and Triple-MCMC are able to sample triples from any arbitrary distribution, as long as the weight of each triple is locally computable. The proposed methods are able to sample triples without knowledge of the complete network structure: information about only the local neighborhood of the currently observed node or triple is enough to walk to the next node or triple. This gives the proposed methods a significant advantage: the capability to sample triples from networks that have restricted access, on which a direct sampling based method is simply not applicable.
    The proposed methods are also suitable for dynamic and large networks. Similar in concept to Triple-MCMC, we propose Guise for sampling graphlets of sizes three, four and five ({3, 4, 5}-Graphlets). Guise samples graphlets by performing an MCMC walk on a graphlet sample space containing all the graphlets of sizes three, four and five in the network. Despite the proven utility of graphlets in static network analysis, work harnessing graphlets for dynamic network analysis is yet to emerge. Dynamic networks contain additional time information for their edges. With time, the topological structure of a dynamic network changes—edges can appear, disappear and reappear over time. In this direction, predicting the link state of a network at a future time, given a collection of link states at earlier times, is an important task with many real-life applications; in the existing literature, this task is known as link prediction in dynamic networks. Performing this task is more difficult than its counterpart in static networks because an effective feature representation of node-pair instances is hard to obtain for a dynamic network. We design a novel graphlet-transition-based feature embedding for node-pair instances of a dynamic network. Our proposed method, GraTFEL, uses automatic feature learning on such graphlet transition features to give a low-dimensional feature embedding of unlabeled node-pair instances. The feature learning task is modeled as an optimal coding task where the objective is to minimize the reconstruction error, and GraTFEL solves this optimization task using a gradient descent method. We validate the effectiveness of the learned feature embedding by utilizing it for link prediction in real-life dynamic networks. Specifically, we show that GraTFEL, which uses the extracted feature embedding of graphlet transition events, outperforms existing methods that use well-known link prediction features.
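The MCMC walk over the triple space described above can be sketched as follows. This is a hedged, simplified illustration of a Triple-MCMC/Guise-style chain, not the thesis' exact algorithms: states are connected 3-vertex subgraphs, two states are neighbors when they share exactly two vertices, and a Metropolis-Hastings correction makes the stationary distribution uniform over triples. All names are illustrative:

```python
import random

def is_connected3(adj, triple):
    """Three vertices induce a connected subgraph iff they span >= 2 edges."""
    a, b, c = triple
    return sum(1 for x, y in [(a, b), (a, c), (b, c)] if y in adj[x]) >= 2

def neighbor_triples(adj, triple):
    """All connected triples sharing exactly two vertices with `triple`."""
    out = set()
    t = list(triple)
    for drop in t:
        u, v = [x for x in t if x != drop]
        for c in (adj[u] | adj[v]) - set(t):
            cand = frozenset((u, v, c))
            if is_connected3(adj, cand):
                out.add(cand)
    return out

def mcmc_triples(adj, start, steps, seed=0):
    """Metropolis-Hastings walk on the triple space: propose a uniform
    neighboring triple, accept with min(1, d_cur/d_next), so the
    stationary distribution is uniform over all triples."""
    rng = random.Random(seed)
    cur = frozenset(start)
    seen = {cur}
    for _ in range(steps):
        nb = sorted(neighbor_triples(adj, cur), key=sorted)
        if not nb:
            break
        nxt = rng.choice(nb)
        if rng.random() < min(1.0, len(nb) / len(neighbor_triples(adj, nxt))):
            cur = nxt
        seen.add(cur)
    return seen

# Triangle {0,1,2} with pendant vertex 3 attached to 2: three triples exist.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(sorted(t) for t in mcmc_triples(g, {0, 1, 2}, 300)))
```

Note that each step only inspects the neighborhoods of the current triple's vertices, which is exactly the restricted-access property the abstract emphasizes.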

    Neural function approximation on graphs: shape modelling, graph discrimination & compression

    Graphs serve as a versatile mathematical abstraction of real-world phenomena in numerous scientific disciplines. This thesis belongs to the area of Geometric Deep Learning, a family of learning paradigms that capitalise on the increasing volume of non-Euclidean data to solve real-world tasks in a data-driven manner. In particular, we focus on graph function approximation using neural networks, which lies at the heart of many relevant methods. In the first part of the thesis, we contribute to the understanding and design of Graph Neural Networks (GNNs). Initially, we investigate the problem of learning on signals supported on a fixed graph. We show that treating graph signals as general graph spaces is restrictive and that conventional GNNs have limited expressivity. Instead, we expose a more enlightening perspective by drawing parallels between graph signals and signals on Euclidean grids, such as images and audio. Accordingly, we propose a permutation-sensitive GNN based on an operator analogous to shifts in grids and instantiate it on 3D meshes for shape modelling (Spiral Convolutions). Next, we focus on learning on general graph spaces, and in particular on functions that are invariant to graph isomorphism. We identify a fundamental trade-off between invariance, expressivity and computational complexity, which we address with a symmetry-breaking mechanism based on substructure encodings (Graph Substructure Networks). Substructures are shown to be a powerful tool that provably improves expressivity while controlling computational complexity, and a useful inductive bias in network science and chemistry. In the second part of the thesis, we discuss the problem of graph compression, where we analyse the information-theoretic principles and the connections with graph generative models. We show that another inevitable trade-off surfaces, now between computational complexity and compression quality, due to graph isomorphism.
    We propose a substructure-based dictionary coder, Partition and Code (PnC), with theoretical guarantees that can be adapted to different graph distributions by estimating its parameters from observations. Additionally, contrary to the majority of neural compressors, PnC is parameter- and sample-efficient and is therefore of wide practical relevance. Finally, within this framework, substructures are further illustrated as a decisive archetype for learning problems on graph spaces.
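A minimal sketch of the substructure-encoding idea behind Graph Substructure Networks: augment each node's features with counts of a chosen substructure (here triangles, via diag(A^3)/2), which plain message-passing GNNs provably cannot compute in general. This is an illustrative reconstruction, not the thesis' actual implementation, and all names are hypothetical:

```python
import numpy as np

def substructure_features(A):
    """Augment each node with the number of triangles it lies in,
    computed as diag(A^3)/2 (closed 3-walks from a node, halved for the
    two traversal directions), next to its degree as a baseline."""
    A = np.asarray(A, dtype=float)
    tri = np.diagonal(A @ A @ A) / 2.0   # triangles through each node
    deg = A.sum(axis=1)                  # degree feature
    return np.stack([deg, tri], axis=1)  # shape [n, 2]

# Triangle on {0,1,2} with a pendant node 3 attached to node 2:
# nodes 0-2 each lie in one triangle, node 3 in none.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(substructure_features(A)[:, 1])  # -> [1. 1. 1. 0.]
```

In a GSN-style model these counts would be concatenated to the input node features before message passing, breaking symmetries that the bare topology cannot distinguish.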

    Maximum Likelihood Estimate in Discrete Hierarchical Log-Linear Models

    Hierarchical log-linear models are essential tools for identifying relationships between variables in complex high-dimensional problems. In this thesis we study two problems: the computation and the existence of the maximum likelihood estimate (henceforth abbreviated MLE) in high-dimensional hierarchical log-linear models. When the number of variables is large, computing the MLE of the parameters is a difficult task. A popular approach is to estimate the composite MLE rather than the MLE itself, that is, the value of the parameter that maximizes the product of local conditional likelihoods. A more recent development is to choose the components of the composite likelihood to be local marginal likelihoods. We first show that the estimates obtained from local conditional and marginal likelihoods are identical. Second, we study the asymptotic properties of the composite MLE obtained by averaging the local estimates, under the double asymptotic regime where both the dimension p and the sample size N go to infinity. We compare the rate of convergence to the true parameter of the composite MLE with that of the global MLE under the same conditions. We also examine the asymptotic properties of the composite MLE when p is fixed and N goes to infinity, and thus recover the same asymptotic results for fixed p as those given by Liu in 2012. The existence of the MLE in hierarchical log-linear models has important consequences for statistical inference: estimation, confidence intervals and testing, as we shall see. Determining whether this estimate exists is equivalent to determining whether or not the data belong to the boundary of the marginal polytope of the model. In 2012, Fienberg and Rinaldo gave a linear programming method that determines the smallest face of the marginal polytope containing the data for relatively low-dimensional models. In this thesis, we consider higher-dimensional problems.
    We develop the methodology to obtain outer and inner approximations to the smallest face of the marginal polytope containing the data vector. Outer approximations are obtained by looking at submodels of the original hierarchical model, and inner approximations are obtained by working with larger models.
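To make the existence question concrete, consider the smallest interesting case (a hedged toy instance, not the thesis' general face-finding method): for the independence model on a two-way contingency table, the MLE exists precisely when the data lie in the interior of the marginal polytope, which here reduces to all row and column margins being strictly positive.

```python
import numpy as np

def mle_exists_independence(table):
    """MLE existence check for the independence model on a two-way
    table: the MLE exists iff every row and column margin is strictly
    positive, i.e. the data vector is not on the boundary of the
    marginal polytope. (Toy case only; general hierarchical models
    need linear programming or the approximations discussed above.)"""
    t = np.asarray(table, dtype=float)
    return bool(t.sum(axis=1).min() > 0 and t.sum(axis=0).min() > 0)

print(mle_exists_independence([[3, 1], [2, 4]]))  # -> True  (interior point)
print(mle_exists_independence([[0, 0], [2, 4]]))  # -> False (zero row margin)
```

When a margin is zero, some fitted cell probabilities are forced to zero and the log-linear parameters diverge, which is exactly the boundary phenomenon the abstract describes.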

    Nonparametric clustering for spatio-temporal data

    Clustering algorithms attempt the identification of distinct subgroups within heterogeneous data and are commonly utilised as an exploratory tool. The definition of a cluster is dependent on the relevant dataset and associated constraints; clustering methods seek to determine homogeneous subgroups that each correspond to a distinct set of characteristics. This thesis focuses on the development of spatial clustering algorithms and the methods are motivated by the complexities posed by spatio-temporal data. The examples in this thesis primarily come from spatial structures described in the context of traffic modelling and are based on occupancy observations recorded over time for an urban road network. Levels of occupancy indicate the extent of traffic congestion and the goal is to identify distinct regions of traffic congestion in the urban road network. Spatial clustering for spatio-temporal data is an increasingly important research problem and the challenges posed by such research problems often demand the development of bespoke clustering methods. Many existing clustering algorithms, with a focus on accommodating the underlying spatial structure, do not generate clusters that adequately represent differences in the temporal pattern across the network. This thesis is primarily concerned with developing nonparametric clustering algorithms that seek to identify spatially contiguous clusters and retain underlying temporal patterns. Broadly, this thesis introduces two clustering algorithms that are capable of accommodating spatial and temporal dependencies that are inherent to the dataset. The first is a functional distributional clustering algorithm that is implemented within an agglomerative hierarchical clustering framework as a two-stage process. The method is based on a measure of distance that utilises estimated cumulative distribution functions over the data and this unique distance is both functional and distributional. 
    This notion of distance uses differences in densities, rather than raw recorded observations, to identify distinct clusters in the graph. However, distinct characteristics may not necessarily be identifiable by a density-based distance measure, as defined within the agglomerative hierarchical clustering framework. In this thesis, we also introduce a formal Bayesian clustering approach that enables the researcher to determine spatially contiguous clusters in a data-driven manner. This framework varies from the set of assumptions introduced by the functional distributional clustering algorithm. This flexible Bayesian model employs a binary dependent Chinese restaurant process (binDCRP) to place a prior over the geographical constraints posed by a graph-based network. The binDCRP is a special case of the distance dependent Chinese restaurant process first introduced by Blei and Frazier (2011), modified to account for data that pose spatial constraints. The binDCRP seeks to cluster data such that adjacent or neighbouring regions in a spatial structure are more likely to belong to the same cluster. Because the binDCRP introduces a large number of singletons within the spatial structure, we modify it to enable the researcher to restrict the number of clusters in the graph. It is also reasonable to assume that individual junctions within a cluster are spatially correlated with adjacent junctions, due to the nature of traffic and the spread of congestion. To fully account for spatial correlation within a cluster structure, the model utilises a variant of the conditional auto-regressive (CAR) model. The model also accounts for temporal dependencies using a first-order auto-regressive (AR-1) model.
    In this mean-based flexible Bayesian model, the data are assumed to follow a Gaussian distribution, and we utilise Kronecker product identities within the definition of the spatio-temporal precision matrix to improve computational efficiency. The model uses a Metropolis-within-Gibbs sampler to fully explore all possible partition structures within the network and to infer the relevant parameters of the spatio-temporal precision matrix. The flexible Bayesian method is also applicable to map-based spatial structures, and we describe the model in this context as well. The developed Bayesian model is applied to a simulated spatio-temporal dataset composed of three distinct known clusters, where the differences between clusters are reflected by distinct mean values over time associated with spatial regions. The nature of this mean-based comparison differs from the functional distributional clustering approach, which seeks to identify differences across the distribution. We demonstrate the ability of the Bayesian model to restrict the number of clusters using a simulated data structure with distinctly defined clusters. The sampler is also able to explore potential cluster structures efficiently, which we demonstrate using a simulated spatio-temporal data structure. The performance of this model is illustrated by an application to a dataset over an urban road network that presents traffic as a process varying continuously across space and time. We also apply this model to an areal-unit dataset composed of property prices over a period of time for the county of Avon in England.
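A minimal sketch of the functional distributional idea in the first algorithm: represent each spatial unit by the empirical CDF of its occupancy observations, measure distance between units as an L1 distance between CDFs on a fixed grid, and agglomerate greedily. This is an illustrative stand-in (without the thesis' two-stage structure or spatial-contiguity constraints), and all names and data are hypothetical:

```python
import numpy as np

def ecdf_on_grid(x, grid):
    """Empirical CDF of sample x evaluated on a fixed grid."""
    x = np.sort(np.asarray(x, dtype=float))
    return np.searchsorted(x, grid, side="right") / len(x)

def distributional_distance(x, y, grid):
    """Mean L1 distance between empirical CDFs, a simple stand-in for
    the functional-distributional distance described above."""
    return float(np.abs(ecdf_on_grid(x, grid) - ecdf_on_grid(y, grid)).mean())

def agglomerate(series, grid, k):
    """Plain average-linkage agglomeration down to k clusters (the
    thesis adds spatial-contiguity constraints on top of such a step)."""
    clusters = [[i] for i in range(len(series))]
    def cdist(a, b):
        return np.mean([distributional_distance(series[i], series[j], grid)
                        for i in a for j in b])
    while len(clusters) > k:
        pairs = [(cdist(a, b), ia, ib)
                 for ia, a in enumerate(clusters)
                 for ib, b in enumerate(clusters) if ia < ib]
        _, ia, ib = min(pairs)            # merge the closest pair
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return [sorted(c) for c in clusters]

# Two "congested" sensors (high occupancy) and two "free-flow" sensors.
rng = np.random.default_rng(0)
series = [rng.normal(0.80, 0.05, 200), rng.normal(0.78, 0.05, 200),
          rng.normal(0.20, 0.05, 200), rng.normal(0.22, 0.05, 200)]
grid = np.linspace(0, 1, 101)
print(agglomerate(series, grid, 2))
```

Because the distance compares whole distributions rather than means alone, units with similar averages but different variability would still be separated, which is the property the abstract highlights.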

    36th International Symposium on Theoretical Aspects of Computer Science: STACS 2019, March 13-16, 2019, Berlin, Germany
