6,653 research outputs found

    Gains in Power from Structured Two-Sample Tests of Means on Graphs

    Get PDF
    We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in context of KEGG pathways

    Estimating Graphlet Statistics via Lifting

    Full text link
    Exploratory analysis over network data is often limited by the ability to efficiently calculate graph statistics, which can provide a model-free understanding of the macroscopic properties of a network. We introduce a framework for estimating the graphlet count---the number of occurrences of a small subgraph motif (e.g. a wedge or a triangle) in the network. For massive graphs, where accessing the whole graph is not possible, the only viable algorithms are those that make a limited number of vertex neighborhood queries. We introduce a Monte Carlo sampling technique for graphlet counts, called {\em Lifting}, which can simultaneously sample all graphlets of size up to kk vertices for arbitrary kk. This is the first graphlet sampling method that can provably sample every graphlet with positive probability and can sample graphlets of arbitrary size kk. We outline variants of lifted graphlet counts, including the ordered, unordered, and shotgun estimators, random walk starts, and parallel vertex starts. We prove that our graphlet count updates are unbiased for the true graphlet count and have a controlled variance for all graphlets. We compare the experimental performance of lifted graphlet counts to the state-of-the art graphlet sampling procedures: Waddling and the pairwise subgraph random walk

    Network Sampling: From Static to Streaming Graphs

    Full text link
    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms
    • …
    corecore