Significance-based community detection in weighted networks
Community detection is the process of grouping strongly connected nodes in a
network. Many community detection methods for un-weighted networks have a
theoretical basis in a null model. Communities discovered by these methods
therefore have interpretations in terms of statistical significance. In this
paper, we introduce a null model for weighted networks called the continuous
configuration model. We use the model both as a tool for community detection
and for simulating weighted networks with null nodes. First, we propose a
community extraction algorithm for weighted networks which incorporates
iterative hypothesis testing under the null. We prove a central limit theorem
for edge-weight sums and asymptotic consistency of the algorithm under a
weighted stochastic block model. We then incorporate the algorithm in a
community detection method called CCME. To benchmark the method, we provide a
simulation framework incorporating the null to plant "background" nodes in
weighted networks with communities. We show that the empirical performance of
CCME on these simulations is competitive with existing methods, particularly
when overlapping communities and background nodes are present. To further
validate the method, we present two real-world networks with potential
background nodes and analyze them with CCME, yielding results that reveal
macro-features of the corresponding systems.
Comment: Code and supplemental info available at
http://stats.johnpalowitch.com/ccme. V3 changes: based on lengthy referee
revision process, new theoretical sections added, + major organizational
changes. V2 changes: grant info added, 1 reference added, bibliography
section moved to end, condensed bib line spacing, corrected typo
Testing-Based Community Detection Methods for Complex Networks
Community detection is an exploratory method of grouping strongly connected nodes in a network, in most cases using only the network edge structure as a guide. Using discovered communities for downstream analyses can be crucial for real-world decision-making and inference. Recent approaches to community detection include testing-based community extraction, a process in which communities are refined one-by-one via analysis of graph statistics. However, to date, testing-based extraction methods are tied to the configuration model as a null, which applies only to single-layer, binary graphs. In this thesis, testing-based extraction is generalized to arbitrary network types with a framework called Node-Set Testing (NST). The NST framework defines the broader statistical elements of an approach that uses hypothesis testing to detect communities in complex networks. The NST framework is applied to (i) weighted networks and (ii) bipartite correlation networks, resulting in novel community detection algorithms. In particular, new null models and test statistics are specified to apply iterative hypothesis-testing algorithms on these types of networks. Detailed analyses of the empirical and theoretical properties of the proposed methods are provided. Other chapters in this thesis, while not explicitly involving testing-based algorithms, support the discussion of community detection in heterogeneous networks. One chapter provides a consistency analysis of a significance-based score for community extraction in multilayer networks. In another chapter, preceding the discussion of the NST method for bipartite correlation networks, an application area called eQTL analysis is discussed. In particular, a new model for estimating the effect size and regression correlation of the links in an eQTL network is introduced and studied.
Doctor of Philosophy
Graph Clustering with Graph Neural Networks
Graph Neural Networks (GNNs) have achieved state-of-the-art results on many
graph analysis tasks such as node classification and link prediction. However,
important unsupervised problems on graphs, such as graph clustering, have
proved more resistant to advances in GNNs. In this paper, we study unsupervised
training of GNN pooling in terms of its clustering capabilities.
We start by drawing a connection between graph clustering and graph pooling:
intuitively, a good graph clustering is what one would expect from a GNN
pooling layer. Counterintuitively, we show that this is not true for
state-of-the-art pooling methods, such as MinCut pooling. To address these
deficiencies, we introduce Deep Modularity Networks (DMoN), an unsupervised
pooling method inspired by the modularity measure of clustering quality, and
show how it tackles recovery of the challenging clustering structure of
real-world graphs. In order to clarify the regimes where existing methods fail,
we carefully design a set of experiments on synthetic data which show that DMoN
is able to jointly leverage the signal from the graph structure and node
attributes. Similarly, on real-world data, we show that DMoN produces high
quality clusters which correlate strongly with ground truth labels, achieving
state-of-the-art results.
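
The modularity measure that the abstract above names as the inspiration for DMoN can be illustrated with a short sketch. The code below is not the DMoN objective or the authors' implementation; it only computes the standard Newman modularity of a hard partition on an undirected graph, Q = (1/2m) * sum_ij (A_ij - d_i d_j / 2m) * [c_i = c_j]. The function name `modularity` and the toy graph are assumptions made for illustration.

```python
import numpy as np

def modularity(adj, labels):
    """Newman modularity of a hard partition of an undirected graph.

    adj    : symmetric (n, n) adjacency matrix (binary or weighted)
    labels : length-n sequence of cluster ids
    """
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    degrees = adj.sum(axis=1)            # d_i
    two_m = degrees.sum()                # 2m, twice the total edge weight
    # Expected edge weight between i and j under the configuration-model null.
    expected = np.outer(degrees, degrees) / two_m
    # Indicator that i and j share a cluster.
    same = labels[:, None] == labels[None, :]
    return float(((adj - expected) * same).sum() / two_m)

# Toy graph (hypothetical): two triangles joined by a single edge.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

q_good = modularity(adj, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_bad = modularity(adj, [0, 1, 0, 1, 0, 1])    # clusters cut across triangles
q_one = modularity(adj, [0, 0, 0, 0, 0, 0])    # everything in one cluster
```

On this toy graph the partition that follows the two triangles scores higher than the one that cuts across them, and placing every node in a single cluster gives a modularity of zero.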
Examining the Effects of Degree Distribution and Homophily in Graph Learning Models
Despite a surge in interest in GNN development, homogeneity in benchmarking
datasets still presents a fundamental issue to GNN research. GraphWorld is a
recent solution which uses the Stochastic Block Model (SBM) to generate diverse
populations of synthetic graphs for benchmarking any GNN task. Despite its
success, the SBM imposed fundamental limitations on the kinds of graph
structure GraphWorld could create.
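
As a rough illustration of the kind of generator discussed above, the following sketch samples a plain two-parameter Stochastic Block Model and measures edge homophily. It is a simplified stand-in under stated assumptions, not GraphWorld's generator: the function names `sample_sbm` and `edge_homophily`, the parameters `p_in` and `p_out`, and the block sizes are all hypothetical.

```python
import numpy as np

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected, unweighted SBM graph.

    Nodes in the same block are linked with probability p_in,
    nodes in different blocks with probability p_out.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Draw each unordered pair once (strict upper triangle), then symmetrize.
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    adj = (upper | upper.T).astype(int)
    return adj, labels

def edge_homophily(adj, labels):
    """Fraction of edges whose endpoints share a label."""
    rows, cols = np.nonzero(np.triu(adj, k=1))
    if rows.size == 0:
        return float("nan")
    return float(np.mean(labels[rows] == labels[cols]))

adj, labels = sample_sbm([50, 50], p_in=0.2, p_out=0.02, seed=0)
h = edge_homophily(adj, labels)
```

Raising `p_out` toward `p_in` lowers the homophily of the sampled graphs, but the degree distribution within each block stays close to binomial, which is one example of a structural limitation of a plain SBM.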
In this work we examine how two additional synthetic graph generators can
improve GraphWorld's evaluation: LFR, a well-established model in the graph
clustering literature, and CABAM, a recent adaptation of the Barabasi-Albert
model tailored for GNN benchmarking. By integrating these generators, we
significantly expand the coverage of graph space within the GraphWorld
framework while preserving key graph properties observed in real-world
networks. To demonstrate their effectiveness, we generate 300,000 graphs to
benchmark 11 GNN models on a node classification task. We find GNN performance
variations in response to homophily, degree distribution and feature signal.
Based on these findings, we classify models by their sensitivity to the new
generators under these properties. Additionally, we release the extensions made
to GraphWorld on the GitHub repository, offering further evaluation of GNN
performance on new graphs.
Comment: Accepted to Workshop on Graph Learning Benchmarks at KDD 202
Community Extraction in Multilayer Networks with Heterogeneous Community Structure.
Multilayer networks are a useful way to capture and model multiple binary or weighted relationships among a fixed group of objects. While community detection has proven to be a useful exploratory technique for the analysis of single-layer networks, the development of community detection methods for multilayer networks is still in its infancy. We propose and investigate a procedure, called Multilayer Extraction, that identifies densely connected vertex-layer sets in multilayer networks. Multilayer Extraction makes use of a significance-based score that quantifies the connectivity of an observed vertex-layer set through comparison with a fixed-degree random graph model. Multilayer Extraction directly handles networks with heterogeneous layers where community structure may be different from layer to layer. The procedure can capture overlapping communities, as well as background vertex-layer pairs that do not belong to any community. We establish consistency of the vertex-layer set optimizer of our proposed multilayer score under the multilayer stochastic block model. We investigate the performance of Multilayer Extraction on three applications and a test bed of simulations. Our theoretical and numerical evaluations suggest that Multilayer Extraction is an effective exploratory tool for analyzing complex multilayer networks. Code is publicly available at https://github.com/jdwilson4/MultilayerExtraction
Graph Generative Model for Benchmarking Graph Neural Networks
As the field of Graph Neural Networks (GNN) continues to grow, it experiences
a corresponding increase in the need for large, real-world datasets to train
and test new GNN models on challenging, realistic problems. Unfortunately, such
graph datasets are often generated from online, highly privacy-restricted
ecosystems, which makes research and development on these datasets hard, if not
impossible. This greatly reduces the amount of benchmark graphs available to
researchers, causing the field to rely only on a handful of publicly-available
datasets. To address this problem, we introduce a novel graph generative model,
Computation Graph Transformer (CGT) that learns and reproduces the distribution
of real-world graphs in a privacy-controlled way. More specifically, CGT (1)
generates effective benchmark graphs on which GNNs show similar task
performance as on the source graphs, (2) scales to process large-scale graphs,
(3) incorporates off-the-shelf privacy modules to guarantee end-user privacy of
the generated graph. Extensive experiments across a vast body of graph
generative models show that only our model can successfully generate
privacy-controlled, synthetic substitutes of large-scale real-world graphs that
can be effectively used to benchmark GNN models.