Modeling heterogeneity in random graphs through latent space models: a selective review
We present a selective review of probabilistic modeling of heterogeneity in
random graphs. We focus on latent space models, and more particularly on
stochastic block models and their extensions, which have undergone major
developments in the last five years.
The method of moments and degree distributions for network models
Probability models on graphs are becoming increasingly important in many
applications, but statistical tools for fitting such models are not yet well
developed. Here we propose a general method of moments approach that can be
used to fit a large class of probability models through empirical counts of
certain patterns in a graph. We establish some general asymptotic properties of
empirical graph moments and prove consistency of the estimates as the graph
size grows for all ranges of the average degree.
Additional results are obtained for the important special case of degree
distributions.
Comment: Published at http://dx.doi.org/10.1214/11-AOS904 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
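The method-of-moments idea in the abstract above can be illustrated with a minimal sketch: count small patterns (edges, triangles) in the graph and match them to their model-implied expectations. The function names and the Erdős–Rényi example below are illustrative, not the paper's actual estimators.

```python
import numpy as np

def empirical_graph_moments(A):
    """Empirical counts of small patterns ("graph moments") in an
    undirected graph. A: symmetric 0/1 adjacency matrix, zero diagonal."""
    edges = A.sum() / 2
    triangles = np.trace(A @ A @ A) / 6  # each triangle is counted 6 times
    return edges, triangles

def fit_er_by_moments(A):
    """Moment estimator of p in an Erdos-Renyi G(n, p) model:
    match the observed edge count to its expectation C(n, 2) * p."""
    n = A.shape[0]
    edges, _ = empirical_graph_moments(A)
    return edges / (n * (n - 1) / 2)

# usage: a 4-node graph forming one triangle plus an isolated node
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (0, 2)]:
    A[i, j] = A[j, i] = 1
edges, tris = empirical_graph_moments(A)  # 3 edges, 1 triangle
p_hat = fit_er_by_moments(A)              # 3 / 6 = 0.5
```

The same pattern-counting machinery extends to richer models: one matches several empirical moments (edges, triangles, stars) to their expectations under the model and solves for the parameters.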
Canonical Correlation Analysis And Network Data Modeling: Statistical And Computational Properties
Classical decision theory evaluates an estimator mostly by its statistical properties: either its closeness to the underlying truth or its predictive ability for new observations. The goal is to find estimators that achieve statistical optimality. Modern Big Data applications, however, necessitate efficient processing of large-scale ('big-n-big-p') datasets, which poses a great challenge to the classical decision-theoretic framework, since it seldom takes into account the scalability of estimation procedures. On the one hand, statistically optimal estimators can be computationally intensive; on the other hand, fast estimation procedures might suffer a loss of statistical efficiency. The challenge, then, is to kill two birds with one stone. This thesis brings together statistical and computational perspectives to study canonical correlation analysis (CCA) and network data modeling, investigating both the optimality and the scalability of the estimators. Interestingly, in both cases, we find that iterative estimation procedures based on non-convex optimization can significantly reduce the computational cost while achieving desirable statistical properties.
In the first part of the thesis, motivated by the recent success of using CCA to learn low-dimensional feature representations of high-dimensional objects, we propose novel metrics which quantify the estimation loss of CCA by the excess prediction loss defined through a prediction-after-dimension-reduction framework. These new metrics have rich statistical and geometric interpretations, which suggest viewing CCA estimation as estimating the subspaces spanned by the canonical variates.
We characterize, with minimal assumptions, the non-asymptotic minimax rates under the proposed error metrics, especially how the minimax rates depend on the key quantities including the dimensions, the condition number of the covariance matrices and the canonical correlations. Finally, by formulating sample CCA as a non-convex optimization problem, we propose an efficient (stochastic) first order algorithm which scales to large datasets.
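For reference, sample CCA has a classical closed-form solution via whitening and an SVD; the stochastic first-order algorithm described above targets the same estimand at scale. The sketch below is that small-data reference point, with illustrative names and a small ridge term (`reg`) added for numerical stability, not anything from the thesis itself.

```python
import numpy as np

def sample_cca(X, Y, k=1, reg=1e-8):
    """Classical sample CCA: whiten the two sample covariances, then
    take the top-k singular pairs of the whitened cross-covariance."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # canonical correlations = singular values of the whitened cross-covariance
    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(M)
    return inv_sqrt(Sxx) @ U[:, :k], inv_sqrt(Syy) @ Vt[:k].T, s[:k]

# usage: Y is a linear map of X, so the top canonical correlation is ~1
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y = X @ rng.standard_normal((3, 3))
u, v, corr = sample_cca(X, Y, k=1)
```

The whitening step costs O(p^3), which is exactly what a stochastic first-order method avoids on large datasets.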
In the second part of the thesis, we propose two universal fitting algorithms for networks (possibly with edge covariates) under latent space models: one based on finding the exact maximizer of a convex surrogate of the non-convex likelihood function and the other based on finding an approximate optimizer of the original non-convex objective. Both algorithms are motivated by a special class of inner-product models but are shown to work for a much wider range of latent space models which allow the latent vectors to determine the connection probability of the edges in flexible ways. We derive the statistical rates of convergence of both algorithms and characterize the basin-of-attraction of the non-convex approach. The effectiveness and efficiency of the non-convex procedure are demonstrated by extensive simulations and real-data experiments.
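A minimal sketch of fitting the inner-product model mentioned above, where P(A_ij = 1) = sigmoid(alpha + <z_i, z_j>), by direct gradient ascent on the non-convex log-likelihood. All names, step sizes, and the plain gradient update are illustrative assumptions, not the thesis's actual algorithms.

```python
import numpy as np

def fit_inner_product_model(A, d=2, lr=0.05, iters=200, seed=0):
    """Gradient ascent on the (non-convex) log-likelihood of an
    inner-product latent space model. A: symmetric 0/1 adjacency."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = 0.1 * rng.standard_normal((n, d))  # latent positions
    alpha = 0.0                            # baseline log-odds
    mask = 1 - np.eye(n)                   # ignore self-loops
    for _ in range(iters):
        P = 1 / (1 + np.exp(-(alpha + Z @ Z.T)))  # edge probabilities
        R = (A - P) * mask                 # likelihood residuals
        Z += lr * (R @ Z) / n              # gradient in Z (up to symmetry)
        alpha += lr * R.sum() / (n * n)
    return Z, alpha

# usage: fit a small random graph (illustrative only)
rng = np.random.default_rng(1)
n = 15
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T
Z, alpha = fit_inner_product_model(A, d=2)
```

The basin-of-attraction results in the thesis concern exactly this kind of iterate: the non-convex update converges to a statistically good solution provided the initialization is close enough, which is why a convex surrogate is useful as a warm start.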
Pairwise Covariates-adjusted Block Model for Community Detection
One of the most fundamental problems in network study is community detection.
The stochastic block model (SBM) is a widely used model for network data, for
which various estimation methods have been developed and their community
detection consistency established. However, the SBM is restricted by the strong
assumption that all nodes in the same community are stochastically equivalent,
which may not be suitable for practical applications. We introduce a pairwise
covariates-adjusted stochastic block model (PCABM), a generalization of SBM
that incorporates pairwise covariate information. We study the maximum
likelihood estimates of the coefficients for the covariates as well as the
community assignments. It is shown that both the coefficient estimates of the
covariates and the community assignments are consistent under suitable sparsity
conditions. Spectral clustering with adjustment (SCWA) is introduced to
efficiently solve PCABM. Under certain conditions, we derive the error bound of
community estimation under SCWA and show that it is community detection
consistent. PCABM compares favorably with the SBM or degree-corrected
stochastic block model (DCBM) under a wide range of simulated and real networks
when covariate information is accessible.
Comment: 41 pages, 6 figures.
Likelihood Inference for Large Scale Stochastic Blockmodels with Covariates based on a Divide-and-Conquer Parallelizable Algorithm with Communication
We consider a stochastic blockmodel equipped with node covariate information,
that is helpful in analyzing social network data. The key objective is to
obtain maximum likelihood estimates of the model parameters. For this task, we
devise a fast, scalable Monte Carlo EM type algorithm based on case-control
approximation of the log-likelihood coupled with a subsampling approach. A key
feature of the proposed algorithm is its parallelizability, by processing
portions of the data on several cores, while leveraging communication of key
statistics across the cores during each iteration of the algorithm. The
performance of the algorithm is evaluated on synthetic data sets and compared
with competing methods for blockmodel parameter estimation. We also illustrate
the model on data from a Facebook derived social network enhanced with node
covariate information.
Comment: 28 pages, 4 figures.
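The case-control idea used above can be sketched in a few lines: the full log-likelihood sums over all O(n^2) node pairs, so one keeps every edge (the "cases") and only a random subsample of non-edges (the "controls"), reweighted by the inverse sampling fraction so the approximation is unbiased. The function below is a minimal illustration of that device, not the paper's Monte Carlo EM algorithm; all names are assumptions.

```python
import numpy as np

def case_control_loglik(A, P, m_controls, rng):
    """Case-control approximation of a Bernoulli network log-likelihood.

    A: 0/1 adjacency (upper triangle used); P: matrix of edge
    probabilities under the current parameters; m_controls: number of
    non-edge pairs to subsample.
    """
    iu = np.triu_indices_from(A, k=1)
    a, p = A[iu], P[iu]
    cases = a == 1
    ll = np.sum(np.log(p[cases]))              # every edge is kept
    ctrl_pool = np.flatnonzero(~cases)
    sub = rng.choice(ctrl_pool, size=min(m_controls, len(ctrl_pool)),
                     replace=False)
    w = len(ctrl_pool) / len(sub)              # inverse sampling fraction
    ll += w * np.sum(np.log(1 - p[sub]))       # reweighted non-edge terms
    return ll

# usage: a small random graph with constant edge probability 0.4
rng = np.random.default_rng(0)
n = 8
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T
P = np.full((n, n), 0.4)
ll_cc = case_control_loglik(A, P, m_controls=5, rng=rng)
```

In a parallel setting, each core can evaluate this approximation on its own portion of the pairs and the per-core sums are combined at each EM iteration, which is the communication step the abstract refers to.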