
    Modeling heterogeneity in random graphs through latent space models: a selective review

    We present a selective review of probabilistic modeling of heterogeneity in random graphs. We focus on latent space models, in particular on stochastic block models and their extensions, which have undergone major developments in the last five years.
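    As a concrete reference point for the class of models the review surveys, here is a minimal sketch of sampling from a vanilla stochastic block model; the function name and parameterization are illustrative, not taken from the review.

    import numpy as np

    def sample_sbm(block_sizes, B, seed=None):
        # Sample an undirected graph from a stochastic block model:
        # block_sizes gives the community sizes, and B is the K x K
        # symmetric matrix of within/between-block edge probabilities.
        rng = np.random.default_rng(seed)
        labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
        n = labels.size
        P = B[np.ix_(labels, labels)]             # n x n edge probabilities
        upper = np.triu(rng.random((n, n)) < P, k=1)
        A = (upper | upper.T).astype(int)         # symmetrize, no self-loops
        return A, labels

    # Example: two assortative communities of 50 nodes each.
    A, z = sample_sbm([50, 50], np.array([[0.3, 0.05], [0.05, 0.3]]), seed=0)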

    The method of moments and degree distributions for network models

    Probability models on graphs are becoming increasingly important in many applications, but statistical tools for fitting such models are not yet well developed. Here we propose a general method of moments approach that can be used to fit a large class of probability models through empirical counts of certain patterns in a graph. We establish some general asymptotic properties of empirical graph moments and prove consistency of the estimates as the graph size grows for all ranges of the average degree, including Ω(1). Additional results are obtained for the important special case of degree distributions. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) at http://dx.doi.org/10.1214/11-AOS904 by the Institute of Mathematical Statistics (http://www.imstat.org).
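    To make "empirical counts of certain patterns" concrete, a small sketch of moment counts follows; the particular patterns (edges, two-stars, triangles) are standard choices assumed here for illustration, not a prescription from the paper.

    import numpy as np

    def graph_moments(A):
        # Empirical small-subgraph counts for an undirected adjacency
        # matrix A with no self-loops.
        A = np.asarray(A)
        deg = A.sum(axis=1)
        edges = A.sum() / 2
        two_stars = (deg * (deg - 1)).sum() / 2   # paths of length two
        triangles = np.trace(A @ A @ A) / 6       # trace(A^3) counts each triangle 6 times
        return edges, two_stars, triangles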

    Canonical Correlation Analysis And Network Data Modeling: Statistical And Computational Properties

    Classical decision theory evaluates an estimator mostly by its statistical properties, either the closeness to the underlying truth or the predictive ability for new observations. The goal is to find estimators that achieve statistical optimality. Modern Big Data applications, however, necessitate efficient processing of large-scale ('big-n, big-p') datasets, which poses a great challenge to the classical decision-theoretic framework, which seldom takes the scalability of estimation procedures into account. On the one hand, statistically optimal estimators can be computationally intensive; on the other hand, fast estimation procedures might suffer from a loss of statistical efficiency. So the challenge is to kill two birds with one stone. This thesis brings together statistical and computational perspectives to study canonical correlation analysis (CCA) and network data modeling, investigating both the optimality and the scalability of the estimators. Interestingly, in both cases we find that iterative estimation procedures based on non-convex optimization can significantly reduce the computational cost while achieving desirable statistical properties. In the first part of the thesis, motivated by the recent success of using CCA to learn low-dimensional feature representations of high-dimensional objects, we propose novel metrics that quantify the estimation loss of CCA by the excess prediction loss defined through a prediction-after-dimension-reduction framework. These new metrics have rich statistical and geometric interpretations, which suggest viewing CCA estimation as estimating the subspaces spanned by the canonical variates. We characterize, with minimal assumptions, the non-asymptotic minimax rates under the proposed error metrics, in particular how the minimax rates depend on key quantities including the dimensions, the condition number of the covariance matrices, and the canonical correlations. Finally, by formulating sample CCA as a non-convex optimization problem, we propose an efficient (stochastic) first-order algorithm that scales to large datasets. In the second part of the thesis, we propose two universal fitting algorithms for networks (possibly with edge covariates) under latent space models: one based on finding the exact maximizer of a convex surrogate of the non-convex likelihood function, and the other based on finding an approximate optimizer of the original non-convex objective. Both algorithms are motivated by a special class of inner-product models but are shown to work for a much wider range of latent space models that allow the latent vectors to determine the connection probabilities of the edges in flexible ways. We derive the statistical rates of convergence of both algorithms and characterize the basin of attraction of the non-convex approach. The effectiveness and efficiency of the non-convex procedure are demonstrated by extensive simulations and real-data experiments.
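    For orientation, the classical closed-form computation of sample CCA (whitening plus an SVD) that such scalable iterative methods aim to replace can be sketched as below; this is a generic textbook computation, not the thesis's first-order algorithm, and the function name is illustrative.

    import numpy as np

    def sample_cca(X, Y, k):
        # Classical sample CCA via whitening and an SVD. X (n x p) and
        # Y (n x q) are assumed column-centered; a small ridge term on
        # the covariances would guard against ill-conditioning.
        n = X.shape[0]
        Sxx, Syy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
        Wx = np.linalg.inv(np.linalg.cholesky(Sxx)).T   # whitener for X
        Wy = np.linalg.inv(np.linalg.cholesky(Syy)).T   # whitener for Y
        U, s, Vt = np.linalg.svd(Wx.T @ Sxy @ Wy)
        # Top-k canonical directions and canonical correlations.
        return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]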

    Pairwise Covariates-adjusted Block Model for Community Detection

    One of the most fundamental problems in network analysis is community detection. The stochastic block model (SBM) is a widely used model for network data, for which a variety of estimation methods have been developed along with community detection consistency results. However, the SBM is restricted by the strong assumption that all nodes in the same community are stochastically equivalent, which may not be suitable for practical applications. We introduce a pairwise covariates-adjusted stochastic block model (PCABM), a generalization of the SBM that incorporates pairwise covariate information. We study the maximum likelihood estimates of the coefficients for the covariates as well as the community assignments, and show that both are consistent under suitable sparsity conditions. Spectral clustering with adjustment (SCWA) is introduced to fit PCABM efficiently. Under certain conditions, we derive an error bound on community estimation under SCWA and show that it is community detection consistent. PCABM compares favorably with the SBM and the degree-corrected stochastic block model (DCBM) on a wide range of simulated and real networks when covariate information is accessible. Comment: 41 pages, 6 figures.
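    The adjustment idea can be sketched as follows: divide out an estimated pairwise-covariate effect and run spectral clustering on the adjusted matrix. The multiplicative form exp(beta' x_ij) and the helper names are assumptions for illustration, not the paper's exact specification.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering_with_adjustment(A, X, beta, K):
        # A: n x n adjacency matrix; X: n x n x p array of symmetric
        # pairwise covariates; beta: length-p coefficient estimate.
        adj = A / np.exp(X @ beta)               # remove covariate effects
        vals, vecs = np.linalg.eigh(adj)
        lead = np.argsort(np.abs(vals))[-K:]     # K leading eigenvectors
        return KMeans(n_clusters=K, n_init=10).fit_predict(vecs[:, lead])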

    Likelihood Inference for Large Scale Stochastic Blockmodels with Covariates based on a Divide-and-Conquer Parallelizable Algorithm with Communication

    We consider a stochastic blockmodel equipped with node covariate information, which is helpful in analyzing social network data. The key objective is to obtain maximum likelihood estimates of the model parameters. For this task, we devise a fast, scalable Monte Carlo EM-type algorithm based on a case-control approximation of the log-likelihood coupled with a subsampling approach. A key feature of the proposed algorithm is its parallelizability: portions of the data are processed on several cores, while key statistics are communicated across the cores during each iteration. The performance of the algorithm is evaluated on synthetic data sets and compared with competing methods for blockmodel parameter estimation. We also illustrate the model on data from a Facebook-derived social network enhanced with node covariate information. Comment: 28 pages, 4 figures.
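    A minimal sketch of the case-control idea for a Bernoulli network log-likelihood is given below; the function and array names are illustrative, and the paper's full algorithm additionally handles the Monte Carlo E-step over block assignments and cross-core communication.

    import numpy as np

    def case_control_loglik(A, logp, log1mp, m, seed=None):
        # Approximate the log-likelihood of an undirected network by
        # keeping every edge (the 'cases') and a weighted random
        # subsample of m non-edges (the 'controls'). logp and log1mp
        # hold log p_ij and log(1 - p_ij) under the current parameters.
        rng = np.random.default_rng(seed)
        iu = np.triu_indices_from(A, k=1)        # dyads above the diagonal
        is_edge = A[iu] == 1
        ll_edges = logp[iu][is_edge].sum()
        nonedges = np.flatnonzero(~is_edge)
        pick = rng.choice(nonedges, size=m, replace=False)
        weight = nonedges.size / m               # inverse sampling fraction
        return ll_edges + weight * log1mp[iu][pick].sum()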