659 research outputs found

    Statistical clustering of temporal networks through a dynamic stochastic block model

    Statistical node clustering in discrete-time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model (SBM) for its static part with independent Markov chains for the evolution of the node groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edge values). Our approach, motivated by the importance of controlling for label-switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behavior. We study identifiability of the model parameters, and propose an inference procedure based on a variational expectation-maximization algorithm as well as a model selection criterion to select the number of groups. We carefully discuss our initialization strategy, which plays an important role in the method, and compare our procedure with existing ones on synthetic datasets. We also illustrate our approach on dynamic contact networks: one of encounters among high school students and two others of animal interactions. An implementation of the method is available as an R package called dynsbm.
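    The generative structure described in this abstract (group memberships evolving through independent Markov chains, with edges drawn from an SBM given the groups at each time step) can be sketched as follows for the binary, undirected case only. This is a minimal illustration under stated assumptions, not the dynsbm package's implementation: the function name `simulate_dynamic_sbm` and its parameters `init_probs`, `transition` and `connect` are hypothetical.

```python
import random


def simulate_dynamic_sbm(n_nodes, n_steps, init_probs, transition, connect, seed=0):
    """Simulate group paths and binary undirected graphs (illustrative sketch).

    init_probs[q]    : probability that a node starts in group q
    transition[q][l] : probability of moving from group q to group l
    connect[q][l]    : edge probability between groups q and l (assumed symmetric)
    """
    rng = random.Random(seed)
    labels = list(range(len(init_probs)))

    # Initial memberships, drawn independently for each node.
    groups = [rng.choices(labels, weights=init_probs, k=n_nodes)]

    # Each node follows its own Markov chain, independently of the others.
    for _ in range(1, n_steps):
        previous = groups[-1]
        groups.append([rng.choices(labels, weights=transition[q])[0] for q in previous])

    # Given the groups at time t, edges are independent Bernoulli draws.
    graphs = []
    for z in groups:
        adj = [[0] * n_nodes for _ in range(n_nodes)]
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                if rng.random() < connect[z[i]][z[j]]:
                    adj[i][j] = adj[j][i] = 1
        graphs.append(adj)
    return groups, graphs
```

    In this sketch the connectivity matrix is held constant over time; the weighted variants mentioned in the abstract would replace the Bernoulli draw with another edge distribution.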

    Convergence of the groups posterior distribution in latent or stochastic block models

    We propose a unified framework for studying both latent and stochastic block models, which are used to simultaneously cluster the rows and columns of a data matrix. In this new framework, we study the behaviour of the groups posterior distribution, given the data. We characterize whether it is possible to asymptotically recover the actual groups on the rows and columns of the matrix, relying on a consistent estimate of the parameter. In other words, we establish sufficient conditions for the groups posterior distribution to converge (as the size of the data increases) to a Dirac mass located at the actual (random) groups configuration. In particular, we highlight some cases where the model assumes symmetries in the matrix of connection probabilities that prevent recovering the original groups. We also discuss the validity of these results when the proportion of non-null entries in the data matrix converges to zero. Comment: Published at http://dx.doi.org/10.3150/13-BEJ579 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

    Modeling heterogeneity in random graphs through latent space models: a selective review

    We present a selective review of probabilistic modeling of heterogeneity in random graphs. We focus on latent space models, and more particularly on stochastic block models and their extensions, which have undergone major developments in the last five years.

    On efficient estimators of the proportion of true null hypotheses in a multiple testing setup

    We consider the problem of estimating the proportion θ of true null hypotheses in a multiple testing context. The setup is classically modeled through a semiparametric mixture with two components: a uniform distribution on the interval [0, 1] with prior probability θ and a nonparametric density f. We discuss asymptotic efficiency results and establish that two different cases occur, depending on whether or not f vanishes on a set with non-null Lebesgue measure. In the first case, we exhibit estimators converging at parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e. attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at parametric rate. We illustrate those results on simulated data.
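    To make the first case concrete: if the alternative density f is assumed to vanish on (λ, 1], every p-value above λ comes from the uniform null component, so P(p > λ) = θ(1 − λ) and θ can be estimated by an empirical proportion. The sketch below shows this well-known Storey-type estimator. It is an illustration under that assumption and is not claimed to be the estimator studied in the paper; the function name and the threshold parameter `lam` are hypothetical.

```python
def estimate_null_proportion(p_values, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Assumes (for illustration) that the alternative density vanishes on
    (lam, 1], so that p-values above lam are all drawn from the uniform null.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    n = len(p_values)
    if n == 0:
        raise ValueError("p_values must be non-empty")
    # Count p-values falling in the region where only the null contributes.
    count_above = sum(1 for p in p_values if p > lam)
    # Rescale by the null mass of (lam, 1] and cap at 1, since theta is a proportion.
    return min(1.0, count_above / (n * (1.0 - lam)))
```

    For example, with ten p-values of which three exceed λ = 0.5, the estimate is 3 / (10 × 0.5) = 0.6.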

    Nonparametric estimation of the density of the alternative hypothesis in a multiple testing setup. Application to local false discovery rate estimation

    In a multiple testing context, we consider a semiparametric mixture model with two components where one component is known and corresponds to the distribution of p-values under the null hypothesis and the other component f is nonparametric and stands for the distribution under the alternative hypothesis. Motivated by the issue of local false discovery rate estimation, we focus here on the estimation of the nonparametric unknown component f in the mixture, relying on a preliminary estimator of the unknown proportion θ of true null hypotheses. We propose and study the asymptotic properties of two different estimators for this unknown component. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. To our knowledge, this is the first result establishing convergence as well as corresponding rate for the estimation of the unknown component in this nonparametric mixture. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate. Their respective performances are then compared on synthetic data.
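    For reference, the link between the mixture and the local false discovery rate can be written out as below. This uses the standard definition of the local false discovery rate for p-values with a uniform null; the symbols g and lfdr are notation assumed here and do not appear in the abstract.

```latex
\[
  g(x) = \theta + (1-\theta)\, f(x), \qquad
  \mathrm{lfdr}(x) = \frac{\theta}{g(x)}
                   = \frac{\theta}{\theta + (1-\theta)\, f(x)},
  \qquad x \in [0,1].
\]
```

    A plug-in estimate is then obtained by replacing θ and f with their estimators in this expression.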

    Adaptive procedures in convolution models with known or partially known noise distribution

    In a convolution model, we observe random variables whose distribution is the convolution of some unknown density f and some known or partially known noise density g. In this paper, we focus on statistical procedures that are adaptive with respect to the smoothness parameter tau of the unknown density f, and also (in some cases) to some unknown parameter of the noise density g. In the first part, we assume that g is known and polynomially smooth. We provide goodness-of-fit procedures for the test H_0:f=f_0, where the alternative H_1 is expressed with respect to the L_2-norm. Our adaptive (w.r.t. tau) procedure behaves differently according to whether f_0 is polynomially or exponentially smooth. A payment for adaptation is noted in both cases, and to compute it we provide a non-uniform Berry-Esseen type theorem for degenerate U-statistics. In the first case we prove that the payment for adaptation is optimal (thus unavoidable). In the second part, we study a wider framework: a semiparametric model, where g is exponentially smooth and stable, and its self-similarity index s is unknown. In order to ensure identifiability, we restrict our attention to polynomially smooth, Sobolev-type densities f. In this context, we provide a consistent estimation procedure for s. This estimator is then plugged into three different procedures: estimation of the unknown density f, estimation of the functional \int f^2, and the test of the hypothesis H_0. These procedures are adaptive with respect to both s and tau and attain the rates that are known to be optimal for known values of s and tau. As a by-product, when the noise is known and exponentially smooth, our testing procedure is adaptive for testing Sobolev-type densities. Comment: 35 pages + an 8-page appendix

    Asymptotic normality and efficiency of the maximum likelihood estimator for the parameter of a ballistic random walk in a random environment

    We consider a one-dimensional ballistic random walk evolving in a parametric independent and identically distributed random environment. We study the asymptotic properties of the maximum likelihood estimator of the parameter based on a single observation of the path until the time it reaches a distant site. We prove an asymptotic normality result for this consistent estimator as the distant site tends to infinity and establish that it achieves the Cramér-Rao bound. We also explore in a simulation setting the numerical behaviour of asymptotic confidence regions for the parameter value.

    A semiparametric extension of the stochastic block model for longitudinal networks

    To model recurrent interaction events in continuous time, an extension of the stochastic block model is proposed where every individual belongs to a latent group and interactions between two individuals follow a conditional inhomogeneous Poisson process with intensity driven by the individuals' latent groups. The model is shown to be identifiable and its estimation is based on a semiparametric variational expectation-maximization algorithm. Two versions of the method are developed, using either a nonparametric histogram approach (with an adaptive choice of the partition size) or kernel intensity estimators. The number of latent groups can be selected by an integrated classification likelihood criterion. Finally, we demonstrate the performance of our procedure on synthetic experiments, analyse two datasets to illustrate the utility of our approach and comment on competing methods.

    Identifiability of parameters in latent structure models with many observed variables

    While hidden class models of various types arise in many statistical applications, it is often difficult to establish the identifiability of their parameters. Focusing on models in which there is some structure of independence of some of the observed variables conditioned on hidden ones, we demonstrate a general approach for establishing identifiability using algebraic arguments. A theorem of J. Kruskal for a simple latent-class model with finite state space lies at the core of our results, though we apply it to a diverse set of models. These include mixtures of both finite and nonparametric product distributions, hidden Markov models and random graph mixture models, and lead to a number of new results and improvements to old ones. In the parametric setting, this approach indicates that for such models, the classical definition of identifiability is typically too strong. Instead, generic identifiability holds, which implies that the set of nonidentifiable parameters has measure zero, so that parameter inference is still meaningful. In particular, this sheds light on the properties of finite mixtures of Bernoulli products, which have been used for decades despite being known to have nonidentifiable parameters. In the nonparametric setting, we again obtain identifiability only when certain restrictions are placed on the distributions that are mixed, but we explicitly describe the conditions. Comment: Published at http://dx.doi.org/10.1214/09-AOS689 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)