Convergence of the groups posterior distribution in latent or stochastic block models
We propose a unified framework for studying both latent and stochastic block
models, which are used to cluster simultaneously rows and columns of a data
matrix. In this new framework, we study the behaviour of the groups posterior
distribution, given the data. We characterize whether it is possible to
asymptotically recover the actual groups on the rows and columns of the matrix,
relying on a consistent estimate of the parameter. In other words, we establish
sufficient conditions for the groups posterior distribution to converge (as the
size of the data increases) to a Dirac mass located at the actual (random)
groups configuration. In particular, we highlight some cases where the model
assumes symmetries in the matrix of connection probabilities that prevents
recovering the original groups. We also discuss the validity of these results
when the proportion of non-null entries in the data matrix converges to zero.

Comment: Published at http://dx.doi.org/10.3150/13-BEJ579 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
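As an illustrative sketch (not code from the paper), the kind of latent block model described above can be simulated in a few lines; the group proportions and the matrix of connection probabilities below are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: two row groups, two column groups, and a
# matrix of connection probabilities with no symmetry between blocks
# (symmetric blocks are precisely the cases that prevent recovery).
row_props, col_props = [0.6, 0.4], [0.5, 0.5]
pi = np.array([[0.9, 0.1],
               [0.2, 0.7]])  # connection probability per (row, col) block

n_rows, n_cols = 50, 40
z = rng.choice(2, size=n_rows, p=row_props)  # latent row groups
w = rng.choice(2, size=n_cols, p=col_props)  # latent column groups

# Binary data matrix: entry (i, j) is Bernoulli(pi[z_i, w_j]).
X = rng.binomial(1, pi[np.ix_(z, w)])
print(X.shape)  # (50, 40)
```

The groups posterior studied in the paper is the conditional distribution of (z, w) given X; the sketch only generates one draw from the model.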
Assessing the Distribution Consistency of Sequential Data
Given n observations, we study the consistency of a batch of k new
observations, in terms of their distribution function. We propose a
non-parametric, non-likelihood test based on Edgeworth expansion of the
distribution function. The keypoint is to approximate the distribution of the
n+k observations by the distribution of n-k among the n observations. Edgeworth
expansion gives the correcting term and the rate of convergence. We also study
the discrete distribution case, for which Cram\`er's condition of smoothness is
not satisfied. The rates of convergence for the various cases are compared.

Comment: 20 pages, 0 figures
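For context, a standard first-order Edgeworth expansion of the distribution function of a standardized mean of n i.i.d. observations with skewness \(\gamma_1\) is (the paper's exact correcting term may differ):

```latex
F_n(x) = \Phi(x) - \phi(x)\,\frac{\gamma_1\,(x^2 - 1)}{6\sqrt{n}} + O(n^{-1})
```

The \(O(n^{-1})\) remainder relies on Cramér's smoothness condition, which is why the discrete case mentioned in the abstract requires separate treatment.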
Uncovering latent structure in valued graphs: A variational approach
As more and more network-structured data sets are available, the statistical
analysis of valued graphs has become common place. Looking for a latent
structure is one of the many strategies used to better understand the behavior
of a network. Several methods already exist for the binary case. We present a
model-based strategy to uncover groups of nodes in valued graphs. This
framework can be used for a wide range of parametric random graph models and
allows covariates to be included. Variational tools allow us to achieve approximate
maximum likelihood estimation of the parameters of these models. We provide a
simulation study showing that our estimation method performs well over a broad
range of situations. We apply this method to analyze host--parasite interaction
networks in forest ecosystems.

Comment: Published at http://dx.doi.org/10.1214/10-AOAS361 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
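A valued graph with latent structure, as studied above, can be sketched with Poisson-distributed edge values whose means depend on the latent groups; all parameters below are hypothetical (and covariates are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Poisson-valued random graph with 2 latent node groups:
# the edge value between i and j has a block-dependent mean.
props = [0.5, 0.5]
lam = np.array([[5.0, 0.5],
                [0.5, 3.0]])  # mean edge value per pair of groups

n = 30
z = rng.choice(2, size=n, p=props)       # latent node groups
W = rng.poisson(lam[np.ix_(z, z)])       # valued adjacency matrix
W = np.triu(W, 1)                        # keep upper triangle, no self-loops
W = W + W.T                              # symmetrize for an undirected graph
print(W.shape)  # (30, 30)
```

Model-based approaches like the one in the paper would then estimate the group memberships z and the block parameters lam from W alone.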
A bibliographic review of co-clustering through the latent block model
We present here model-based co-clustering methods, with a focus on the latent block model (LBM). We introduce several specifications of the LBM (standard, sparse, Bayesian) and review some identifiability results. We show how the complex dependency structure prevents standard maximum likelihood estimation and present alternative and popular inference methods. Those estimation methods are based on a tractable approximation of the likelihood and rely on iterative procedures, which makes them difficult to analyze. We nevertheless present some asymptotic results for consistency. The results are partial as they rely on a reasonable but still unproved condition. Likewise, available model selection tools for choosing the number of groups in rows and columns are only valid up to a conjecture. We also briefly discuss non-model-based co-clustering procedures, in particular those based on matrix factorization. Finally, we show how the LBM can be used for bipartite graph analysis, highlight throughout this review its connection to the Stochastic Block Model (SBM), and conclude with a case study illustrating the advantages of co-clustering over simple clustering.
Variational inference for sparse network reconstruction from count data
In multivariate statistics, the question of finding direct interactions can
be formulated as a problem of network inference - or network reconstruction -
for which the Gaussian graphical model (GGM) provides a canonical framework.
Unfortunately, the Gaussian assumption does not apply to count data which are
encountered in domains such as genomics, social sciences or ecology.
To circumvent this limitation, state-of-the-art approaches use two-step
strategies that first transform counts to pseudo Gaussian observations and then
apply a (partial) correlation-based approach from the abundant literature of
GGM inference. We adopt a different stance by relying on a latent model where
we directly model counts by means of Poisson distributions that are conditional
on latent (hidden) correlated Gaussian variables. In this multivariate Poisson
lognormal model, the dependency structure is completely captured by the latent
layer. This parametric model makes it possible to account for the effects of
covariates on the counts.
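A minimal sketch of data generation under such a Poisson lognormal model (with hypothetical parameters; the sparse precision matrix Omega on the latent layer encodes the direct interactions that network inference seeks to recover):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sparse precision matrix on the latent Gaussian layer:
# a chain 1-2-3 plus an isolated variable 4 (zeros = no direct interaction).
Omega = np.array([[2.0, 0.8, 0.0, 0.0],
                  [0.8, 2.0, 0.8, 0.0],
                  [0.0, 0.8, 2.0, 0.0],
                  [0.0, 0.0, 0.0, 2.0]])
Sigma = np.linalg.inv(Omega)

p, n = 4, 200
mu = np.log(np.array([10.0, 5.0, 8.0, 3.0]))  # log-scale means (covariate effects would enter here)

Z = rng.multivariate_normal(mu, Sigma, size=n)  # latent correlated Gaussian layer
Y = rng.poisson(np.exp(Z))                      # observed counts, conditionally independent given Z
print(Y.shape)  # (200, 4)
```

The inference problem described next is the reverse direction: estimating a sparse Omega from the count matrix Y alone.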
To perform network inference, we add a sparsity inducing constraint on the
inverse covariance matrix of the latent Gaussian vector. Unlike the usual
Gaussian setting, the penalized likelihood is generally not tractable, and we
resort instead to a variational approach for approximate likelihood
maximization. The corresponding optimization problem is solved by alternating a
gradient ascent on the variational parameters and a graphical-Lasso step on the
covariance matrix.
We show that our approach is highly competitive with the existing methods on
simulations inspired by microbiological data. We then illustrate on three
different data sets how accounting for sampling efforts via offsets and
integrating external covariates (which is rarely done in the existing
literature) drastically changes the topology of the inferred network.
Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large scale metagenomic projects aim to extract biodiversity
knowledge between different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomic or functional assignment rely on the small subset of sequences
that can be associated with known organisms. On the other hand, de novo
methods, which compare the whole sets of sequences, either do not scale up to
ambitious metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts by
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
parallel k-mer counting strategy over multiple datasets.
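The core idea, replacing species counts by k-mer counts in a standard ecological distance, can be sketched as follows (a toy illustration, not Simka's parallel implementation; sequences and k are hypothetical):

```python
from collections import Counter

def kmer_counts(seq, k=4):
    """Count all k-mers in a sequence: the k-mer profile stands in for a species abundance vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def bray_curtis(c1, c2):
    """Quantitative Bray-Curtis dissimilarity between two k-mer count profiles."""
    shared = sum(min(c1[m], c2[m]) for m in set(c1) | set(c2))
    total = sum(c1.values()) + sum(c2.values())
    return 1 - 2 * shared / total

a = kmer_counts("ACGTACGTACGT")
b = kmer_counts("ACGTTTTTACGT")
print(bray_curtis(a, a))  # 0.0 for identical samples
print(0 <= bray_curtis(a, b) <= 1)  # True
```

In Simka each "sample" is an entire metagenomic dataset, and many such qualitative and quantitative distances are computed jointly over all datasets.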
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute in a few hours both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that distances computed at the k-mer level are
highly correlated with those obtained by extremely precise de novo comparison
techniques that rely on all-versus-all sequence alignment or on taxonomic
profiling.
- …