Variational approximation for mixtures of linear mixed models
Mixtures of linear mixed models (MLMMs) are useful for clustering grouped
data and can be estimated by likelihood maximization through the EM algorithm.
The conventional approach to determining a suitable number of components is to
compare different mixture models using penalized log-likelihood criteria such
as BIC. We propose fitting MLMMs with variational methods, which can perform
parameter estimation and model selection simultaneously. A variational
approximation is described where the variational lower bound and parameter
updates are in closed form, allowing fast evaluation. A new variational greedy
algorithm is developed for model selection and learning of the mixture
components. This approach allows an automatic initialization of the algorithm
and returns a plausible number of mixture components automatically. In cases of
weak identifiability of certain model parameters, we use hierarchical centering
to reparametrize the model and show empirically that there is a gain in
efficiency by variational algorithms similar to that in MCMC algorithms.
Related to this, we prove that the approximate rate of convergence of
variational algorithms by Gaussian approximation is equal to that of the
corresponding Gibbs sampler, which suggests that reparametrizations can lead to
improved convergence in variational algorithms as well.
Comment: 36 pages, 5 figures, 2 tables, submitted to JCG
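As a point of reference for the conventional approach mentioned in this abstract, the following is a minimal sketch of choosing the number of mixture components by BIC. It is not the variational greedy algorithm the authors propose. The log-likelihoods and parameter counts are hypothetical placeholders, not results from the paper, and `bic` and `select_by_bic` are illustrative names.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: -2 log L + k log n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates: dict, n_obs: int):
    """Pick the number of components with the smallest BIC.

    candidates maps number of components -> (maximized log-likelihood,
    number of free parameters). Both are assumed to come from separate
    EM fits, which are not shown here.
    """
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted values for 1, 2 and 3 components (illustration only).
candidates = {1: (-520.0, 4), 2: (-480.0, 9), 3: (-476.0, 14)}
best_k, scores = select_by_bic(candidates, n_obs=200)
```

Each candidate model has to be fitted separately before it can be scored, which is the cost that simultaneous estimation and model selection is meant to avoid.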
Probabilistic Multilevel Clustering via Composite Transportation Distance
We propose a novel probabilistic approach to multilevel clustering problems
based on composite transportation distance, which is a variant of
transportation distance where the underlying metric is Kullback-Leibler
divergence. Our method involves solving a joint optimization problem over
spaces of probability measures to simultaneously discover grouping structures
within groups and among groups. By exploiting the connection of our method to
the problem of finding composite transportation barycenters, we develop fast
and efficient optimization algorithms even for potentially large-scale
multilevel datasets. Finally, we present experimental results with both
synthetic and real data to demonstrate the efficiency and scalability of the
proposed approach.
Comment: 25 pages, 3 figures
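To make the idea of a transportation cost with a Kullback-Leibler ground cost concrete, here is a small sketch for univariate Gaussian components. It only evaluates the cost of a coupling fixed by hand; the distance described in the abstract would minimize over couplings, and the authors' optimization algorithms are not reproduced. The component values and the function names are assumptions made for illustration.

```python
import math


def kl_gauss(m1: float, s1: float, m2: float, s2: float) -> float:
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5


def transport_cost(coupling, comps_a, comps_b) -> float:
    """Cost of moving mass between two sets of Gaussian components.

    coupling[i][j] is the mass sent from component i of the first mixture
    to component j of the second; the ground cost is the KL divergence
    between the two components.
    """
    total = 0.0
    for i, (ma, sa) in enumerate(comps_a):
        for j, (mb, sb) in enumerate(comps_b):
            total += coupling[i][j] * kl_gauss(ma, sa, mb, sb)
    return total


# Two hypothetical mixtures, each with two equally weighted components
# given as (mean, standard deviation).
comps_a = [(0.0, 1.0), (4.0, 1.0)]
comps_b = [(0.0, 1.0), (5.0, 1.0)]
coupling = [[0.5, 0.0], [0.0, 0.5]]  # hand-picked coupling, not optimized
cost = transport_cost(coupling, comps_a, comps_b)
```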
An Experiment with Hierarchical Bayesian Record Linkage
In record linkage (RL), or exact file matching, the goal is to identify the
links between entities with information on two or more files. RL is an
important activity in areas including counting the population, enhancing survey
frames and data, and conducting epidemiological and follow-up studies. RL is
challenging when files are very large, no accurate personal identification (ID)
number is present on all files for all units, and some information is recorded
with error. Without a unique ID number, one must rely on comparisons of names,
addresses, dates, and other information to find the links. Latent class models
can be used to automatically score the value of information for determining
match status. Data for fitting models come from comparisons made within groups
of units that pass initial file blocking requirements. Data distributions can
vary across blocks. This article examines the use of prior information and
hierarchical latent class models in the context of RL.
Comment: 14 pages, 0 figures
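The scoring role of a latent class model can be sketched with the simplest two-class case, in which field comparisons are assumed conditionally independent given match status. This is a generic illustration, not the hierarchical model examined in the article; the agreement probabilities `m` and `u`, the prior `p_match`, and the function name are all hypothetical.

```python
def match_posterior(gamma, m, u, p_match: float) -> float:
    """Posterior probability that a record pair is a match.

    gamma[i] is 1 if the pair agrees on field i, else 0.
    m[i] = P(agree on field i | match), u[i] = P(agree on field i | non-match).
    Fields are assumed conditionally independent given the latent class.
    """
    lik_match = p_match
    lik_non = 1.0 - p_match
    for g, mi, ui in zip(gamma, m, u):
        lik_match *= mi if g else 1.0 - mi
        lik_non *= ui if g else 1.0 - ui
    return lik_match / (lik_match + lik_non)


# Hypothetical parameters for two comparison fields (e.g. name, birth date).
m = [0.9, 0.9]
u = [0.1, 0.1]
agree_all = match_posterior([1, 1], m, u, p_match=0.1)
agree_none = match_posterior([0, 0], m, u, p_match=0.1)
```

In practice `m`, `u` and `p_match` would be estimated from the comparison data within each block, which is where prior information and a hierarchical structure across blocks could enter.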
Semiparametric Bayesian Density Estimation with Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition
Undernutrition, resulting in restricted growth, and quantified here using
height-for-age z-scores, is an important contributor to childhood morbidity and
mortality. Since all levels of mild, moderate and severe undernutrition are of
clinical and public health importance, it is of interest to estimate the shape
of the z-scores' distributions.
We present a finite normal mixture model that uses data on 4.3 million
children to make annual country-specific estimates of these distributions for
under-5-year-old children in the world's 141 low- and middle-income countries
between 1985 and 2011. We incorporate individual-level data when
available, as well as aggregated summary statistics from studies whose
individual-level data could not be obtained. We place a hierarchical Bayesian
probit stick-breaking model on the mixture weights. The model allows for
nonlinear changes in time, and it borrows strength in time, in covariates, and
within and across regional country clusters to make estimates where data are
uncertain, sparse, or missing.
This work addresses three important problems that often arise in the fields
of public health surveillance and global health monitoring. First, data are
always incomplete. Second, different data sources commonly use different
reporting metrics. Last, distributions, and especially their tails, are often
of substantive interest.
Comment: 41 total pages, 6 figures, 1 table
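For readers unfamiliar with probit stick-breaking, the sketch below shows how a finite set of mixture weights can be built from real-valued scores passed through the standard normal CDF. It assumes a truncated construction in which the last component receives the remaining stick; the hierarchical structure over time, covariates and country clusters described in the abstract is not modeled, and the input values are hypothetical.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probit_stick_breaking(alphas):
    """Turn K-1 real-valued scores into K mixture weights that sum to 1.

    Each score alpha_k gives a break proportion Phi(alpha_k); weight k is
    that proportion of whatever stick remains after the earlier breaks.
    The final weight is the leftover stick (truncation assumption).
    """
    weights = []
    remaining = 1.0
    for a in alphas:
        v = std_normal_cdf(a)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


# Hypothetical scores; in a full model these would depend on covariates.
weights = probit_stick_breaking([0.0, 0.0])
```

Because the scores live on the real line, they can be given regression-style or hierarchical priors, which is one reason this construction is convenient for weights that vary by country and year.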
Uncovering distinct protein-network topologies in heterogeneous cell populations
Background: Cell biology research is fundamentally limited by the number of intracellular components, particularly proteins, that can be co-measured in the same cell. Therefore, cell-to-cell heterogeneity in unmeasured proteins can lead to completely different observed relations between the same measured proteins. Attempts to infer such relations in a heterogeneous cell population can yield uninformative average relations if only one underlying biochemical network is assumed. To address this, we developed a method that recursively couples an iterative unmixing process with a Bayesian analysis of each unmixed subpopulation.
Results: Our approach makes it possible to identify the number of distinct cell subpopulations, unmix their corresponding observations, and resolve the network structure of each subpopulation. Using simulations of the MAPK pathway upon EGF and NGF stimulations, we assess the performance of the method. We demonstrate that the presented method identifies the number of subpopulations within a mixture of observations better than clustering approaches, thus correctly resolving the statistical relations between the proteins.
Conclusions: Coupling the unmixing of multiplexed observations with the inference of statistical relations between the measured parameters is essential for the success of both of these processes. Here we present a conceptual and algorithmic solution to achieve such coupling and hence to analyze data obtained from a natural mixture of cell populations. As the technologies and the need for multiplexed measurements are growing in the systems biology era, this work addresses an important current challenge in the analysis of the derived data.
Affiliation: Wieczorek, Jakob. Universitat Dortmund; Germany
Affiliation: Malik Sheriff, Rahuman S.. Institut Max Planck fur Molekulare Physiologie; Germany. Imperial College London; United Kingdom. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Fermin, Yessica. Universitat Dortmund; Germany
Affiliation: Grecco, Hernan Edgardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Física de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Física de Buenos Aires; Argentina. Institut Max Planck fur Molekulare Physiologie; Germany
Affiliation: Zamir, Eli. Institut Max Planck fur Molekulare Physiologie; Germany
Affiliation: Ickstadt, Katja. Universitat Dortmund; Germany
Identifying Mixtures of Mixtures Using Bayesian Estimation
The use of a finite mixture of normal distributions in model-based clustering
allows non-Gaussian data clusters to be captured. However, identifying the clusters
from the normal components is challenging and in general either achieved by
imposing constraints on the model or by using post-processing procedures.
Within the Bayesian framework we propose a different approach based on sparse
finite mixtures to achieve identifiability. We specify a hierarchical prior
where the hyperparameters are carefully selected such that they are reflective
of the cluster structure aimed at. In addition, this prior allows the model to be
estimated using standard MCMC sampling methods. In combination with a
post-processing approach which resolves the label switching issue and results
in an identified model, our approach allows us to simultaneously (1) determine the
number of clusters, (2) flexibly approximate the cluster distributions in a
semi-parametric way using finite mixtures of normals and (3) identify
cluster-specific parameters and classify observations. The proposed approach is
illustrated in two simulation studies and on benchmark data sets.
Comment: 49 pages