
    Variational approximation for mixtures of linear mixed models

    Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data and can be estimated by likelihood maximization through the EM algorithm. The conventional approach to determining a suitable number of components is to compare different mixture models using penalized log-likelihood criteria such as BIC. We propose fitting MLMMs with variational methods, which can perform parameter estimation and model selection simultaneously. A variational approximation is described where the variational lower bound and parameter updates are in closed form, allowing fast evaluation. A new variational greedy algorithm is developed for model selection and learning of the mixture components. This approach initializes the algorithm automatically and returns a plausible number of mixture components. In cases of weak identifiability of certain model parameters, we use hierarchical centering to reparametrize the model and show empirically that variational algorithms gain efficiency in a way similar to MCMC algorithms. Related to this, we prove that the approximate rate of convergence of variational algorithms by Gaussian approximation is equal to that of the corresponding Gibbs sampler, which suggests that reparametrizations can lead to improved convergence in variational algorithms as well. Comment: 36 pages, 5 figures, 2 tables, submitted to JCG
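The conventional approach mentioned in this abstract, comparing fitted mixtures with a penalized log-likelihood criterion, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `bic` and `select_by_bic` are hypothetical, and the log-likelihoods and parameter counts in the usage lines are made-up placeholders, not results from the paper. It uses the standard definition BIC = -2 log L + k log n, where a lower value is preferred.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: -2 log L + k log n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates: dict, n_obs: int):
    """Pick the candidate model with the smallest BIC.

    `candidates` maps a number of mixture components to a tuple
    (maximized log-likelihood, number of free parameters).
    Returns the selected number of components and all BIC scores.
    """
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Placeholder values for illustration only (assumed, not from the paper).
candidates = {1: (-520.0, 3), 2: (-480.0, 7), 3: (-478.0, 11)}
best, scores = select_by_bic(candidates, n_obs=200)
print(best, scores)
```

Each candidate has to be fitted separately before it can be scored, which is the cost that the variational approach described in the abstract aims to avoid by handling estimation and model selection together.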

    Probabilistic Multilevel Clustering via Composite Transportation Distance

    We propose a novel probabilistic approach to multilevel clustering problems based on composite transportation distance, which is a variant of transportation distance where the underlying metric is Kullback-Leibler divergence. Our method involves solving a joint optimization problem over spaces of probability measures to simultaneously discover grouping structures within groups and among groups. By exploiting the connection of our method to the problem of finding composite transportation barycenters, we develop fast and efficient optimization algorithms even for potentially large-scale multilevel datasets. Finally, we present experimental results with both synthetic and real data to demonstrate the efficiency and scalability of the proposed approach. Comment: 25 pages, 3 figures

    An Experiment with Hierarchical Bayesian Record Linkage

    In record linkage (RL), or exact file matching, the goal is to identify the links between entities with information on two or more files. RL is an important activity in areas including counting the population, enhancing survey frames and data, and conducting epidemiological and follow-up studies. RL is challenging when files are very large, no accurate personal identification (ID) number is present on all files for all units, and some information is recorded with error. Without a unique ID number, one must rely on comparisons of names, addresses, dates, and other information to find the links. Latent class models can be used to automatically score the value of information for determining match status. Data for fitting models come from comparisons made within groups of units that pass initial file blocking requirements. Data distributions can vary across blocks. This article examines the use of prior information and hierarchical latent class models in the context of RL. Comment: 14 pages, 0 figures

    Semiparametric Bayesian Density Estimation with Disparate Data Sources: A Meta-Analysis of Global Childhood Undernutrition

    Undernutrition, resulting in restricted growth, and quantified here using height-for-age z-scores, is an important contributor to childhood morbidity and mortality. Since all levels of mild, moderate and severe undernutrition are of clinical and public health importance, it is of interest to estimate the shape of the z-scores' distributions. We present a finite normal mixture model that uses data on 4.3 million children to make annual country-specific estimates of these distributions for under-5-year-old children in the world's 141 low- and middle-income countries between 1985 and 2011. We incorporate individual-level data when available, as well as aggregated summary statistics from studies whose individual-level data could not be obtained. We place a hierarchical Bayesian probit stick-breaking model on the mixture weights. The model allows for nonlinear changes in time, and it borrows strength in time, in covariates, and within and across regional country clusters to make estimates where data are uncertain, sparse, or missing. This work addresses three important problems that often arise in the fields of public health surveillance and global health monitoring. First, data are always incomplete. Second, different data sources commonly use different reporting metrics. Last, distributions, and especially their tails, are often of substantive interest. Comment: 41 total pages, 6 figures, 1 table
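The probit stick-breaking construction named in this abstract can be sketched for a finite truncation as follows. This shows the general technique only and is not the paper's hierarchical model: each break proportion is the standard normal CDF of a real-valued score, and each weight is that proportion of the stick length still remaining. The function names `normal_cdf` and `probit_stick_breaking` are hypothetical, and the scores are plain numbers chosen for illustration, where a full model would place a prior on them.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probit_stick_breaking(scores):
    """Turn K-1 real-valued scores into K mixture weights that sum to 1.

    Weight k is Phi(score_k) times the stick length left after the
    first k-1 breaks; the last weight is whatever length remains.
    """
    weights = []
    remaining = 1.0
    for s in scores:
        proportion = normal_cdf(s)       # break proportion in (0, 1)
        weights.append(remaining * proportion)
        remaining *= 1.0 - proportion    # stick left for later components
    weights.append(remaining)            # final component takes the rest
    return weights


# Scores of zero break each remaining stick in half (illustrative input).
weights = probit_stick_breaking([0.0, 0.0])
print(weights)  # [0.5, 0.25, 0.25]
```

Because the scores are unconstrained real numbers, they can be given a regression or hierarchical structure, which is one reason this parametrization suits models that borrow strength across time and regions.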

    Uncovering distinct protein-network topologies in heterogeneous cell populations

    Background: Cell biology research is fundamentally limited by the number of intracellular components, particularly proteins, that can be co-measured in the same cell. Therefore, cell-to-cell heterogeneity in unmeasured proteins can lead to completely different observed relations between the same measured proteins. Attempts to infer such relations in a heterogeneous cell population can yield uninformative average relations if only one underlying biochemical network is assumed. To address this, we developed a method that recursively couples an iterative unmixing process with a Bayesian analysis of each unmixed subpopulation. Results: Our approach makes it possible to identify the number of distinct cell subpopulations, unmix their corresponding observations and resolve the network structure of each subpopulation. Using simulations of the MAPK pathway upon EGF and NGF stimulations, we assess the performance of the method. We demonstrate that the presented method can identify the number of subpopulations within a mixture of observations better than clustering approaches, thus correctly resolving the statistical relations between the proteins. Conclusions: Coupling the unmixing of multiplexed observations with the inference of statistical relations between the measured parameters is essential for the success of both of these processes. Here we present a conceptual and algorithmic solution to achieve such coupling and hence to analyze data obtained from a natural mixture of cell populations. As the technologies and necessity for multiplexed measurements are rising in the systems biology era, this work addresses an important current challenge in the analysis of the derived data. Affiliation: Wieczorek, Jakob. Universitat Dortmund; Germany. Affiliation: Malik Sheriff, Rahuman S.. Institut Max Planck fur Molekulare Physiologie; Germany. Imperial College London; United Kingdom. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom. Affiliation: Fermin, Yessica. Universitat Dortmund; Germany. Affiliation: Grecco, Hernan Edgardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Física de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Física de Buenos Aires; Argentina. Institut Max Planck fur Molekulare Physiologie; Germany. Affiliation: Zamir, Eli. Institut Max Planck fur Molekulare Physiologie; Germany. Affiliation: Ickstadt, Katja. Universitat Dortmund; Germany

    Identifying Mixtures of Mixtures Using Bayesian Estimation

    The use of a finite mixture of normal distributions in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and is in general achieved either by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior where the hyperparameters are carefully selected such that they reflect the cluster structure aimed at. In addition, this prior allows the model to be estimated using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semi-parametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark data sets. Comment: 49 pages