21,998 research outputs found

Variational approximation for mixtures of linear mixed models

Author: Armagan A.
Attias H.
Booth J.G.
Corduneanu A.
David J. Nott
Dempster A.P.
Meng X.L.
Papaspiliopoulos O.
Sahu S.K.
Scharl T.
Siew Li Tan
Verbeek J.J.
Wang B.
Waterhouse S.
Winn J.
Wu B.
Yeung K.Y.
Publication venue: 'Informa UK Limited'
Publication date: 29/08/2012

Mixtures of linear mixed models (MLMMs) are useful for clustering grouped data and can be estimated by likelihood maximization through the EM algorithm. The conventional approach to determining a suitable number of components is to compare different mixture models using penalized log-likelihood criteria such as BIC. We propose fitting MLMMs with variational methods, which can perform parameter estimation and model selection simultaneously. A variational approximation is described where the variational lower bound and parameter updates are in closed form, allowing fast evaluation. A new variational greedy algorithm is developed for model selection and learning of the mixture components. This approach initializes the algorithm automatically and returns a plausible number of mixture components. In cases of weak identifiability of certain model parameters, we use hierarchical centering to reparametrize the model and show empirically that variational algorithms gain efficiency in a way similar to MCMC algorithms. Related to this, we prove that the approximate rate of convergence of variational algorithms by Gaussian approximation is equal to that of the corresponding Gibbs sampler, which suggests that reparametrizations can lead to improved convergence in variational algorithms as well. Comment: 36 pages, 5 figures, 2 tables, submitted to JCG
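
As a rough illustration of the conventional approach mentioned in this abstract (comparing fitted mixtures by BIC), the following minimal sketch scores candidate models with the standard formula BIC = k ln(n) - 2 ln(L) and keeps the lowest score. It is not the variational method the authors propose. The helper names `bic` and `select_by_bic` and the log-likelihood values are hypothetical and made up for illustration.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def select_by_bic(candidates: dict, n_obs: int):
    """Pick the component count with the smallest BIC.

    candidates maps number of components -> (maximized log-likelihood,
    number of free parameters). Both are assumed to come from models
    that were already fitted elsewhere (for example by EM).
    """
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up fit summaries for 1, 2 and 3 components (illustration only).
fits = {1: (-520.0, 3), 2: (-480.0, 7), 3: (-478.5, 11)}
best_k, scores = select_by_bic(fits, n_obs=200)
```

With these invented numbers the third component improves the log-likelihood only slightly, so the parameter penalty outweighs the gain and the two-component model has the lowest BIC.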

arXiv.org e-Print Archive

Crossref

FigShare

Mixtures of Regression Models for Time-Course Gene Expression Data: Evaluation of Initialization and Random Effects

Author: Bar-Joseph
Bettina Grün
Biernacki
Celeux
Celeux
Cho
Dempster
Diebolt
Fraley
Friedrich Leisch
Grün
Handl
Hubert
Karatzoglou
Leisch
Luan
Ma
Ng
Ng
R Development Core Team
Ramoni
Scharl
Thalamuthu
Theresa Scharl
Wehrens
Publication date: 01/01/2009

Finite mixture models are routinely applied to time course microarray data. Due to the complexity and size of this type of data, the choice of good starting values plays an important role. So far, initialization strategies have only been investigated for data from a mixture of multivariate normal distributions. In this work, several initialization procedures are evaluated for mixtures of regression models with and without random effects in an extensive simulation study on different artificial datasets. Finally, these procedures are also applied to a real dataset from E. coli.

Crossref

Open Access LMU

Research Online

Partial mixture model for tight clustering of gene expression time-course

Author: Li Chang-Tsun
Wilson Roland
Yuan Yinyin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the combination of the partial mixture model and the minimum distance estimator in this field. We show that tight clustering not only is capable of generating a more profound understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion.

Deakin Research Online

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Warwick Research Archives Portal Repository

Bayesian correlated clustering to integrate multiple datasets

Author: Balasubramanian
Barash
Brock
Carlson
Cheng
Cherry
Cho
Cooke
Datta
David L. Wild
Dempster
Friedman
Fritsch
Granovskaia
Green
Harbison
Hubert
Huttenhower
Ideker
Ishwaran
Jackson
Jackson
Jansen
Jim E. Griffin
Kirk
Lee
Liu
Liu
Lockhart
Mistry
Myers
Myers
Neal
Neal
Nieto-Barajas
Paul Kirk
Puig
Rand
Rasmussen
Rasmussen
Reiss
Rhodes
Richard S. Savage
Rigaut
Rogers
Rogers
Rousseau
Santisteban
Savage
Schena
Shen
Solomon
Stark
Suchard
Troyanskaya
Wei
Wong
Yeung
Yuan
Zoubin Ghahramani
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods.

CiteSeerX

Crossref

PubMed Central

Warwick Research Archives Portal Repository

Kent Academic Repository

Network inference and community detection, based on covariance matrices, correlations and test statistics from arbitrary distributions

Author: Bartlett Thomas E.
Publication date: 13/05/2016

In this paper we propose methodology for inference of binary-valued adjacency matrices from various measures of the strength of association between pairs of network nodes, or more generally pairs of variables. This strength of association can be quantified by sample covariance and correlation matrices, and more generally by test statistics and hypothesis test p-values from arbitrary distributions. Community detection methods such as block modelling typically require binary-valued adjacency matrices as a starting point. Hence, a main motivation for the methodology we propose is to obtain binary-valued adjacency matrices from such pairwise measures of strength of association between variables. The proposed methodology is applicable to large high-dimensional data-sets and is based on computationally efficient algorithms. We illustrate its utility in a range of contexts and data-sets.
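
To make the input and output of this problem concrete, here is a deliberately naive sketch that turns pairwise sample correlations into a binary adjacency matrix with a fixed cutoff. This is a swapped-in baseline, not the inference methodology the paper proposes; the function names, the cutoff of 0.8 and the toy data are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def threshold_adjacency(variables, cutoff):
    """Binary, symmetric adjacency matrix with a zero diagonal.

    variables is a list of observation sequences, one per node.
    An edge i-j is set when |correlation(i, j)| >= cutoff.
    """
    p = len(variables)
    adjacency = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(variables[i], variables[j])) >= cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Toy data: the first two variables move together, the third does not.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
adjacency = threshold_adjacency(data, cutoff=0.8)
```

The resulting matrix is the kind of binary input that block-modelling style community detection expects. A fixed cutoff ignores sampling variability, which is one reason a principled inference procedure may be preferable.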

arXiv.org e-Print Archive

UCL Discovery

FigShare

A semi-parametric approach to estimate risk functions associated with multi-dimensional exposure profiles: application to smoking and lung cancer

Author: A Lacourt
AE Gelfand
B Pesch
C Tarnaud
CE Antoniak
D Consonni
D Dahl
D Luce
David I Hastie
DI Ohlssen
H Ishwaran
H Ishwaran
H Ishwaran
H Zhang
Isabelle Stücker
J Molitor
J Peto
JH Lubin
JH Lubin
JS Liu
L Breiman
L Kaufman
Lamiae Azizi
M Abrahamowicz
M Kalli
M Papathomas
M Papathomas
MD Ritchie
P Papaspiliopoulos
PJ Green
R Goel
R Peto
RF MacLehose
SC Lemon
SG Walker
Silvia Liverani
SW Thurston
Sylvia Richardson
W Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 23/10/2013

A common characteristic of environmental epidemiology is the multi-dimensional aspect of exposure patterns, frequently reduced to a cumulative exposure for simplicity of analysis. By adopting a flexible Bayesian clustering approach, we explore the risk function linking exposure history to disease. This approach is applied here to study the relationship between different smoking characteristics and lung cancer in the framework of a population-based case-control study.

Crossref

Springer - Publisher Connector

HAL-Inserm

PubMed Central

Queen Mary Research Online

Brunel University Research Archive

HAL UVSQ