792 research outputs found
Partial mixture model for tight clustering of gene expression time-course
Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to
this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored.
Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate
information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a
simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms.
Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset
under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion
Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm
We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which is available for download from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts which can be used to reproduce the analyses carried out in this paper. These are available from the following URL. https://sites.google.com/site/randomisedbhc/
High-resolution temporal profiling of transcripts during Arabidopsis leaf senescence reveals a distinct chronology of processes and regulation
Leaf senescence is an essential developmental process that impacts dramatically on crop yields and involves altered
regulation of thousands of genes and many metabolic and signaling pathways, resulting in major changes in the leaf. The
regulation of senescence is complex, and although senescence regulatory genes have been characterized, there is little
information on how these function in the global control of the process. We used microarray analysis to obtain a highresolution
time-course profile of gene expression during development of a single leaf over a 3-week period to senescence.
A complex experimental design approach and a combination of methods were used to extract high-quality replicated data
and to identify differentially expressed genes. The multiple time points enable the use of highly informative clustering to
reveal distinct time points at which signaling and metabolic pathways change. Analysis of motif enrichment, as well
as comparison of transcription factor (TF) families showing altered expression over the time course, identify clear groups
of TFs active at different stages of leaf development and senescence. These data enable connection of metabolic
processes, signaling pathways, and specific TF activity, which will underpin the development of network models to
elucidate the process of senescence
Bayesian correlated clustering to integrate multiple datasets
Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct β but often complementary β information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets.
Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDIβs performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques β as well as to non-integrative approaches β demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods
Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.
google.com/site/gaussianbhc
Model-based clustering with data correction for removing artifacts in gene expression data
The NIH Library of Integrated Network-based Cellular Signatures (LINCS)
contains gene expression data from over a million experiments, using Luminex
Bead technology. Only 500 colors are used to measure the expression levels of
the 1,000 landmark genes measured, and the data for the resulting pairs of
genes are deconvolved. The raw data are sometimes inadequate for reliable
deconvolution leading to artifacts in the final processed data. These include
the expression levels of paired genes being flipped or given the same value,
and clusters of values that are not at the true expression level. We propose a
new method called model-based clustering with data correction (MCDC) that is
able to identify and correct these three kinds of artifacts simultaneously. We
show that MCDC improves the resulting gene expression data in terms of
agreement with external baselines, as well as improving results from subsequent
analysis.Comment: 28 page
Recommended from our members
GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution.
MOTIVATION: Many methods have been developed to cluster genes on the basis of their changes in mRNA expression over time, using bulk RNA-seq or microarray data. However, single-cell data may present a particular challenge for these algorithms, since the temporal ordering of cells is not directly observed. One way to address this is to first use pseudotime methods to order the cells, and then apply clustering techniques for time course data. However, pseudotime estimates are subject to high levels of uncertainty, and failing to account for this uncertainty is liable to lead to erroneous and/or over-confident gene clusters. RESULTS: The proposed method, GPseudoClust, is a novel approach that jointly infers pseudotemporal ordering and gene clusters, and quantifies the uncertainty in both. GPseudoClust combines a recent method for pseudotime inference with non-parametric Bayesian clustering methods, efficient Markov Chain Monte Carlo sampling and novel subsampling strategies which aid computation. We consider a broad array of simulated and experimental datasets to demonstrate the effectiveness of GPseudoClust in a range of settings. AVAILABILITY AND IMPLEMENTATION: An implementation is available on GitHub: https://github.com/magStra/nonparametricSummaryPSM and https://github.com/magStra/GPseudoClust. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
Reconstructing regulatory networks from high-throughput post-genomic data using MCMC methods
Modern biological research aims to understand when genes are expressed and
how certain genes in
uence the expression of other genes. For organizing and visualizing
gene expression activity gene regulatory networks are used. The architecture
of these networks holds great importance, as they enable us to identify inconsistencies
between hypotheses and observations, and to predict the behavior of biological
processes in yet untested conditions.
Data from gene expression measurements are used to construct gene regulatory
networks. Along with the advance of high-throughput technologies for measuring
gene expression statistical methods to predict regulatory networks have also
been evolving. This thesis presents a computational framework based on a Bayesian
modeling technique using state space models (SSM) for the inference of gene regulatory
networks from time-series measurements.
A linear SSM consists of observation and hidden state equations. The hidden
variables can unfold effects that cannot be directly measured in an experiment, such
as missing gene expression. We have used a Bayesian MCMC approach based on
Gibbs sampling for the inference of parameters. However the task of determining
the dimension of the hidden state space variables remains crucial for the accuracy
of network inference. For this we have used the Bayesian evidence (or marginal
likelihood) as a yardstick. In addition, the Bayesian approach also provides the
possibility of incorporating prior information, based on literature knowledge.
We compare marginal likelihoods calculated from the Gibbs sampler output
to the lower bound calculated by a variational approximation. Before using the
algorithm for the analysis of real biological experimental datasets we perform validation
tests using numerical experiments based on simulated time series datasets
generated by in-silico networks. The robustness of our algorithm can be measured
by its ability to recapture the input data and generating networks using the inferred
parameters.
Our developed algorithm, GBSSM, was used to infer a gene network using
E. coli data sets from the different stress conditions of temperature shift and acid
stress. The resulting model for the gene expression response under temperature shift
captures the effects of global transcription factors, such as fnr that control the regulation
of hundreds of other genes. Interestingly, we also observe the stress-inducible membrane protein OsmC regulating transcriptional activity involved in the adaptation
mechanism under both temperature shift and acid stress conditions. In the case
of acid stress, integration of metabolomic and transcriptome data suggests that the
observed rapid decrease in the concentration of glycine betaine is the result of the
activation of osmoregulators which may play a key role in acid stress adaptation
Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects its underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins\u27 function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsurpervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
In the second part of this thesis I describe identifying hidden specificity
determining residues (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution
- β¦