73,614 research outputs found

    Evaluation of clustering algorithms for gene expression data

    Get PDF
    BACKGROUND: Cluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes are grouped together according to their expression profiles using one of numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from a rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving a SAGE data of breast cancer patients and the other involving a time course cDNA microarray data on yeast. Six well known clustering algorithms UPGMA, K-Means, Diana, Fanny, Model-Based and SOM were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms

    Two Novel Methods for Clustering Short Time-Course Gene Expression Profiles

    Get PDF
    As genes with similar expression pattern are very likely having the same biological function, cluster analysis becomes an important tool to understand and predict gene functions from gene expression profi les. In many situations, each gene expression profi le only contains a few data points. Directly applying traditional clustering algorithms to such short gene expression profi les does not yield satisfactory results. Developing clustering algorithms for short gene expression profi les is necessary. In this thesis, two novel methods are developed for clustering short gene expression pro files. The fi rst method, called the network-based clustering method, deals with the defect of short gene expression profi les by generating a gene co-expression network using conditional mutual information (CMI), which measures the non-linear relationship between two genes, as well as considering indirect gene relationships in the presence of other genes. The network-based clustering method consists of two steps. A gene co-expression network is firstly constructed from short gene expression profi les using a path consistency algorithm (PCA) based on the CMI between genes. Then, a gene functional module is identi ed in terms of cluster cohesiveness. The network-based clustering method is evaluated on 10 large scale Arabidopsis thaliana short time-course gene expression profi le datasets in terms of gene ontology (GO) enrichment analysis, and compared with an existing method called Clustering with Over-lapping Neighbourhood Expansion (ClusterONE). Gene functional modules identi ed by the network-based clustering method for 10 datasets returns target GO p-values as low as 10-24, whereas the original ClusterONE yields insigni cant results. In order to more speci cally cluster gene expression profi les, a second clustering method, namely the protein-protein interaction (PPI) integrated clustering method, is developed. It is designed for clustering short gene expression profi les by integrating gene expression profi le patterns and curated PPI data. The method consists of the three following steps: (1) generate a number of prede ned profi le patterns according to the number of data points in the profi les and assign each gene to the prede fined profi le to which its expression profi le is the most similar; (2) integrate curated PPI data to refi ne the initial clustering result from (1); (3) combine the similar clusters from (2) to gradually reduce cluster numbers by a hierarchical clustering method. The PPI-integrated clustering method is evaluated on 10 large scale A. thaliana datasets using GO enrichment analysis, and by comparison with an existing method called Short Time-series Expression Miner (STEM). Target gene functional clusters identi ed by the PPI-integrated clustering method for 10 datasets returns GO p-values as low as 10-62, whereas STEM returns GO p-values as low as 10-38. In addition to the method development, obtained clusters by two proposed methods are further analyzed to identify cross-talk genes under fi ve stress conditions in root and shoot tissues. A list of potential abiotic stress tolerant genes are found

    A method to identify differential expression profiles of time-course gene data with Fourier transformation

    Get PDF
    BACKGROUND: Time course gene expression experiments are an increasingly popular method for exploring biological processes. Temporal gene expression profiles provide an important characterization of gene function, as biological systems are both developmental and dynamic. With such data it is possible to study gene expression changes over time and thereby to detect differential genes. Much of the early work on analyzing time series expression data relied on methods developed originally for static data and thus there is a need for improved methodology. Since time series expression is a temporal process, its unique features such as autocorrelation between successive points should be incorporated into the analysis. RESULTS: This work aims to identify genes that show different gene expression profiles across time. We propose a statistical procedure to discover gene groups with similar profiles using a nonparametric representation that accounts for the autocorrelation in the data. In particular, we first represent each profile in terms of a Fourier basis, and then we screen out genes that are not differentially expressed based on the Fourier coefficients. Finally, we cluster the remaining gene profiles using a model-based approach in the Fourier domain. We evaluate the screening results in terms of sensitivity, specificity, FDR and FNR, compare with the Gaussian process regression screening in a simulation study and illustrate the results by application to yeast cell-cycle microarray expression data with alpha-factor synchronization. The key elements of the proposed methodology: (i) representation of gene profiles in the Fourier domain; (ii) automatic screening of genes based on the Fourier coefficients and taking into account autocorrelation in the data, while controlling the false discovery rate (FDR); (iii) model-based clustering of the remaining gene profiles. CONCLUSIONS: Using this method, we identified a set of cell-cycle-regulated time-course yeast genes. The proposed method is general and can be potentially used to identify genes which have the same patterns or biological processes, and help facing the present and forthcoming challenges of data analysis in functional genomics

    DISCLOSE : DISsection of CLusters Obtained by SEries of transcriptome data using functional annotations and putative transcription factor binding sites

    Get PDF
    Background: A typical step in the analysis of gene expression data is the determination of clusters of genes that exhibit similar expression patterns. Researchers are confronted with the seemingly arbitrary choice between numerous algorithms to perform cluster analysis. Results: We developed an exploratory application that benchmarks the results of clustering methods using functional annotations. In addition, a de novo DNA motif discovery algorithm is integrated in our program which identifies overrepresented DNA binding sites in the upstream DNA sequences of genes from the clusters that are indicative of sites of transcriptional control. The performance of our program was evaluated by comparing the original results of a time course experiment with the findings of our application. Conclusion: DISCLOSE assists researchers in the prokaryotic research community in systematically evaluating results of the application of a range of clustering algorithms to transcriptome data. Different performance measures allow to quickly and comprehensively determine the best suited clustering approach for a given dataset.

    Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects

    Full text link
    Time course microarray data provide insight about dynamic biological processes. While several clustering methods have been proposed for the analysis of these data structures, comparison and selection of appropriate clustering methods are seldom discussed. We compared 33 probabilistic based clustering methods and 33 distance based clustering methods for time course microarray data. Among probabilistic methods, we considered: smoothing spline clustering also known as model based functional data analysis (MFDA), functional clustering models for sparsely sampled data (FCM) and model-based clustering (MCLUST). Among distance based methods, we considered: weighted gene co-expression network analysis (WGCNA), clustering with dynamic time warping distance (DTW) and clustering with autocorrelation based distance (ACF). We studied these algorithms in both simulated settings and case study data. Our investigations showed that FCM performed very well when gene curves were short and sparse. DTW and WGCNA performed well when gene curves were medium or long (>=10>=10 observations). SSC performed very well when there were clusters of gene curves similar to one another. Overall, ACF performed poorly in these applications. In terms of computation time, FCM, SSC and DTW were considerably slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more accurate and biological meaningful clustering results. WGCNA and MCLUST are the best methods among the 6 methods compared, when performance and computation time are both taken into account. WGCNA outperforms MCLUST, but MCLUST provides model based inference and uncertainty measure of clustering results

    Partial mixture model for tight clustering of gene expression time-course

    Get PDF
    Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion: For the first time partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm proves the suitability of the ombination of both partial mixture model and minimum distance estimator in this field. We show that tight clustering not only is capable to generate more profound understanding of the dataset under study well in accordance to established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidences that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion

    Penalized Clustering of Large Scale Functional Data with Multiple Covariates

    Full text link
    In this article, we propose a penalized clustering method for large scale data with multiple covariates through a functional data approach. In the proposed method, responses and covariates are linked together through nonparametric multivariate functions (fixed effects), which have great flexibility in modeling a variety of function features, such as jump points, branching, and periodicity. Functional ANOVA is employed to further decompose multivariate functions in a reproducing kernel Hilbert space and provide associated notions of main effect and interaction. Parsimonious random effects are used to capture various correlation structures. The mixed-effect models are nested under a general mixture model, in which the heterogeneity of functional data is characterized. We propose a penalized Henderson's likelihood approach for model-fitting and design a rejection-controlled EM algorithm for the estimation. Our method selects smoothing parameters through generalized cross-validation. Furthermore, the Bayesian confidence intervals are used to measure the clustering uncertainty. Simulation studies and real-data examples are presented to investigate the empirical performance of the proposed method. Open-source code is available in the R package MFDA

    Bayesian correlated clustering to integrate multiple datasets

    Get PDF
    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct – but often complementary – information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured via parameters that describe the agreement among the datasets. Results: Using a set of 6 artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real S. cerevisiae datasets. In the 2-dataset case, we show that MDI’s performance is comparable to the present state of the art. We then move beyond the capabilities of current approaches and integrate gene expression, ChIP-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques – as well as to non-integrative approaches – demonstrate that MDI is very competitive, while also providing information that would be difficult or impossible to extract using other methods

    A temporal precedence based clustering method for gene expression microarray data

    Get PDF
    Background: Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures which may not be suitable for clustering of temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time-series by statistically testing if one time-series can be used for forecasting the other time-series or not. Results: A gene-association matrix is constructed by testing temporal relationships between pairs of genes using the Granger causality test. The association matrix is further analyzed using a graph-theoretic technique to detect highly connected components representing interesting biological modules. We test our approach on synthesized datasets and real biological datasets obtained for Arabidopsis thaliana. We show the effectiveness of our approach by analyzing the results using the existing biological literature. We also report interesting structural properties of the association network commonly desired in any biological system. Conclusions: Our experiments on synthesized and real microarray datasets show that our approach produces encouraging results. The method is simple in implementation and is statistically traceable at each step. The method can produce sets of functionally related genes which can be further used for reverse-engineering of gene circuits
    • …
    corecore