60,868 research outputs found

    CONGO: Clustering on the Gene Ontology

    Full text link
    With the invention of microarray technology, researchers are capable of measuring the expression levels of ten thousands of genes in parallel at various time points of the biological process. During the investigation of gene regulatory networks and general cellular mechanisms, biologists are attempting to group genes based on the time-depending pattern of the obtained expression levels. In this paper, we propose a new memetic algorithm - a genetic algorithm combined with local search-based on a tree representation of the data - a minimum spanning tree minus; for clustering gene expression data. The combination of both concepts is shown to find near-optimal solutions quickly. Due to the minimum spanning tree representation of the data, our algorithm is capable of finding clusters of different shapes. We show that our approach is superior in solution quality compared to classical clustering methods

    Iterative class discovery and feature selection using Minimal Spanning Trees

    Get PDF
    BACKGROUND: Clustering is one of the most commonly used methods for discovering hidden structure in microarray gene expression data. Most current methods for clustering samples are based on distance metrics utilizing all genes. This has the effect of obscuring clustering in samples that may be evident only when looking at a subset of genes, because noise from irrelevant genes dominates the signal from the relevant genes in the distance calculation. RESULTS: We describe an algorithm for automatically detecting clusters of samples that are discernable only in a subset of genes. We use iteration between Minimal Spanning Tree based clustering and feature selection to remove noise genes in a step-wise manner while simultaneously sharpening the clustering. Evaluation of this algorithm on synthetic data shows that it resolves planted clusters with high accuracy in spite of noise and the presence of other clusters. It also shows a low probability of detecting spurious clusters. Testing the algorithm on some well known micro-array data-sets reveals known biological classes as well as novel clusters. CONCLUSIONS: The iterative clustering method offers considerable improvement over clustering in all genes. This method can be used to discover partitions and their biological significance can be determined by comparing with clinical correlates and gene annotations. The MATLAB(© )programs for the iterative clustering algorithm are available fro

    HAMSTER: visualizing microarray experiments as a set of minimum spanning trees

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (<it>e.g. microarray</it>) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions; visually they can be quite different.</p> <p>Methods</p> <p>HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is an open source system for generating a <b>set </b>of MSTs from the experiments of a microarray data set. While previous works have generated a single MST from a data set for data clustering, we recursively merge experiments and repeat this process to obtain a set of MSTs for data visualization. Depending on the parameters chosen, each tree is analogous to a snapshot of one step of the hierarchical clustering process. We scored and ranked these trees using one of three proposed schemes. HAMSTER is implemented in C++ and makes use of Graphviz for laying out each MST.</p> <p>Results</p> <p>We report on the running time of HAMSTER and demonstrate using data sets from the NCBI Gene Expression Omnibus (GEO) that the images created by HAMSTER offer insights that differ from the dendrograms of hierarchical clustering. In addition to the C++ program which is available as open source, we also provided a web-based version (HAMSTER<sup>+</sup>) which allows users to apply our system through a web browser without any computer programming knowledge.</p> <p>Conclusion</p> <p>Researchers may find it helpful to include HAMSTER in their microarray analysis workflow as it can offer insights that differ from hierarchical clustering. We believe that HAMSTER would be useful for certain types of gradient data sets (e.g time-series data) and data that indicate relationships between cells/tissues. Both the source and the web server variant of HAMSTER are available from <url>http://hamster.cbrc.jp/</url>.</p

    Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics

    Get PDF
    Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers the number of clusters that is often close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites. google.com/site/gaussianbhc

    Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping

    Full text link
    We consider the problem of estimating a sparse multi-response regression function, with an application to expression quantitative trait locus (eQTL) mapping, where the goal is to discover genetic variations that influence gene-expression levels. In particular, we investigate a shrinkage technique capable of capturing a given hierarchical structure over the responses, such as a hierarchical clustering tree with leaf nodes for responses and internal nodes for clusters of related responses at multiple granularity, and we seek to leverage this structure to recover covariates relevant to each hierarchically-defined cluster of responses. We propose a tree-guided group lasso, or tree lasso, for estimating such structured sparsity under multi-response regression by employing a novel penalty function constructed from the tree. We describe a systematic weighting scheme for the overlapping groups in the tree-penalty such that each regression coefficient is penalized in a balanced manner despite the inhomogeneous multiplicity of group memberships of the regression coefficients due to overlaps among groups. For efficient optimization, we employ a smoothing proximal gradient method that was originally developed for a general class of structured-sparsity-inducing penalties. Using simulated and yeast data sets, we demonstrate that our method shows a superior performance in terms of both prediction errors and recovery of true sparsity patterns, compared to other methods for learning a multivariate-response regression.Comment: Published in at http://dx.doi.org/10.1214/12-AOAS549 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Techniques for clustering gene expression data

    Get PDF
    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state of the art applications which recognises these limitations and implements procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered

    Seeing the Forest for the Trees: Using the Gene Ontology to Restructure Hierarchical Clustering

    Get PDF
    Motivation: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. Results: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein–protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein–protein interaction data. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.Lynne and William Frankel Center for Computer Science; Paul Ivanier center for robotics research and production; National Institutes of Health (R01 HG003367-01A1
    corecore