    Clustering analysis for gene expression data: a methodological review

    Clustering is one of the most useful tools for microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not been collected in a systematic way. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validation methods and collect a list of benchmark gene expression datasets.

    Clustering of gene expression data: performance and similarity analysis

    BACKGROUND: DNA microarray technology is an innovative methodology in experimental molecular biology that has produced huge amounts of valuable gene expression profile data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research. RESULTS: In this paper we first experimentally study three major clustering algorithms, Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to conduct the similarity analysis of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The results of the similarity analysis show that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. Therefore, it is an effective approach for evaluating different clustering algorithms. CONCLUSION: HC methods allow a visual, convenient representation of genes. However, they are neither robust nor efficient. The SOM is more robust against noise. A disadvantage of SOM is that the number of clusters has to be fixed beforehand. The SOTA combines the advantages of both hierarchical and SOM clustering. It allows a visual representation of the clusters and their structure and is not sensitive to noise. The SOTA is also more flexible than the other two clustering methods. By using our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby enable comparisons of different clustering methods.
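
    The closest-match idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how Cluster Diff measures similarity, so the Jaccard index used here is an assumption, and the function names and gene sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(target, candidates):
    """Return (index, score) of the candidate cluster most similar to target.

    Assumes candidates is non-empty. Jaccard is an illustrative choice,
    not necessarily the measure used by Cluster Diff.
    """
    scores = [jaccard(target, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical clusters of yeast gene names, for illustration only.
hc_cluster = {"CLN1", "CLN2", "SWI4", "RNR1"}
som_clusters = [
    {"CLB1", "CLB2", "CDC20"},
    {"CLN1", "CLN2", "RNR1", "POL1"},
    {"HTA1", "HTB1"},
]
index, score = closest_match(hc_cluster, som_clusters)
print(index, round(score, 2))
```

    Any set-overlap measure could be swapped in for `jaccard` without changing `closest_match`.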

    A mathematical and computational framework for quantitative comparison and integration of large-scale gene expression data

    Analysis of large-scale gene expression studies usually begins with gene clustering. A ubiquitous problem is that different algorithms applied to the same data inevitably give different results, and the differences are often substantial, involving a quarter or more of the genes analyzed. This raises a series of important but nettlesome questions: How are different clustering results related to each other and to the underlying data structure? Is one clustering objectively superior to another? Which differences, if any, are likely candidates to be biologically important? A systematic and quantitative way to address these questions is needed, together with an effective way to integrate and leverage expression results with other kinds of large-scale data and annotations. We developed a mathematical and computational framework to help quantify, compare, visualize and interactively mine clusterings. We show that by coupling confusion matrices with appropriate metrics (linear assignment and normalized mutual information scores), one can quantify and map differences between clusterings. A version of receiver operator characteristic analysis proved effective for quantifying and visualizing cluster quality and overlap. These methods, plus a flexible library of clustering algorithms, can be called from a new expandable set of software tools called CompClust 1.0. CompClust also makes it possible to relate expression clustering patterns to DNA sequence motif occurrences, protein–DNA interaction measurements and various kinds of functional annotations. Test analyses used yeast cell cycle data and revealed data structure not obvious under all algorithms. These results were then integrated with transcription motif and global protein–DNA interaction data to identify G1 regulatory modules.
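
    A minimal sketch of coupling a confusion matrix with linear assignment and normalized mutual information scores, as the abstract above describes. This is not CompClust's API: the function names are hypothetical, the assignment is found by brute force (practical only for a few clusters), and the mutual information is normalized by the geometric mean of the two entropies, which is one common convention and may differ from the paper's.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def confusion_matrix(labels_a, labels_b):
    """Count genes for each (cluster in A, cluster in B) pair."""
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))
    return [[counts[(r, c)] for c in cols] for r in rows]


def linear_assignment_score(matrix):
    """Fraction of genes matched under the best one-to-one cluster pairing."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    size = max(n_rows, n_cols)
    # Pad with zeros to a square matrix so unmatched clusters pair with nothing.
    padded = [[matrix[i][j] if i < n_rows and j < n_cols else 0
               for j in range(size)] for i in range(size)]
    total = sum(map(sum, matrix))
    best = max(sum(padded[i][p[i]] for i in range(size))
               for p in permutations(range(size)))
    return best / total


def normalized_mutual_information(matrix):
    """Mutual information divided by sqrt(H(A) * H(B)) (assumed convention)."""
    total = sum(map(sum, matrix))
    row_p = [sum(row) / total for row in matrix]
    col_p = [sum(col) / total for col in zip(*matrix)]
    mi = 0.0
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            if n:
                p = n / total
                mi += p * log(p / (row_p[i] * col_p[j]))
    h_a = -sum(p * log(p) for p in row_p if p)
    h_b = -sum(p * log(p) for p in col_p if p)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


# Hypothetical cluster labels for eight genes under two algorithms.
labels_a = [0, 0, 0, 1, 1, 1, 2, 2]
labels_b = ["x", "x", "y", "y", "y", "y", "z", "z"]
m = confusion_matrix(labels_a, labels_b)
la = linear_assignment_score(m)
nmi = normalized_mutual_information(m)
print(m, la, round(nmi, 3))
```

    For more than a handful of clusters, the permutation search would need to be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.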

    Optimization based clustering and classification algorithms in analysis of microarray gene expression data sets

    Doctor of Philosophy. Bioinformatics and computational biology are relatively new areas that draw on techniques from computer science, informatics, biochemistry, applied mathematics and other fields to solve biological problems. In recent years the development of new molecular genetics technologies, such as DNA microarrays, has led to the simultaneous measurement of expression levels of thousands and even tens of thousands of genes. Microarray gene expression technology has facilitated the study of genomic structure and the investigation of biological systems. The numerical output of this technology takes the form of microarray gene expression data sets. These data sets contain a very large number of genes and a relatively small number of samples, and their precise analysis requires robust and suitable software. As a result, only a few existing algorithms are applicable to them, so more efficient and computationally inexpensive methods for solving clustering, gene selection and classification problems on gene expression data sets are required. The aim of this thesis is to develop new algorithms for solving clustering, gene selection and data classification problems on gene expression data sets. Clustering in gene expression data sets is a challenging problem. The increasing use of DNA microarray-based tumour gene expression profiles for cancer diagnosis requires more efficient methods to solve clustering problems of these profiles. Different algorithms for clustering of genes have been proposed; however, few algorithms can be applied to the clustering of samples. The k-means algorithm is among the very few clustering algorithms applicable to microarray gene expression data sets; however, it is not efficient for solving clustering problems when the number of genes is in the thousands, and it is very sensitive to the choice of starting point.
Additionally, when the number of clusters is relatively large, this algorithm gives local minima which can differ significantly from the global solution. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm. One of them is the global k-means algorithm; however, this algorithm is not efficient when data are sparse. In this thesis we developed a new version of the global k-means algorithm, the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. In a microarray gene expression data set, in many cases only a small fraction of genes are informative, whereas most of them are non-informative and contribute noise. Therefore the development of gene selection algorithms that allow us to remove as many non-informative genes as possible is very important. In this thesis we developed a new overlapping gene selection algorithm. This algorithm is based on calculating overlaps of different genes. It considerably reduces the number of genes and is efficient in finding a subset of informative genes. Over the last decade different approaches have been proposed to solve supervised data classification problems in gene expression data sets. In this thesis we developed a new approach which is based on the so-called max-min separability and is compared with the other approaches. The max-min separability algorithm is an equivalent of piecewise linear separability. An incremental algorithm is presented to compute piecewise linear functions separating two sets. This algorithm is applied along with a special gene selection algorithm. In this thesis, all new algorithms have been tested on 10 publicly available gene expression data sets, and our numerical results demonstrate the efficiency of the new algorithms that were developed in the framework of this research.

    Modified global k-means algorithm for clustering in gene expression data sets

    Clustering in gene expression data sets is a challenging problem. Different algorithms for clustering of genes have been proposed. However, due to the large number of genes, only a few algorithms can be applied to the clustering of samples. The k-means algorithm and its different variations are among those algorithms. But these algorithms in general can converge only to local minima, and these local minima differ significantly from global solutions as the number of clusters increases. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm and its performance on large data sets. One of them is the global k-means algorithm. In this paper we develop a new version of the global k-means algorithm: the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. We present preliminary computational results using gene expression data sets which demonstrate that the modified global k-means algorithm improves, sometimes significantly, on the results of the k-means and global k-means algorithms.
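
    For orientation, the original global k-means procedure that the abstract above builds on can be sketched as follows: clusters are added one at a time, and for each new cluster every data point is tried as the starting position of the new center while the previous centers are kept. This is a minimal sketch of the baseline only; the abstract does not describe the modification, so none is attempted here. The function names and sample data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from given starting centers.

    Returns (centers, total squared error).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    error = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, error


def global_kmeans(points, k):
    """Incrementally grow from 1 to k centers, trying every point as a seed."""
    points = [tuple(p) for p in points]
    # With one cluster the optimum is the centroid of all points.
    centers = [tuple(sum(col) / len(points) for col in zip(*points))]
    error = sum(sq_dist(p, centers[0]) for p in points)
    for _ in range(1, k):
        best = None
        for candidate in points:
            trial = kmeans(points, centers + [candidate])
            if best is None or trial[1] < best[1]:
                best = trial
        centers, error = best
    return centers, error


# Hypothetical two-dimensional samples forming two well-separated groups.
samples = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
           (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
found, err = global_kmeans(samples, 2)
found = sorted(found)
print(found, err)
```

    Each added cluster costs one k-means run per data point, which makes this exhaustive seeding expensive on large data sets.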

    Gamma-based clustering via ordered means with application to gene-expression analysis

    Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study. Comment: Published at http://dx.doi.org/10.1214/10-AOS805 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Clustering Approaches for Evaluation and Analysis on Formal Gene Expression Cancer Datasets

    The enormous volume of biological data being generated, and the need to analyze it, gave rise to the field of bioinformatics. Data mining is used to analyze such data by exploring its hidden patterns. Although data mining can be applied to biological data such as genomic and proteomic data, here Gene Expression (GE) data is considered for evaluation. GE data is generated from microarrays such as DNA and oligo microarrays. The generated data is analyzed through the clustering techniques of data mining. This study implements the basic clustering approach K-Means and various other clustering approaches such as Hierarchical, SOM, Click and a basic fuzzy-based clustering approach. Finally, a comparative study of those approaches leads to an effective approach for cluster analysis of GE data. The experimental results show that the proposed algorithm achieves higher clustering accuracy and takes less clustering time than existing algorithms.

    AMIC@: All MIcroarray Clusterings @ once

    The AMIC@ Web Server offers a light-weight multi-method clustering engine for microarray gene-expression data. AMIC@ is a highly interactive tool that stresses user-friendliness and robustness by adopting AJAX technology, thus allowing an effective interleaved execution of different clustering algorithms and inspection of results. Salient features of AMIC@ include: (i) automatic file format detection, (ii) suggestions on the number of clusters using a variant of the stability-based method of Tibshirani et al., (iii) intuitive visual inspection of the data via heatmaps and (iv) measurements of the clustering quality using cluster homogeneity. Large data sets can be processed efficiently by selecting algorithms (such as FPF-SB and k-Boost) specifically designed for this purpose. For very large data sets, the user can opt for batch-mode use of the system by means of the Clustering wizard, which runs all algorithms at once and delivers the results via email. AMIC@ is freely available and open to all users with no login requirement at the following URL: http://bioalgo.iit.cnr.it/amica.
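
    Cluster homogeneity, mentioned in the abstract above as a quality measure, is often computed as the average similarity between each expression profile and the centroid of its cluster. The abstract does not give AMIC@'s exact definition, so the Pearson-correlation version below is an assumption, and the function names and profiles are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector has zero variance."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def homogeneity(profiles, labels):
    """Mean correlation between each profile and its cluster centroid.

    Assumed definition for illustration; AMIC@ may use a different one.
    """
    clusters = {}
    for profile, label in zip(profiles, labels):
        clusters.setdefault(label, []).append(profile)
    centroids = {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in clusters.items()
    }
    scores = [pearson(p, centroids[l]) for p, l in zip(profiles, labels)]
    return sum(scores) / len(scores)


# Hypothetical expression profiles: two rising and two falling patterns.
profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [6, 4, 2]]
good = homogeneity(profiles, [0, 0, 1, 1])
bad = homogeneity(profiles, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

    Grouping profiles with the same trend gives a higher score than mixing rising and falling profiles in one cluster.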