576 research outputs found

    OptCluster : an R package for determining the optimal clustering algorithm and optimal number of clusters.

    Determining the best clustering algorithm and ideal number of clusters for a particular dataset is a fundamental difficulty in unsupervised clustering analysis. In biological research, data generated from Next Generation Sequencing technology and microarray gene expression data are becoming more and more common, so new tools and resources are needed to group such high-dimensional data using clustering analysis. Different clustering algorithms can group data very differently. Therefore, there is a need to determine the best groupings in a given dataset using the most suitable clustering algorithm for that data. This paper presents the R package optCluster as an efficient way for users to evaluate up to ten clustering algorithms, ultimately determining the optimal algorithm and optimal number of clusters for a given set of data. The selected clustering algorithms are evaluated by as many as nine validation measures classified as “biological”, “internal”, or “stability”, and the final result is obtained through a weighted rank aggregation algorithm based on the calculated validation scores. Two examples using this package are presented, one with a microarray dataset and the other with an RNA-Seq dataset. These two examples highlight the capabilities of the optCluster package and demonstrate its usefulness as a tool in cluster analysis.
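
    The idea of combining several validation-measure rankings into one consensus ranking can be illustrated with a minimal sketch. It does not reproduce the aggregation algorithm used by optCluster; a simple weighted Borda-style count is swapped in, and the candidate names, the rankings, and the function `aggregate_ranks` are hypothetical. The sketch assumes every list ranks the same candidates, best first.

```python
from collections import defaultdict


def aggregate_ranks(ranked_lists, weights=None):
    """Combine best-first rankings into one list by weighted sum of positions.

    Assumption: a weighted Borda-style count, used here only as a stand-in
    for a rank aggregation procedure; every list ranks the same candidates.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    totals = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for position, candidate in enumerate(ranking, start=1):
            totals[candidate] += weight * position
    # A lower weighted position sum means a better overall rank.
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda c: (totals[c], c))


# Hypothetical rankings of (algorithm, cluster count) candidates,
# one list per validation measure.
rankings = [
    ["kmeans-3", "pam-3", "hier-4"],
    ["pam-3", "kmeans-3", "hier-4"],
    ["kmeans-3", "hier-4", "pam-3"],
]
consensus = aggregate_ranks(rankings)
```

    Passing `weights` lets one measure count more than another, for example when a stability measure is trusted more than an internal one.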

    DeCloud: an unsupervised deconvolution tool for building gene expression profiles

    Deconvolution is the process of decomposing a mixed signal into its originating elements. For my thesis I created a clustering application, named DeCloud, intended to replace the unsupervised clustering step in the deconvolution tool Deblender. Utilizing clustering packages in R such as optCluster, the application was built to allow for a range of new clustering algorithms. In this thesis the scope has been set to test hierarchical clustering and two variations of PAM. A novel filtering function was created, providing a different approach to handling clusters. The novel approach has been implemented for use with the PAM clustering method, but could be applied to other algorithms as well. We have tested the resulting pipeline on the data sets used to benchmark Deblender and other tools. Comparing the results obtained by Deblender and by DeCloud shows that DeCloud obtains markedly better results on two of the three datasets used for testing. The last dataset is a complicated case that none of the applications are able to effectively cluster and deconvolve. The novel filter function applied to the PAM algorithm has been shown to be the best performer in each of the two successful deconvolution datasets. Master's Thesis in Informatics, INF39

    Genetic Algorithms Applied to Multi-Class Clustering for Gene Expression Data

    A hybrid genetic algorithm (GA)-based clustering schema (HGACLUS), which incorporates the merits of simulated annealing, was described for finding an optimal or near-optimal set of medoids. This schema maximized clustering success by achieving internal cluster cohesion and external cluster isolation. The performance of HGACLUS and other methods was compared using simulated data and open microarray gene-expression datasets. HGACLUS was generally found to be more accurate and robust than the other methods discussed in this paper, as judged by the exact validation strategy and the explicit cluster number.
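
    Two generic ingredients of a medoid search guided by simulated annealing can be sketched as follows. This is not the HGACLUS schema, whose genetic operators are not given in the abstract. It shows only a standard medoid cost and the usual Metropolis acceptance rule, on hypothetical one-dimensional data; the names `medoid_cost` and `acceptance_probability` are illustrative.

```python
import math


def medoid_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid (1-D data)."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def acceptance_probability(current_cost, candidate_cost, temperature):
    """Metropolis rule: always accept improvements, sometimes accept worse sets.

    Assumes temperature > 0. Worse candidates become less likely to be
    accepted as the temperature falls.
    """
    if candidate_cost <= current_cost:
        return 1.0
    return math.exp((current_cost - candidate_cost) / temperature)


# Hypothetical data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good_cost = medoid_cost(points, [2.0, 11.0])  # one medoid per group
poor_cost = medoid_cost(points, [1.0, 3.0])   # both medoids in one group
```

    A search loop would propose a new medoid set, compare its cost with the current one, and accept it with the probability returned above while gradually lowering the temperature.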

    Evolutionary framework for DNA Microarray Cluster Analysis

    This work proposes an evolutionary framework that combines a hierarchical clustering method based on an evolutionary model, a set of cluster validation measures, and a clustering visualization tool. The goal is to create a suitable framework for extracting knowledge from DNA microarray data. On the one hand, the framework's evolutionary clustering model is a novel alternative that attempts to solve some of the problems present in existing clustering methods. On the other hand, our clustering visualization alternative, implemented as a tool, incorporates new properties and new visualization components, which makes it possible to validate and analyze the results of the clustering task. In this way, integrating the evolutionary clustering model with the visual clustering model makes our evolutionary framework a novel data mining application compared with conventional methods.

    Evaluation of statistical correlation and validation methods for construction of gene co-expression networks

    High-throughput technologies such as microarrays have led to the rapid accumulation of large scale genomic data providing opportunities to systematically infer gene function and co-expression networks. Typical steps of co-expression network analysis using microarray data consist of estimation of pair-wise gene co-expression using some similarity measure, construction of co-expression networks, identification of clusters of co-expressed genes and post-cluster analyses such as cluster validation. This dissertation is primarily concerned with development and evaluation of approaches for the first and the last steps – estimation of gene co-expression matrices and validation of network clusters. Since clustering methods are not a focus, only a paraclique clustering algorithm will be used in this evaluation. First, a novel Bayesian approach is presented for combining the Pearson correlation with prior biological information from Gene Ontology, yielding a biologically relevant estimate of gene co-expression. The addition of biological information by the Bayesian approach reduced noise in the paraclique gene clusters as indicated by high silhouette and increased homogeneity of clusters in terms of molecular function. Standard similarity measures including correlation coefficients from Pearson, Spearman, Kendall’s Tau, Shrinkage, Partial, and Mutual information, and Euclidean and Manhattan distance measures were evaluated. Based on quality metrics such as cluster homogeneity and stability with respect to ontological categories, clusters resulting from partial correlation and mutual information were more biologically relevant than those from any other correlation measures. Second, statistical quality of clusters was evaluated using approaches based on permutation tests and Mantel correlation to identify significant and informative clusters that capture most of the covariance in the dataset. 
    Third, the utility of statistical contrasts was studied for classification of temporal patterns of gene expression. Specifically, polynomial and Helmert contrast analyses were shown to provide a means of labeling the co-expressed gene sets because they showed similar temporal profiles.
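
    The first step named in this abstract, estimating pair-wise gene co-expression with a similarity measure, can be sketched with the plain Pearson correlation. This is a minimal illustration only: it does not include the Bayesian combination with Gene Ontology information, and the gene names, the expression values, and the helper `coexpression_matrix` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def coexpression_matrix(profiles):
    """Map every ordered gene pair to the correlation of its two profiles."""
    genes = sorted(profiles)
    return {(g, h): pearson(profiles[g], profiles[h]) for g in genes for h in genes}


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
matrix = coexpression_matrix(profiles)
```

    A network would then typically be built by keeping only gene pairs whose correlation exceeds some chosen threshold; that threshold is a modelling choice and is not specified in the abstract.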

    Multi-Objective Differential Evolution for Automatic Clustering with Application to Micro-Array Data Analysis

    This paper applies the Differential Evolution (DE) algorithm to the task of automatic fuzzy clustering in a Multi-objective Optimization (MO) framework. It compares the performances of two multi-objective variants of DE over the fuzzy clustering problem, where two conflicting fuzzy validity indices are simultaneously optimized. The resultant Pareto optimal set of solutions from each algorithm consists of a number of non-dominated solutions, from which the user can choose the most promising ones according to the problem specifications. A real-coded representation of the search variables, accommodating a variable number of cluster centers, is used for DE. The performances of the multi-objective DE variants have also been contrasted with those of the two best-known schemes of MO clustering, namely the Non-dominated Sorting Genetic Algorithm (NSGA-II) and Multi-Objective Clustering with an unknown number of Clusters K (MOCK). Experimental results using six artificial and four real-life datasets of varying complexity indicate that DE holds immense promise as a candidate algorithm for devising MO clustering schemes.
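
    The notion of a non-dominated (Pareto optimal) set when two conflicting indices are optimized together can be made concrete with a short sketch. It assumes both objectives are to be minimized and uses hypothetical score pairs; it is not the DE procedure of the paper, only the dominance filter that any such method relies on.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    Assumption: all objectives are minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (index_1, index_2) values for five candidate clusterings.
scores = [(0.2, 0.9), (0.4, 0.5), (0.5, 0.6), (0.8, 0.1), (0.9, 0.9)]
front = pareto_front(scores)
```

    Here `(0.5, 0.6)` is dropped because `(0.4, 0.5)` is better on both indices, and `(0.9, 0.9)` is dropped because `(0.2, 0.9)` is at least as good on both and better on the first. The remaining three solutions are trade-offs left for the user to choose between.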

    An Experimental Study on Microarray Expression Data from Plants under Salt Stress by using Clustering Methods

    Current genome-wide advances in gene chip technology give “omics” research (genomics, proteomics and transcriptomics) an opportunity to analyze the expression levels of thousands of genes across multiple experiments. Many machine learning approaches have been proposed to deal with this deluge of information, and clustering methods are among them. They group data (gene profiles) into homogeneous clusters using distance measurements. Various clustering techniques are applied, but there is no consensus on the best one. In this context, seven clustering algorithms were compared and tested on gene expression datasets from three model plants under salt stress. The techniques were evaluated with internal and relative validity measures. The AGNES algorithm appears to be the best on internal validity measures for the three plant datasets, while K-Means shows a favorable trend on relative validity measures for these datasets.
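
    As an illustration of what an internal validity measure computes, the sketch below evaluates the mean silhouette width of a labelling using Euclidean distance. The abstract does not say which internal measures were used, so the silhouette is an assumed example, and the points and labels are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette width; values near 1 indicate compact, separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D gene profiles forming two tight groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels split each group
```

    Comparing such a score across algorithms on the same dataset is one way a study of this kind could rank them, with the higher value indicating the better-separated clustering.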