
    Clustering of Gene Expression Data Based on Shape Similarity

    A method for gene clustering from expression profiles using shape information is presented. Conventional clustering approaches such as K-means assume that genes with similar functions have similar expression levels and hence allocate genes with similar expression levels to the same cluster. However, genes with similar functions often exhibit similarity in signal shape even though their expression magnitudes can be far apart. This investigation therefore studies clustering according to signal shape similarity. The shape information is captured in the form of normalized and time-scaled forward first differences, which are then subjected to variational Bayes clustering plus a non-Bayesian (Silhouette) cluster statistic. The statistic shows an improved ability to identify the correct number of clusters and to assign the components of each cluster. Based on initial results for both generated test data and Escherichia coli microarray expression data, and on initial validation of the Escherichia coli results, it is shown that the method has promise for better clustering of time-series microarray data according to shape similarity.
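The shape representation described in this abstract can be illustrated with a minimal sketch. It assumes that "time-scaled" means dividing each forward difference by the sampling interval and that "normalized" means scaling the resulting vector to unit Euclidean length; the abstract does not state the exact normalization, so both choices and the function name `shape_features` are assumptions for illustration only.

```python
import math


def shape_features(times, values):
    """Forward first differences divided by the time step, scaled to unit length.

    Assumption: unit Euclidean norm is used as the normalization; the
    abstract does not specify which normalization the authors apply.
    """
    if len(times) != len(values) or len(values) < 2:
        raise ValueError("need at least two samples with matching time points")
    slopes = []
    for i in range(len(values) - 1):
        dt = times[i + 1] - times[i]
        if dt <= 0:
            raise ValueError("time points must be strictly increasing")
        slopes.append((values[i + 1] - values[i]) / dt)
    norm = math.sqrt(sum(s * s for s in slopes))
    if norm == 0.0:
        # A flat profile has no shape information; return zeros.
        return [0.0] * len(slopes)
    return [s / norm for s in slopes]


# Two profiles with the same shape but a tenfold difference in magnitude.
low = shape_features([0, 1, 2, 4], [1.0, 2.0, 3.0, 2.0])
high = shape_features([0, 1, 2, 4], [10.0, 20.0, 30.0, 20.0])
```

Under these assumptions the two profiles map to the same feature vector, which is the property that lets a clustering step group genes by shape rather than by expression level.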

    Gene expression data analysis using novel methods: Predicting time delayed correlations and evolutionarily conserved functional modules

    Microarray technology enables the study of gene expression on a large scale. One of the main challenges has been to devise methods to cluster genes that share similar expression profiles. In gene expression time courses, a particular gene may encode a transcription factor and thus control several downstream genes; in this case, the gene expression profiles may be staggered, indicating a time-delayed response in the transcription of the later genes. The standard clustering algorithms consider gene expression profiles in a global way, thus often ignoring such local time-delayed correlations. We have developed novel methods to capture time-delayed correlations between expression profiles: (1) a method using dynamic programming and (2) CLARITY, an algorithm that uses a local shape-based similarity measure to predict time-delayed correlations and local correlations. We used CLARITY on a dataset describing the change in gene expression during the mitotic cell cycle in Saccharomyces cerevisiae. The obtained clusters were significantly enriched with genes that share similar functions, reflecting the fact that genes with a similar function are often co-regulated and thus co-expressed. Time-shifted as well as local correlations could also be predicted using CLARITY. In datasets where the expression profiles of independent experiments are compared, the standard clustering algorithms often cluster according to all conditions, considering all genes. This increases the background noise and can cause genes that change their expression only under particular conditions to be missed. We have employed a genetic-algorithm-based module predictor that is capable of identifying groups of genes that change their expression only in a subset of conditions. With the aim of supplementing the Ustilago maydis genome annotation, we have used the module prediction algorithm on various independent datasets from Ustilago maydis.
The predicted modules were cross-referenced in various Saccharomyces cerevisiae datasets to check their evolutionary conservation between these two organisms. The key contributions of this thesis are novel methods that explore biological information from DNA microarray data.
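The idea of a time-delayed correlation can be illustrated with a plain lagged Pearson scan. This is not the dynamic programming method or the CLARITY algorithm named in the abstract, whose details are not given here; it is only a baseline sketch, and the function names `pearson` and `best_lag` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_lag(regulator, target, max_lag):
    """Return (lag, r) maximizing the correlation of regulator[t] with target[t + lag]."""
    n = len(regulator)
    if len(target) != n or n - max_lag < 3:
        raise ValueError("profiles must match and overlap in at least 3 points")
    best = (0, float("-inf"))
    for lag in range(max_lag + 1):
        # Drop the last `lag` points of the regulator and the first `lag` of the target.
        r = pearson(regulator[: n - lag], target[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# The target repeats the regulator's pulse two time points later.
regulator = [0, 1, 4, 2, 0, 0, 0, 0]
target = [0, 0, 0, 1, 4, 2, 0, 0]
lag, r = best_lag(regulator, target, max_lag=3)
```

A global similarity measure compares the two profiles only at lag 0 and therefore scores this pair poorly, while the scan recovers the two-step delay.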

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    Learning Feature Weights for Density-Based Clustering

    K-Means is the most popular and widely used clustering algorithm. This algorithm cannot recover non-spherical clusters in data sets. DBSCAN is arguably the most popular algorithm for recovering clusters of arbitrary shape, which is why there is great interest in tackling the weaknesses of this density-based clustering algorithm. One issue of concern is that DBSCAN requires two parameters and cannot recover clusters of widely variable density. The problem that lies at the heart of this thesis is that, during the clustering process, DBSCAN takes all the available features and treats them equally regardless of their degree of relevance in the data set, which can have negative impacts. This thesis addresses the above problems by laying the foundation of feature-weighted density-based clustering. Specifically, the thesis introduces a density-based clustering algorithm using reverse nearest neighbour, DBSCANR, that requires fewer parameters than DBSCAN for recovering clusters. DBSCANR is based on the insight that in real-world data sets the densities of the arbitrary-shape clusters to be recovered within a data set are very different from each other. The thesis extends DBSCANR to what is referred to as weighted DBSCANR, W-DBSCANR, by exploiting a feature weighting technique to give different levels of relevance to the features in a data set. The thesis extends W-DBSCANR further by using the Minkowski metric so that the weights can be interpreted as feature re-scaling factors; the resulting algorithm is named MW-DBSCANR. Experiments on both artificial and real-world data sets demonstrate the superiority of our method over DBSCAN-type algorithms. These weighted algorithms considerably reduce the impact of irrelevant features while recovering arbitrary-shape clusters of different levels of density in a high-dimensional data set.
Within this context, this thesis incorporates a popular algorithm, feature selection using feature similarity (FSFS), into both W-DBSCANR and MW-DBSCANR to address the problem of feature selection. This unsupervised feature selection algorithm makes use of feature clustering and feature similarity to reduce the number of features in a data set. With a similar aim, exploiting the concept of feature similarity, the thesis introduces a method, density-based feature selection using feature similarity (DBFSFS), that takes density-based cluster structure into consideration when reducing the number of features in a data set. This thesis then applies the developed method to real-world high-dimensional gene expression data sets. DBFSFS improves cluster recovery by substantially reducing the number of features in high-dimensional, low-sample-size data sets.
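The remark that weights under a Minkowski metric act as feature re-scaling factors can be made concrete with a small sketch. The formula below is one common form of weighted Minkowski distance and is an assumption; the abstract does not give the exact distance used in MW-DBSCANR or how the weights are learned, and the function name `weighted_minkowski` is hypothetical.

```python
def weighted_minkowski(x, y, weights, p):
    """Distance (sum_v (w_v * |x_v - y_v|)**p) ** (1/p).

    Assumption: each weight multiplies its feature before the exponent is
    applied, so a weight is literally a re-scaling factor for that feature.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a metric")
    total = sum((w * abs(a - b)) ** p for a, b, w in zip(x, y, weights))
    return total ** (1.0 / p)


# The third feature is irrelevant noise that dominates the unweighted distance.
x = [0.0, 0.0, 100.0]
y = [3.0, 4.0, 0.0]
unweighted = weighted_minkowski(x, y, [1.0, 1.0, 1.0], p=2)
weighted = weighted_minkowski(x, y, [1.0, 1.0, 0.0], p=2)
```

With a zero weight on the noisy feature, the distance depends only on the two relevant features, which illustrates how down-weighting can limit the influence of irrelevant features on a density-based neighbourhood query.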

    Elucidation of Directionality for Co-Expressed Genes: Predicting Intra-Operon Termination Sites

    We present a novel framework for inferring regulatory and sequence-level information from gene co-expression networks. The key idea of our methodology is the systematic integration of network inference and network topological analysis approaches for uncovering biological insights. We determine the gene co-expression network of Bacillus subtilis using Affymetrix GeneChip time series data and show how the inferred network topology can be linked to sequence-level information hard-wired in the organism's genome. We propose a systematic way of determining the correlation threshold at which two genes are assessed to be co-expressed by using the clustering coefficient, and we expand the scope of the gene co-expression network by proposing the slope ratio metric as a means of incorporating directionality on the edges. We show through specific examples for B. subtilis that by incorporating expression level information in addition to the temporal expression patterns, we can uncover sequence-level biological insights. In particular, we are able to identify a number of cases where (i) the co-expressed genes are part of a single transcriptional unit or operon and (ii) the inferred directionality arises due to the presence of intra-operon transcription termination sites. Comment: 7 pages, 8 figures, accepted in Bioinformatics

    Measuring Cluster Stability for Bayesian Nonparametrics Using the Linear Bootstrap

    Clustering procedures typically estimate which data points are clustered together, a quantity of primary importance in many analyses. Because clustering is often used as a preliminary step for dimensionality reduction or to facilitate interpretation, finding robust and stable clusters is often crucial for appropriate downstream analysis. In the present work, we consider Bayesian nonparametric (BNP) models, a particularly popular set of Bayesian models for clustering due to their flexibility. Because of its complexity, the Bayesian posterior often cannot be computed exactly, and approximations must be employed. Mean-field variational Bayes (MFVB) forms a posterior approximation by solving an optimization problem and is widely used due to its speed. An exact BNP posterior might vary dramatically when presented with different data. As such, the stability and robustness of the clustering should be assessed. A popular means of assessing stability is to apply the bootstrap by resampling the data and rerunning the clustering for each simulated data set. The time cost is thus often very high, especially for the sort of exploratory analysis where clustering is typically used. We propose to use a fast and automatic approximation to the full bootstrap called the "linear bootstrap", which can be seen by local data perturbation. In this work, we demonstrate how to apply this idea to a data analysis pipeline, consisting of an MFVB approximation to a BNP clustering posterior of time course gene expression data. We show that using auto-differentiation tools, the necessary calculations can be done automatically, and that the linear bootstrap is a fast but approximate alternative to the bootstrap. Comment: 9 pages, NIPS 2017 Advances in Approximate Bayesian Inference Workshop
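The resample-and-recluster procedure that this abstract describes as the expensive baseline can be sketched as follows. This is the full bootstrap, not the linear bootstrap the authors propose, and a tiny one-dimensional 2-means stands in for the MFVB approximation to a BNP model purely to keep the example short; the helper names and the stand-in clustering are assumptions.

```python
import random


def two_means_1d(xs, iters=50):
    """Tiny 1-D k-means with k=2, initialised at the minimum and maximum value."""
    centers = [min(xs), max(xs)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def coclustering_frequency(data, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each pair of points shares a cluster.

    Only replicates that contain both points of a pair are counted for that pair.
    """
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    both = [[0] * n for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        labels = two_means_1d([data[i] for i in idx])
        label_of = dict(zip(idx, labels))  # repeated draws of a point share one label
        for a in label_of:
            for b in label_of:
                both[a][b] += 1
                together[a][b] += int(label_of[a] == label_of[b])
    return [
        [together[a][b] / both[a][b] if both[a][b] else float("nan") for b in range(n)]
        for a in range(n)
    ]


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
freq = coclustering_frequency(data)
```

Each replicate requires a complete rerun of the clustering, so the cost grows linearly with the number of replicates; this is the expense that motivates a faster approximation.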