
    SGC: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks

    A widely used approach for extracting information from gene expression data employs the construction of a gene co-expression network and the subsequent application of algorithms that discover network structure. In particular, a common goal is the computational discovery of gene clusters, commonly called modules. When applied to a novel gene expression dataset, the quality of the computed modules can be evaluated automatically using Gene Ontology enrichment, a method that measures the frequencies of Gene Ontology terms in the computed modules and evaluates their statistical likelihood. In this work we propose SGC, a novel pipeline for gene clustering based on relatively recent seminal work in the mathematics of spectral network theory. SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. Unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology information in a semi-supervised clustering method, which further improves the quality of the computed modules. Comparing with well-known existing frameworks, we show that SGC results in higher enrichment in real data. In particular, in 12 real gene expression datasets, SGC outperforms in all except one
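The Gene Ontology enrichment evaluation mentioned in this abstract is often carried out with a one-sided hypergeometric test. The sketch below illustrates that common choice; it is an assumption, not something the abstract confirms as the test used in SGC, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population_size, annotated_in_population,
                       module_size, annotated_in_module):
    """One-sided hypergeometric p-value for a single GO term.

    Returns the probability of seeing at least `annotated_in_module`
    annotated genes in a module of `module_size` genes drawn at random
    from `population_size` genes, of which `annotated_in_population`
    carry the term. All names are illustrative assumptions.
    """
    upper = min(annotated_in_population, module_size)
    total = comb(population_size, module_size)
    # Sum the upper tail: k annotated genes and (module_size - k) others.
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, module_size - k)
        for k in range(annotated_in_module, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 3 carry the term, and a 3-gene module holds all 3.
p = enrichment_p_value(10, 3, 3, 3)
print(p)
```

A smaller p-value means the term is more over-represented in the module than chance would suggest. In practice the p-values for many terms would also need a multiple-testing correction.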

    A new unsupervised gene clustering algorithm based on the integration of biological knowledge into expression data

    BACKGROUND: Gene clustering algorithms are widely used by biologists when analysing omics data. Classical gene clustering strategies are based on the use of expression data only, either directly, as in heatmaps, or indirectly, as in clustering based on coexpression networks. However, the classical strategies may not be sufficient to bring out all potential relationships amongst genes. RESULTS: We propose a new unsupervised gene clustering algorithm based on the integration of external biological knowledge, such as Gene Ontology annotations, into expression data. We introduce a new distance between genes that integrates biological knowledge into the analysis of expression data. Under this distance, two genes are close if they have both similar expression profiles and similar functional profiles. A classical algorithm (e.g. K-means) is then used to obtain gene clusters. In addition, we propose an automatic evaluation procedure for gene clusters. This procedure is based on two indicators that measure the global coexpression and the biological homogeneity of gene clusters. Each indicator is associated with a hypothesis test, which complements it with a p-value. Our clustering algorithm is compared to heatmap clustering and to clustering based on gene coexpression networks, on both simulated and real data. In both cases, it outperforms the other methodologies, as it provides the highest proportion of significantly coexpressed and biologically homogeneous gene clusters, which are good candidates for interpretation. CONCLUSION: Our new clustering algorithm provides a higher proportion of good candidates for interpretation. Therefore, we expect the interpretation of these clusters to help biologists formulate new hypotheses on the relationships amongst genes
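To make the idea of a distance that mixes expression and functional similarity concrete, here is a minimal sketch. The abstract does not give the authors' formula, so every choice here is an assumption made for illustration: the correlation-based expression distance, the Jaccard distance on GO term sets, the linear mix, and the `weight` parameter.

```python
from math import sqrt


def expression_distance(x, y):
    """1 - Pearson correlation of two expression profiles, in [0, 2]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # flat profile: treat as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def functional_distance(terms_a, terms_b):
    """Jaccard distance between two sets of GO term identifiers."""
    union = terms_a | terms_b
    if not union:
        return 1.0  # no annotation at all: treat as maximally distant
    return 1.0 - len(terms_a & terms_b) / len(union)


def combined_distance(x, y, terms_a, terms_b, weight=0.5):
    """Weighted mix of the two distances, each rescaled to [0, 1]."""
    return (weight * expression_distance(x, y) / 2.0
            + (1.0 - weight) * functional_distance(terms_a, terms_b))


# Co-expressed genes sharing all annotations are close to 0 apart.
near = combined_distance([1, 2, 3], [2, 4, 6], {"GO:1"}, {"GO:1"})
# Anti-correlated genes with disjoint annotations are close to 1 apart.
far = combined_distance([1, 2, 3], [3, 2, 1], {"GO:1"}, {"GO:2"})
print(near, far)
```

A matrix of such pairwise distances could then be passed to a standard clustering algorithm, in line with the abstract's mention of K-means as the final step.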

    ECgene: an alternative splicing database update

    ECgene was developed to provide functional annotation for alternatively spliced genes. The applications encompass genome-based transcript modeling for alternative splicing (AS), domain analysis with Gene Ontology (GO) annotation, and expression analysis based on EST and SAGE data. We have expanded ECgene's AS modeling and EST clustering to nine organisms for which sufficient EST data are available in GenBank. For the human genome, we have also introduced several new applications to analyze differential expression. ECprofiler is an ontology-based candidate gene search system that allows users to select an arbitrary combination of gene expression patterns and GO functional categories. DEGEST is a database of differentially expressed genes and isoforms based on EST information. Importantly, gene expression is analyzed at three distinct levels: gene, isoform and exon. The user interfaces for functional and expression analyses have been substantially improved. ASviewer is a dedicated Java application that visualizes the transcript structure and functional features of alternatively spliced variants. The SAGE part of the expression module provides many additional features, including SNP, differential expression and alternative tag positions

    Two Novel Methods for Clustering Short Time-Course Gene Expression Profiles

    As genes with similar expression patterns are very likely to have the same biological function, cluster analysis has become an important tool for understanding and predicting gene functions from gene expression profiles. In many situations, each gene expression profile contains only a few data points. Directly applying traditional clustering algorithms to such short gene expression profiles does not yield satisfactory results, so clustering algorithms designed for short gene expression profiles are needed. In this thesis, two novel methods are developed for clustering short gene expression profiles. The first method, called the network-based clustering method, addresses the limitations of short gene expression profiles by generating a gene co-expression network using conditional mutual information (CMI), which measures the non-linear relationship between two genes while accounting for indirect gene relationships in the presence of other genes. The network-based clustering method consists of two steps. First, a gene co-expression network is constructed from short gene expression profiles using a path consistency algorithm (PCA) based on the CMI between genes. Then, a gene functional module is identified in terms of cluster cohesiveness. The network-based clustering method is evaluated on 10 large-scale Arabidopsis thaliana short time-course gene expression profile datasets in terms of gene ontology (GO) enrichment analysis, and compared with an existing method called Clustering with Overlapping Neighbourhood Expansion (ClusterONE). Gene functional modules identified by the network-based clustering method for the 10 datasets return target GO p-values as low as 10^-24, whereas the original ClusterONE yields insignificant results. In order to cluster gene expression profiles more specifically, a second clustering method, namely the protein-protein interaction (PPI) integrated clustering method, is developed. It is designed for clustering short gene expression profiles by integrating gene expression profile patterns and curated PPI data. The method consists of the following three steps: (1) generate a number of predefined profile patterns according to the number of data points in the profiles and assign each gene to the predefined profile to which its expression profile is most similar; (2) integrate curated PPI data to refine the initial clustering result from (1); (3) combine similar clusters from (2) to gradually reduce the number of clusters using a hierarchical clustering method. The PPI-integrated clustering method is evaluated on 10 large-scale A. thaliana datasets using GO enrichment analysis, and by comparison with an existing method called Short Time-series Expression Miner (STEM). Target gene functional clusters identified by the PPI-integrated clustering method for the 10 datasets return GO p-values as low as 10^-62, whereas STEM returns GO p-values as low as 10^-38. In addition to the method development, the clusters obtained by the two proposed methods are further analyzed to identify cross-talk genes under five stress conditions in root and shoot tissues. A list of potential abiotic stress tolerant genes is found

    Comprehensive evaluation of matrix factorization methods for the analysis of DNA microarray gene expression data

    Background: Clustering-based methods for gene-expression analysis have been shown to be useful in biomedical applications such as cancer subtype discovery. Among them, matrix factorization (MF) is advantageous for clustering gene expression patterns from DNA microarray experiments, as it efficiently reduces the dimension of gene expression data. Although several MF methods have been proposed for clustering gene expression patterns, a systematic evaluation has not yet been reported. Results: Here we evaluated the clustering performance of orthogonal and non-orthogonal MFs using a total of nine performance measurements on four gene expression datasets and one well-known clustering dataset. Specifically, we employed a non-orthogonal MF algorithm, BSNMF (Bi-directional Sparse Non-negative Matrix Factorization), that applies bi-directional sparseness constraints superimposed on non-negative constraints, comprising a few dominantly co-expressed genes and samples together. Non-orthogonal MFs tended to show better clustering-quality and prediction-accuracy indices than orthogonal MFs as well as a traditional method, K-means. Moreover, BSNMF showed improved performance in these measurements. Non-orthogonal MFs, including BSNMF, also showed good performance in the functional enrichment test using Gene Ontology terms and biological pathways. Conclusions: The clustering performance of orthogonal and non-orthogonal MFs was evaluated for clustering microarray data by comprehensive measurements. This study showed that non-orthogonal MFs perform better than orthogonal MFs and K-means for clustering microarray data.

    Statistical inference from large-scale genomic data

    This thesis explores the potential of statistical inference methodologies in their applications in functional genomics. In essence, it summarises algorithmic findings in this field, providing step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series. This thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus involves using clustering as a method of inference and another is cluster validation to extract meaningful biological information from the data. Information gained from the application of these various techniques can then be used conjointly in the elucidation of gene regulatory networks, the ultimate goal of this type of analysis. First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in clustering validation, a validity index is defined based on one of the most important ontologies within the Bioinformatics community, Gene Ontology. The method bridges a gap in current literature, in the sense that it takes into account not only the variations of Gene Ontology categories in biological specificities and their significance to the gene clusters, but also the complex structure of the Gene Ontology. Finally, Bayesian probability is applied to making inference from heterogeneous genomic data, integrated with previous efforts in this thesis, for the aim of large-scale gene network inference. The proposed system comes with a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis. Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome

    Partial mixture model for tight clustering of gene expression time-course

    Background: Tight clustering arose recently from a desire to obtain tighter and potentially more informative clusters in gene expression studies. Scattered genes with relatively loose correlations should be excluded from the clusters. However, in the literature there is little work dedicated to this area of research. On the other hand, there has been extensive use of maximum likelihood techniques for model parameter estimation. By contrast, the minimum distance estimator has been largely ignored. Results: In this paper we show the inherent robustness of the minimum distance estimator that makes it a powerful tool for parameter estimation in model-based time-course clustering. To apply minimum distance estimation, a partial mixture model that can naturally incorporate replicate information and allow scattered genes is formulated. We provide experimental results of simulated data fitting, where the minimum distance estimator demonstrates superior performance to the maximum likelihood estimator. Both biological and statistical validations are conducted on a simulated dataset and two real gene expression datasets. Our proposed partial regression clustering algorithm scores top in Gene Ontology driven evaluation, in comparison with four other popular clustering algorithms. Conclusion: For the first time, the partial mixture model is successfully extended to time-course data analysis. The robustness of our partial regression clustering algorithm demonstrates the suitability of combining the partial mixture model and the minimum distance estimator in this field. We show that tight clustering not only is capable of generating a more profound understanding of the dataset under study, in accordance with established biological knowledge, but also presents interesting new hypotheses during interpretation of clustering results. In particular, we provide biological evidence that scattered genes can be relevant and are interesting subjects for study, in contrast to prevailing opinion

    VisHiC—hierarchical functional enrichment analysis of microarray data

    Measuring gene expression levels with microarrays is one of the key technologies of modern genomics. Clustering of microarray data is an important application, as genes with similar expression profiles may be regulated by common pathways and involved in related functions. Gene Ontology (GO) analysis and visualization allow researchers to study the biological context of discovered clusters and characterize genes with previously unknown functions. We present VisHiC (Visualization of Hierarchical Clustering), a web server for clustering and compact visualization of gene expression data combined with automated function enrichment analysis. The main output of the analysis is a dendrogram and visual heatmap of the expression matrix that highlights biologically relevant clusters based on enriched GO terms, pathways and regulatory motifs. Clusters with the most significant enrichments are contracted in the final visualization, while less relevant parts are hidden altogether. Such a dense representation of microarray data gives a quick global overview of thousands of transcripts in many conditions and provides a good starting point for further analysis. VisHiC is freely available at http://biit.cs.ut.ee/vishic

    Array2BIO: from microarray expression data to functional annotation of co-regulated genes

    BACKGROUND: There are several isolated tools for partial analysis of microarray expression data. To provide an integrative, easy-to-use and automated toolkit for the analysis of Affymetrix microarray expression data we have developed Array2BIO, an application that couples several analytical methods into a single web based utility. RESULTS: Array2BIO converts raw intensities into probe expression values, automatically maps those to genes, and subsequently identifies groups of co-expressed genes using two complementary approaches: (1) comparative analysis of signal versus control and (2) clustering analysis of gene expression across different conditions. The identified genes are assigned to functional categories based on Gene Ontology classification and KEGG protein interaction pathways. Array2BIO reliably handles low-expressor genes and provides a set of statistical methods for quantifying expression levels, including Benjamini-Hochberg and Bonferroni multiple testing corrections. An automated interface with the ECR Browser provides evolutionary conservation analysis for the identified gene loci while the interconnection with Crème allows prediction of gene regulatory elements that underlie observed expression patterns. CONCLUSION: We have developed Array2BIO – a web based tool for rapid comprehensive analysis of Affymetrix microarray expression data, which also allows users to link expression data to Dcode.org comparative genomics tools and integrates a system for translating co-expression data into mechanisms of gene co-regulation. Array2BIO is publicly available a
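The abstract above names the Benjamini-Hochberg and Bonferroni multiple testing corrections. The sketch below shows the textbook form of both procedures on a list of raw p-values. It is not Array2BIO's code, and the function names are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two. Benjamini-Hochberg controls the false discovery rate, which is usually the more practical target when thousands of genes are tested at once.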