    Elucidation of Directionality for Co-Expressed Genes: Predicting Intra-Operon Termination Sites

    We present a novel framework for inferring regulatory and sequence-level information from gene co-expression networks. The key idea of our methodology is the systematic integration of network inference and network topological analysis approaches for uncovering biological insights. We determine the gene co-expression network of Bacillus subtilis using Affymetrix GeneChip time series data and show how the inferred network topology can be linked to sequence-level information hard-wired in the organism's genome. We propose a systematic way of determining the correlation threshold at which two genes are considered co-expressed by using the clustering coefficient, and we expand the scope of the gene co-expression network by proposing the slope ratio metric as a means of incorporating directionality on the edges. We show through specific examples for B. subtilis that by incorporating expression level information in addition to the temporal expression patterns, we can uncover sequence-level biological insights. In particular, we are able to identify a number of cases where (i) the co-expressed genes are part of a single transcriptional unit or operon and (ii) the inferred directionality arises due to the presence of intra-operon transcription termination sites. Comment: 7 pages, 8 figures, accepted in Bioinformatics.
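
    The threshold idea in this abstract can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the code below only tabulates the average clustering coefficient of the thresholded network over a few candidate cut-offs. The gene names, the toy profiles, the use of absolute Pearson correlation, and the candidate thresholds are all assumptions made for illustration, and the slope ratio metric is not reproduced because the abstract does not define it.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_network(profiles, threshold):
    """Undirected graph with an edge wherever |r| >= threshold."""
    adj = {gene: set() for gene in profiles}
    for g, h in combinations(profiles, 2):
        if abs(pearson(profiles[g], profiles[h])) >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical time-series profiles, not data from the paper.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [1.5, 2.4, 3.6, 4.4],
    "geneD": [4.0, 1.0, 3.5, 0.5],
}
for threshold in (0.5, 0.7, 0.9):
    network = build_network(profiles, threshold)
    print(threshold, round(average_clustering(network), 3))
```

    Scanning the printed table for the point where the coefficient changes sharply is one plausible way to pick a cut-off; whether the authors use that criterion is not confirmed by the abstract.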

    Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data

    Three-way data structures, characterized by three entities (the units, the variables and the occasions), are frequent in biological studies. In RNA sequencing, three-way data structures are obtained when high-throughput transcriptome sequencing data are collected for n genes across p conditions at r occasions. Matrix-variate distributions offer a natural way to model three-way data, and mixtures of matrix-variate distributions can be used to cluster three-way data. Clustering of gene expression data is carried out as a means of discovering gene co-expression networks. In this work, a mixture of matrix-variate Poisson-log normal distributions is proposed for clustering read counts from RNA sequencing. By considering the matrix-variate structure, full information on the conditions and occasions of the RNA sequencing dataset is simultaneously considered, and the number of covariance parameters to be estimated is reduced. A Markov chain Monte Carlo expectation-maximization algorithm is used for parameter estimation and information criteria are used for model selection. The models are applied to both real and simulated data, giving favourable clustering results.

    Evaluation of statistical correlation and validation methods for construction of gene co-expression networks

    High-throughput technologies such as microarrays have led to the rapid accumulation of large scale genomic data providing opportunities to systematically infer gene function and co-expression networks. Typical steps of co-expression network analysis using microarray data consist of estimation of pair-wise gene co-expression using some similarity measure, construction of co-expression networks, identification of clusters of co-expressed genes and post-cluster analyses such as cluster validation. This dissertation is primarily concerned with development and evaluation of approaches for the first and the last steps – estimation of gene co-expression matrices and validation of network clusters. Since clustering methods are not a focus, only a paraclique clustering algorithm will be used in this evaluation. First, a novel Bayesian approach is presented for combining the Pearson correlation with prior biological information from Gene Ontology, yielding a biologically relevant estimate of gene co-expression. The addition of biological information by the Bayesian approach reduced noise in the paraclique gene clusters as indicated by high silhouette and increased homogeneity of clusters in terms of molecular function. Standard similarity measures including correlation coefficients from Pearson, Spearman, Kendall’s Tau, Shrinkage, Partial, and Mutual information, and Euclidean and Manhattan distance measures were evaluated. Based on quality metrics such as cluster homogeneity and stability with respect to ontological categories, clusters resulting from partial correlation and mutual information were more biologically relevant than those from any other correlation measures. Second, statistical quality of clusters was evaluated using approaches based on permutation tests and Mantel correlation to identify significant and informative clusters that capture most of the covariance in the dataset. 
Third, the utility of statistical contrasts was studied for classification of temporal patterns of gene expression. Specifically, polynomial and Helmert contrast analyses were shown to provide a means of labeling the co-expressed gene sets because they showed similar temporal profiles.
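
    Two of the rank-based similarity measures named in this abstract, Spearman and Kendall's Tau, can be sketched in a few lines. This is an illustrative implementation, not the dissertation's code: the Spearman function uses the textbook shortcut formula, which is only valid when there are no tied values, and the Kendall function is the tau-a variant, which makes no correction for ties. The function names and sample profiles are hypothetical.

```python
def rank_positions(values):
    """1-based ranks of the values; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_no_ties(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    rx, ry = rank_positions(x), rank_positions(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical expression profiles for two genes over five samples.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [2.0, 1.0, 4.0, 3.0, 5.0]
print(spearman_no_ties(gene_1, gene_2), kendall_tau_a(gene_1, gene_2))
```

    Both measures depend only on the ordering of the values, so any strictly increasing transformation of a profile leaves them unchanged, unlike the Pearson coefficient.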

    Discovering Functional Modules by Clustering Gene Co-expression Networks

    Identification of groups of functionally related genes from high-throughput gene expression data is an important step towards elucidating gene functions at a global scale. Most existing approaches treat gene expression data as points in a metric space, and apply conventional clustering algorithms to identify sets of genes that are close to each other in the metric space. However, they usually ignore the topology of the underlying biological networks. In this paper, we propose a network-based clustering method that is biologically more realistic. Given a gene expression data set, we apply a rank-based transformation to obtain a sparse co-expression network, and use a novel spectral clustering algorithm to identify natural community structures in the network, which correspond to gene functional modules. We have tested the method on two large-scale gene expression data sets in yeast and Arabidopsis, respectively. The results show that the clusters identified by our method on these datasets are functionally richer and more coherent than the clusters from the standard k-means clustering algorithm.
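
    One simple reading of a rank-based transformation is sketched below: each gene is linked to the k genes whose profiles correlate most strongly with its own, so that sparsity is controlled by rank rather than by a global correlation cut-off. The abstract does not spell out the transformation, so this top-k rule, the choice of Pearson correlation, and all names and data here are assumptions. The spectral clustering step is not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_based_network(profiles, k):
    """Link every gene to its k most correlated genes; return undirected edges."""
    genes = list(profiles)
    edges = set()
    for g in genes:
        others = [h for h in genes if h != g]
        # Highest correlation first, so the first k entries are the top-ranked partners.
        others.sort(key=lambda h: pearson(profiles[g], profiles[h]), reverse=True)
        for h in others[:k]:
            edges.add(frozenset((g, h)))
    return edges


# Hypothetical profiles: two rising genes and two falling genes.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, 5.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [5.0, 3.0, 2.0, 1.0],
}
for edge in sorted(tuple(sorted(e)) for e in rank_based_network(profiles, k=1)):
    print(edge)
```

    Because every gene contributes at most k edges, the network holds at most n*k edges for n genes regardless of how the raw correlations are distributed.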

    Computational methods for analysis and modeling of time-course gene expression data

    Genes encode proteins, some of which in turn regulate other genes. Such interactions make up gene regulatory relationships or (dynamic) gene regulatory networks. With advances in the measurement technology for gene expression and in genome sequencing, it has become possible to measure the expression level of thousands of genes simultaneously in a cell at a series of time points over a specific biological process. Such time-course gene expression data may provide a snapshot of most (if not all) of the interesting genes and may lead to a better understanding of gene regulatory relationships and networks. However, inferring either gene regulatory relationships or networks puts a high demand on powerful computational methods that are capable of sufficiently mining the large quantities of time-course gene expression data, while reducing the complexity of the data to make them comprehensible. This dissertation presents several computational methods for inferring gene regulatory relationships and gene regulatory networks from time-course gene expression. These methods are the result of the author’s doctoral study. Cluster analysis plays an important role in inferring gene regulatory relationships, for example, uncovering new regulons (sets of co-regulated genes) and their putative cis-regulatory elements. Two dynamic model-based clustering methods, namely the Markov chain model (MCM)-based clustering and the autoregressive model (ARM)-based clustering, are developed for time-course gene expression data. However, gene regulatory relationships based on cluster analysis are static and thus do not describe the dynamic evolution of gene expression over an observation period. The gene regulatory network is believed to be a time-varying system. Consequently, a state-space model for dynamic gene regulatory networks from time-course gene expression data is developed.
To account for the complex time-delayed relationships in gene regulatory networks, the state-space model is extended to include time delays. Finally, a method based on genetic algorithms is developed to infer the time-delayed relationships in gene regulatory networks. Validations of all these developed methods are based on the experimental data available from well-cited public databases.
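
    As a small illustration of the autoregressive idea mentioned in this abstract, the sketch below fits a first-order model x_t = a * x_(t-1) + noise to a single expression profile by least squares. The abstract does not give the model order, the estimation procedure, or how fitted models are used for clustering, so the AR(1) form, the zero-mean assumption, and the function name are all hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of a in x_t = a * x_(t-1) + noise (zero-mean series assumed)."""
    previous = series[:-1]
    current = series[1:]
    numerator = sum(c * p for c, p in zip(current, previous))
    denominator = sum(p * p for p in previous)
    return numerator / denominator


# Hypothetical profile that halves at every time point.
profile = [1.0, 0.5, 0.25, 0.125]
print(fit_ar1(profile))
```

    Genes whose profiles yield similar fitted coefficients follow similar dynamics under this model, which is the intuition behind grouping them in a model-based clustering scheme.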

    gViz, a novel tool for the visualization of co-expression networks

    Background: The quantity of microarray data available on the Internet has grown dramatically over the past years and now represents millions of Euros worth of underused information. One way to use this data is through co-expression analysis. To avoid a certain amount of bias, such data must often be analyzed at the genome scale, for example by network representation. The identification of co-expression networks is an important means to unravel gene to gene interactions and the underlying functional relationship between them. However, it is very difficult to explore and analyze a network of such dimensions. Several programs (Cytoscape, yEd) have already been developed for network analysis; however, to our knowledge, there are no available GraphML compatible programs.
    Findings: We designed and developed gViz, a GraphML network visualization and exploration tool. gViz is built on clustering coefficient-based algorithms and is a novel tool to visualize and manipulate networks of co-expression interactions among a selection of probesets (each representing a single gene or transcript), based on a set of microarray co-expression data stored as an adjacency matrix.
    Conclusions: We present here gViz, a software tool designed to visualize and explore large GraphML networks, combining network theory, biological annotation data, microarray data analysis and advanced graphical features.
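
    To make the link between an adjacency matrix and GraphML concrete, the following sketch serializes a small co-expression matrix as a GraphML document using only the Python standard library. It is not gViz code and says nothing about how gViz reads or writes files; the probeset labels, the matrix values, the threshold, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def adjacency_to_graphml(labels, matrix, threshold):
    """Return GraphML text with one edge per pair whose weight is >= threshold."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare an edge attribute named "weight" so each edge can carry its score.
    ET.SubElement(root, "key", {
        "id": "w", "for": "edge", "attr.name": "weight", "attr.type": "double",
    })
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")
    for label in labels:
        ET.SubElement(graph, "node", id=label)
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle only: undirected
            if matrix[i][j] >= threshold:
                edge = ET.SubElement(graph, "edge", source=labels[i], target=labels[j])
                data = ET.SubElement(edge, "data", key="w")
                data.text = repr(matrix[i][j])
    return ET.tostring(root, encoding="unicode")


# Hypothetical probesets and co-expression scores.
labels = ["probe_1", "probe_2", "probe_3"]
matrix = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
]
print(adjacency_to_graphml(labels, matrix, threshold=0.75))
```

    Only the upper triangle of the matrix is read, which assumes the matrix is symmetric, as a correlation-based co-expression matrix would be.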

    A Computational Framework for Learning from Complex Data: Formulations, Algorithms, and Applications

    Many real-world processes change dynamically over time. As a consequence, the observed complex data generated by these processes also evolve smoothly. For example, in computational biology, the expression data matrices are evolving, since gene expression controls are deployed sequentially during development in many biological processes. Investigations into the spatial and temporal gene expression dynamics are essential for understanding the regulatory biology governing development. In this dissertation, I mainly focus on two types of complex data: genome-wide spatial gene expression patterns in the model organism fruit fly and Allen Brain Atlas mouse brain data. I provide a framework to explore spatiotemporal regulation of gene expression during development. I develop an evolutionary co-clustering formulation to identify co-expressed domains and the associated genes simultaneously over different temporal stages using a mesh-generation pipeline. I also propose to employ deep convolutional neural networks as a multi-layer feature extractor to generate generic representations for gene expression pattern in situ hybridization (ISH) images. Furthermore, I employ the multi-task learning method to fine-tune the pre-trained models with labeled ISH images. My proposed computational methods are evaluated using synthetic data sets and real biological data sets, including the gene expression data from the fruit fly BDGP data sets and the Allen Developing Mouse Brain Atlas, in comparison with existing baseline methods. Experimental results indicate that the proposed representations, formulations, and methods are efficient and effective in annotating and analyzing the large-scale biological data sets.

    A systematic comparison of data- and knowledge-driven approaches to disease subtype discovery

    Typical clustering analysis for large-scale genomics data combines two unsupervised learning techniques: dimensionality reduction and clustering (DR-CL) methods. It has been demonstrated that transforming gene expression to pathway-level information can improve the robustness and interpretability of disease grouping results. This approach, referred to as the biological knowledge-driven clustering (BK-CL) approach, is often neglected, due to a lack of tools enabling systematic comparisons with more established DR-based methods. Moreover, classic clustering metrics based on group separability tend to favor the DR-CL paradigm, which may increase the risk of identifying less actionable disease subtypes that have ambiguous biological and clinical explanations. Hence, there is a need for developing metrics that assess biological and clinical relevance. To facilitate the systematic analysis of BK-CL methods, we propose a computational protocol for quantitative analysis of clustering results derived from both DR-CL and BK-CL methods. Moreover, we propose a new BK-CL method that combines prior knowledge of disease-relevant genes, network diffusion algorithms and gene set enrichment analysis to generate robust pathway-level information. Benchmarking studies were conducted to compare the grouping results from different DR-CL and BK-CL approaches with respect to standard clustering evaluation metrics, concordance with known subtypes, association with clinical outcomes and disease modules in co-expression networks of genes. No single approach dominated every metric, showing the importance of multi-objective evaluation in clustering analysis. However, we demonstrated that, on gene expression data sets derived from TCGA samples, the BK-CL approach can find groupings that provide significant prognostic value in both breast and prostate cancers. Peer reviewed.

    DeepReGraph co-clusters temporal gene expression and cis-regulatory elements through heterogeneous graph representation learning

    This work presents DeepReGraph, a novel method for co-clustering genes and cis-regulatory elements (CREs) into candidate regulatory networks. Gene expression data, as well as data from three CRE activity markers from a publicly available dataset of mouse fetal heart tissue, were used as a proof of concept for DeepReGraph. In this study we used open chromatin accessibility from ATAC-seq experiments, as well as H3K27ac and H3K27me3 histone marks, as CRE activity markers. However, this method can be executed with other sets of markers. We modelled all data sources as a heterogeneous graph and adapted a state-of-the-art representation learning algorithm to produce a low-dimensional and easy-to-cluster embedding of genes and CREs. Deep graph auto-encoders and an adaptive-sparsity generative model are the algorithmic core of DeepReGraph. The main contribution of our work is the design of proper combination rules for the heterogeneous gene expression and CRE activity data and the computational encoding of well-known gene expression regulatory mechanisms into a suitable objective function for graph embedding. We showed that the co-clusters of genes and CREs in the final embedding shed light on developmental regulatory mechanisms in mouse fetal heart tissue. Such clustering could not be achieved by using only gene expression data. Functional enrichment analysis shows that the genes in the co-clusters are involved in distinct biological processes. The enriched transcription factor binding sites in CREs prioritize the candidate transcription factors that drive the temporal changes in gene expression. Consequently, we conclude that DeepReGraph could foster hypothesis-driven tissue development research from high-throughput expression and epigenomic data. Full source code and data are available on the DeepReGraph GitHub project.