9 research outputs found

    COLOMBOS v2.0 : an ever expanding collection of bacterial expression compendia

    The COLOMBOS database (http://www.colombos.net) features comprehensive organism-specific cross-platform gene expression compendia of several bacterial model organisms and is supported by a fully interactive web portal and an extensive web API. COLOMBOS was originally published in PLoS One, and COLOMBOS v2.0 includes both an update of the expression data, by expanding the previously available compendia and by adding compendia for several new species, and an update of the surrounding functionality, with improved search and visualization options and novel tools for programmatic access to the database. The scope of the database has also been extended to incorporate RNA-seq data in our compendia by a dedicated analysis pipeline. We demonstrate the validity and robustness of this approach by comparing the same RNA samples measured in parallel using both microarrays and RNA-seq. As far as we know, COLOMBOS currently hosts the largest homogenized gene expression compendia available for seven bacterial model organisms

    Coordinated functional divergence of genes after genome duplication in Arabidopsis thaliana

    Gene and genome duplications have been rampant during the evolution of flowering plants. Unlike small-scale gene duplications, whole-genome duplications (WGDs) copy entire pathways or networks, and as such create the unique situation in which such duplicated pathways or networks could evolve novel functionality through the coordinated sub- or neofunctionalization of their constituent genes. Here, we describe a remarkable case of coordinated gene expression divergence following WGDs in Arabidopsis thaliana. We identified a set of 92 homoeologous gene pairs that all show a similar pattern of tissue-specific gene expression divergence following WGD, with one homoeolog showing predominant expression in aerial tissues and the other homoeolog showing biased expression in tip-growth tissues. We provide evidence that this pattern of gene expression divergence seems to involve genes with a role in cell polarity and that likely function in the maintenance of cell wall integrity. Following WGD, many of these duplicated genes evolved separate functions through subfunctionalization in growth/development and stress response. Uncoupling these processes through genome duplications likely provided important adaptations with respect to growth and morphogenesis and defense against biotic and abiotic stress

    Gene differential co-expression analysis of male infertility patients based on statistical and machine learning methods

    Male infertility has long been one of the important factors affecting infertility in couples of childbearing age. Factors contributing to male infertility include lifestyle habits, hereditary factors, etc. Identifying the genetic causes of male infertility can help us understand its biology and can inform genetic diagnostic testing and the choice of clinical treatment. While current research has made significant progress on the genes that cause sperm defects in men, genetic studies of sperm content defects are still lacking. This article is based on a dataset of X-chromosome gene expression data from patients with azoospermia and with mild and severe oligospermia. Because patients differ in disease severity and possibly in genetic cause, common classical clustering methods such as k-means and hierarchical clustering cannot effectively identify samples (that is, they cannot cluster samples and features simultaneously). In this paper, we use machine learning and various statistical methods, such as the hypergeometric distribution, Gibbs sampling and the Fisher test, together with the gene interaction network, for cluster analysis of gene expression data of male infertility patients; this approach has certain advantages over existing methods. The clusters were identified by differential co-expression analysis of gene expression data in male infertility patients, and the clusters recognized by the model were analyzed by multiple gene enrichment methods, showing different degrees of enrichment in pathways related to various enzyme activities, cancer, viruses, ATP and ADP production, and others. Because this paper is an unsupervised analysis of genetic factors in male infertility patients, we also constructed a simulated data set in which the clustering results are known, which can be used to measure how well the model recognizes clusters. The comparison shows that the proposed model achieves better identification performance

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are only active in specific conditions and are thus usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly because they lack (i) a consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) the Gaussian mixture distribution and an information-divergency objective function to capture shared transcriptional regulation signals among a set of genes; (iii) a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) a statistical framework to evaluate the significance of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an 82% average improved efficiency compared to the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering for biological data and of more advanced applications for biomedical data. It will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency

    Mining large collections of gene expression data to elucidate transcriptional regulation of biological processes

    A vast amount of gene expression data is available to biological researchers. As of October 2010, the GEO database has 45,777 chips of publicly available gene expression profiling data from the Affymetrix (HGU133v2) GeneChip platform, representing 2.5 billion numerical measurements. Given this wealth of data, 'meta-analysis' methods allowing inferences to be made from combinations of samples from different experiments are critically important. This thesis explores the application of localized pattern-mining approaches, as exemplified by biclustering, for large-scale gene expression analysis. Biclustering methods are particularly attractive for the analysis of large compendia of gene expression data as they allow the extraction of relationships that occur only across subsets of genes and samples. Standard correlation methods, by contrast, assume that a single correlation relationship between two genes holds across all samples in the data. There are a number of existing biclustering methods, but as these did not prove suitable for large-scale analysis, a novel method named 'IslandCluster' was developed. This method provided a framework for investigating the results of different approaches to biclustering meta-analysis. The biclustering methods used in this work involve preprocessing of gene expression data into a unified scale in order to assess the significance of expression patterns. A novel discretisation approach is shown to identify distinct classes of genes' expression values more appropriately than approaches reported in the literature. A Gene Expression State Transformation ('GESTr') is introduced as the first reported modelling of the biological state of expression on a unified scale and is shown to facilitate effective meta-analysis. Localised co-dependency analysis is introduced, a paradigm for identifying transcriptional relationships from gene expression data. Tools implementing this analysis were developed and used to analyse specificity of transcriptional relationships, to distinguish related subsets within a set of transcription factor (TF) targets and to tease apart combinatorial regulation of a set of targets by multiple TFs. The state of pluripotency, from which a mammalian cell has the potential to differentiate into any cell from any of the three adult germ layers, is maintained by forced expression of Nanog and may be induced from a non-pluripotent state by the expression of Oct4, Sox2, Klf4 and cMyc. Analysis of cMyc regulatory targets shed light on a recent proposition that cMyc induces an 'embryonic stem cell like' transcriptional signature outside embryonic stem (ES) cells, revealing a cMyc-responsive subset of the signature and identifying ES cell expressed targets with evidence of broad cMyc-induction. Regulatory targets through which cMyc, Oct4, Sox2 and Nanog may maintain or induce pluripotency were identified, offering insight into transcriptional mechanisms involved in the control of pluripotency and demonstrating the utility of the novel analysis approaches presented in this work

    Transcriptome-based Gene Networks for Systems-level Analysis of Plant Gene Functions

    Present-day genomic technologies are evolving at an unprecedented rate, allowing interrogation of cellular activities with increasing breadth and depth. However, we know very little about how the genome functions and what the identified genes do. The lack of functional annotations of genes greatly limits the post-analytical interpretation of new high-throughput genomic datasets. For plant biologists, the problem is much more severe. Less than 50% of all the identified genes in the model plant Arabidopsis thaliana, and only about 20% of all genes in the crop model Oryza sativa, have some aspects of their functions assigned. Therefore, there is an urgent need to develop innovative methods to predict and expand on the currently available functional annotations of plant genes. With open access catching the ‘pulse’ of modern-day molecular research, an integration of the copious amount of transcriptome datasets allows rapid prediction of gene functions in specific biological contexts, which provides added evidence over traditional homology-based functional inference. The main goal of this dissertation was to develop data analysis strategies and tools broadly applicable in systems biology research. Two user-friendly interactive web applications are presented: The Rice Regulatory Network (RRN) captures an abiotic-stress conditioned gene regulatory network designed to facilitate the identification of transcription factor targets during induction of various environmental stresses. The Arabidopsis Seed Active Network (SANe) is a transcriptional regulatory network that encapsulates various aspects of seed formation, including embryogenesis, endosperm development and seed-coat formation. Further, an edge-set enrichment analysis algorithm is proposed that uses network density as a parameter to estimate the gain or loss in correlation of pathways between two conditionally independent coexpression networks

    An ensemble biclustering approach for querying gene expression compendia with experimental lists

    Motivation: Query-based biclustering techniques allow interrogating a gene expression compendium with a given gene or gene list. They do so by searching for genes in the compendium that have a profile close to the average expression profile of the genes in this query-list. As it can often not be guaranteed that the genes in a long query-list will all be mutually coexpressed, it is advisable to use each gene separately as a query. This approach, however, leaves the user with a tedious post-processing of partially redundant biclustering results. The fact that for each query-gene multiple parameter settings need to be tested in order to detect the 'most optimal bicluster size' adds to the redundancy problem. Results: To aid with this post-processing, we developed an ensemble approach to be used in combination with query-based biclustering. The method relies on a specifically designed consensus matrix in which the biclustering outcomes for multiple query-genes and for different possible parameter settings are merged in a statistically robust way. Clustering of this matrix results in distinct, non-redundant consensus biclusters that maximally reflect the information contained within the original query-based biclustering results. The usefulness of the developed approach is illustrated on a biological case study in Escherichia coli

    Environmental tuning of the genetic control of seed performance : a systems genetics approach

    The environmental conditions under which plants grow affect the quality of seeds produced in a genotype-dependent manner. In nature, genotype-by-environment interactions are often observed; however, little is known about the underlying mechanisms. The combined use of genetic tools and omics data can help to explore the influence of the environment on the genetic control of seed performance. The research presented in this thesis explores genotype-by-environment interaction at the phenotypic level, with an effort to connect phenotypic changes to changes observed at the metabolome and transcriptome levels in a systems genetics approach. For this purpose, an Arabidopsis thaliana recombinant inbred line population derived from the cross between the parental lines Bay-0 and Sha was grown under different conditions, namely standard, high light, high temperature and low phosphate conditions, from flowering until seed harvest. The germination properties of the seeds produced under the different environments were investigated, and the seed germination QTLs identified displayed large QTL-by-environment interaction. Quantitative changes in primary metabolites in response to the maternal environment were investigated by GC-TOF-MS. Further, mQTLs under the different environments were identified. RNA-seq of the same lines made it possible to explore changes in gene expression across genotypes and environments as well as differences in the eQTL landscape under the different maternal environments. The findings of this research show that seed quality is largely influenced by genotype-by-environment interactions, which result in large changes at the molecular level. The data generated provide many opportunities for further study.