    ROBUST CROSS-PLATFORM DISEASE PREDICTION USING GENE EXPRESSION MICROARRAYS

    Microarray technology has been used to predict patient prognosis and response to treatment, which is starting to have an impact on disease intervention and control and is significant for public health. However, the process has been hindered by a lack of adequate clinical validation. Since both microarray analyses and clinical trials are time and effort intensive, it is crucial to use accumulated inter-study data to validate information from individual studies. For over a decade, microarray data have been accumulated from different technologies. However, using data from one platform to build a model that robustly predicts the clinical characteristics of new data from another platform remains a challenge. Current cross-platform gene prediction methods use only genes common to both training and test datasets. There are two main drawbacks to that approach: model reconstruction and loss of information. As a result, the prediction accuracy of those methods is unstable. In this dissertation, a module-based prediction strategy was developed to overcome these drawbacks. In this method, groups of genes sharing similar expression patterns, rather than individual genes, were used as the basic elements of the model predictor. Such an approach borrows information from genes' similarity when genes are absent in test data. By overcoming the problems of missing genes and noise across platforms, this method yielded robust predictions independent of information from the test data. The performance of this method was evaluated using publicly available microarray data. K-means clustering was used to group genes sharing similar expression profiles into gene modules, and small modules were merged into their nearest neighbors. A univariate or multivariate feature selection procedure was applied, and a representative gene from each selected module was identified. A prediction model was then constructed from the representative genes of the selected gene modules. As a result, the prediction model is portable to any test study as long as some of the genes in each module exist in the test study. The newly developed method showed advantages over traditional methods in terms of prediction robustness to gene noise and gene mismatch issues in inter-study prediction.
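
    The idea that a module-level predictor stays usable when some genes are missing from the test platform can be illustrated with a minimal sketch. The abstract does not say how an absent gene is substituted, so averaging the module members that are present is an assumption made here for illustration, and the gene names, module names, and the `module_scores` helper are all hypothetical.

```python
from statistics import mean


def module_scores(sample, modules):
    """Score each module from whichever member genes the sample measured.

    sample:  dict mapping gene name -> expression value on the test platform
    modules: dict mapping module name -> list of member gene names
    Returns a dict of module name -> mean expression of the measured members.
    Modules with no measured member are left out.
    NOTE: averaging present members is an illustrative assumption, not the
    dissertation's stated procedure.
    """
    scores = {}
    for name, genes in modules.items():
        present = [sample[g] for g in genes if g in sample]
        if present:
            scores[name] = mean(present)
    return scores


# Hypothetical modules, e.g. obtained by k-means clustering on training data.
modules = {"M1": ["geneA", "geneB", "geneC"], "M2": ["geneD", "geneE"]}

# A test sample from another platform that lacks geneB and geneE.
test_sample = {"geneA": 2.0, "geneC": 4.0, "geneD": 1.5}

scores = module_scores(test_sample, modules)
print(scores)
```

    Both modules still receive a score because each has at least one measured member, which mirrors the portability condition stated in the abstract.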

    Systematic identification of functional plant modules through the integration of complementary data sources

    A major challenge is to unravel how genes interact and are regulated to exert specific biological functions. The integration of genome-wide functional genomics data, followed by the construction of gene networks, provides a powerful approach to identify functional gene modules. Large-scale expression data, functional gene annotations, experimental protein-protein interactions, and transcription factor-target interactions were integrated to delineate modules in Arabidopsis (Arabidopsis thaliana). The different experimental input data sets showed little overlap, demonstrating the advantage of combining multiple data types to study gene function and regulation. In the set of 1,563 modules covering 13,142 genes, most modules displayed strong coexpression, but functional and cis-regulatory coherence was less prevalent. Highly connected hub genes showed a significant enrichment toward embryo lethality and evidence for cross talk between different biological processes. Comparative analysis revealed that 58% of the modules showed conserved coexpression across multiple plants. Using module-based functional predictions, 5,562 genes were annotated, and an evaluation experiment disclosed that, based on 197 recently experimentally characterized genes, 38.1% of these functions could be inferred through the module context. Examples of confirmed genes of unknown function related to cell wall biogenesis, xylem and phloem pattern formation, cell cycle, hormone stimulus, and circadian rhythm highlight the potential to identify new gene functions. The module-based predictions offer new biological hypotheses for functionally unknown genes in Arabidopsis (1,701 genes) and six other plant species (43,621 genes). Furthermore, the inferred modules provide new insights into the conservation of coexpression and coregulation as well as a starting point for comparative functional annotation.

    Inferring a Transcriptional Regulatory Network from Gene Expression Data Using Nonlinear Manifold Embedding

    Transcriptional networks consist of multiple regulatory layers corresponding to the activity of global regulators, specialized repressors and activators of transcription, as well as proteins and enzymes shaping the DNA template. Such intrinsic multi-dimensionality makes uncovering connectivity patterns difficult and unreliable, and it calls for the adoption of methodologies commensurate with the underlying organization of the data source. Here we present a new computational method that predicts interactions between transcription factors and target genes using a compendium of microarray gene expression data and known interactions between genes and transcription factors. The proposed method, called Kernel Embedding of REgulatory Networks (KEREN), is based on the concept of gene-regulon association, and it captures hidden geometric patterns of the network via manifold embedding. We applied KEREN to reconstruct gene regulatory interactions in the model bacterium E. coli on a genome-wide scale. Our method not only yields accurate predictions of verifiable interactions, outperforming comparable methodologies on certain metrics, but also demonstrates the utility of a geometric approach to the analysis of high-dimensional biological data. We also describe the general application of kernel embedding techniques to some other function and network discovery algorithms.

    Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory

    Background: Large-scale sequencing of entire genomes has ushered in a new age in biology. One of the next grand challenges is to dissect the cellular networks consisting of many individual functional modules. Defining co-expression networks without ambiguity based on genome-wide microarray data is difficult, and current methods are neither robust nor consistent across different data sets. This is particularly problematic for little-understood organisms, since not much existing biological knowledge can be exploited for determining the threshold to differentiate true correlation from random noise. Random matrix theory (RMT), which has been widely and successfully used in physics, is a powerful approach to distinguish system-specific, non-random properties embedded in complex systems from random noise. Here, we have hypothesized that the universal predictions of RMT are also applicable to biological systems and that the correlation threshold can be determined by characterizing the correlation matrix of microarray profiles using random matrix theory. Results: Application of random matrix theory to microarray data of S. oneidensis, E. coli, yeast, A. thaliana, Drosophila, mouse and human indicates that there is a sharp transition in the nearest neighbour spacing distribution (NNSD) of the correlation matrix after gradually removing certain elements inside the matrix. Testing on an in silico modular model has demonstrated that this transition can be used to determine the correlation threshold for revealing modular co-expression networks. The co-expression network derived from yeast cell cycling microarray data is supported by gene annotation. The topological properties of the resulting co-expression network agree well with the general properties of biological networks. Computational evaluations have shown that the RMT approach is sensitive and robust. Furthermore, evaluation on sampled expression data of an in silico modular gene system has shown that under-sampled expressions do not affect the recovery of the gene co-expression network. Moreover, the cellular roles of 215 functionally unknown genes from yeast, E. coli and S. oneidensis are predicted by the gene co-expression networks using the guilt-by-association principle, many of which are supported by existing information or our experimental verification, further demonstrating the reliability of this approach for gene function prediction. Conclusion: Our rigorous analysis of gene expression microarray profiles using RMT has shown that the transition of the NNSD of the correlation matrix of microarray profiles provides a profound theoretical criterion to determine the correlation threshold for identifying gene co-expression networks.
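
    The last two steps the abstract describes, thresholding a correlation matrix into a co-expression network and then predicting functions by guilt-by-association, can be sketched as follows. This is only an illustration: the threshold below is a fixed, hypothetical value, whereas the paper derives it from the NNSD transition, which is not reproduced here. The gene names, profiles, annotations, and helper functions are likewise made up for the example.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(profiles, threshold):
    """Link every gene pair whose absolute correlation reaches the threshold."""
    genes = sorted(profiles)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(profiles[g], profiles[h])) >= threshold:
                edges.add((g, h))
    return edges


def predict_function(gene, edges, annotations):
    """Guilt-by-association: most common annotation among network neighbours."""
    neighbours = [b if a == gene else a for a, b in edges if gene in (a, b)]
    votes = Counter(annotations[n] for n in neighbours if n in annotations)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical expression profiles over four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],
    "g3": [4.0, 1.0, 3.0, 2.0],
    "unknown": [1.1, 2.0, 2.9, 4.2],
}
annotations = {"g1": "cell cycle", "g2": "cell cycle", "g3": "stress response"}

# Assumed fixed threshold; the paper determines this value via RMT instead.
edges = coexpression_edges(profiles, threshold=0.9)
prediction = predict_function("unknown", edges, annotations)
print(sorted(edges), prediction)
```

    With these made-up profiles, the unannotated gene is linked only to g1 and g2, so it inherits their shared annotation.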

    Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli.

    A significant obstacle in training predictive cell models is the lack of integrated data sources. We develop semi-supervised normalization pipelines and perform experimental characterization (growth, transcriptional, proteome) to create Ecomics, a consistent, quality-controlled multi-omics compendium for Escherichia coli with cohesive meta-data information. We then use this resource to train a multi-scale model that integrates four omics layers to predict genome-wide concentrations and growth dynamics. The genetic and environmental ontology reconstructed from the omics data is substantially different and complementary to the genetic and chemical ontologies. The integration of different layers confers an incremental increase in the prediction performance, as does the information about the known gene regulatory and protein-protein interactions. The predictive performance of the model ranges from 0.54 to 0.87 for the various omics layers, which far exceeds various baselines. This work provides an integrative framework of omics-driven predictive modelling that is broadly applicable to guide biological discovery.

    High Accordance in Prognosis Prediction of Colorectal Cancer across Independent Datasets by Multi-Gene Module Expression Profiles

    A considerable portion of patients with colorectal cancer have a high risk of disease recurrence after surgery. These patients can be identified by analyzing the expression profiles of signature genes in tumors. However, there is no consensus on which genes should be used, and the performance of a specific set of signature genes varies greatly across datasets, impeding their use in routine clinical practice. Instead of using individual genes, here we identified functional multi-gene modules with significant expression changes between recurrent and recurrence-free tumors and used them as signatures for predicting colorectal cancer recurrence in multiple datasets that were collected independently and profiled on different microarray platforms. The multi-gene modules we identified are significantly enriched for known genes and biological processes relevant to cancer development, including genes from the chemokine pathway. Most strikingly, they are significantly enriched for somatic mutations found in colorectal cancer. These results confirmed the functional relevance of these modules for colorectal cancer development. Further, these functional modules from different datasets overlapped significantly. Finally, we demonstrated that, by leveraging the above information about these modules, our module-based classifier avoided arbitrarily fitting the classifier function and screening the signatures using the training data, and achieved more consistent prognosis prediction across three independent datasets, which held even when using very small training sets of tumors.

    Transcriptome-based Gene Networks for Systems-level Analysis of Plant Gene Functions

    Present-day genomic technologies are evolving at an unprecedented rate, allowing interrogation of cellular activities with increasing breadth and depth. However, we know very little about how the genome functions and what the identified genes do. The lack of functional annotations of genes greatly limits the post-analytical interpretation of new high-throughput genomic datasets. For plant biologists, the problem is much more severe. Less than 50% of all the identified genes in the model plant Arabidopsis thaliana, and only about 20% of all genes in the crop model Oryza sativa, have some aspects of their functions assigned. Therefore, there is an urgent need to develop innovative methods to predict and expand on the currently available functional annotations of plant genes. With open access catching the ‘pulse’ of modern-day molecular research, an integration of the copious amount of transcriptome datasets allows rapid prediction of gene functions in specific biological contexts, which provides added evidence over traditional homology-based functional inference. The main goal of this dissertation was to develop data analysis strategies and tools broadly applicable in systems biology research. Two user-friendly interactive web applications are presented. The Rice Regulatory Network (RRN) captures an abiotic-stress conditioned gene regulatory network designed to facilitate the identification of transcription factor targets during induction of various environmental stresses. The Arabidopsis Seed Active Network (SANe) is a transcriptional regulatory network that encapsulates various aspects of seed formation, including embryogenesis, endosperm development and seed-coat formation. Further, an edge-set enrichment analysis algorithm is proposed that uses network density as a parameter to estimate the gain or loss in correlation of pathways between two conditionally independent coexpression networks.