696 research outputs found

    Advancing systems biology of yeast through machine learning and comparative genomics

    Synthetic biology has played a pivotal role in the production of high-value commodities, pharmaceuticals, and bulk chemicals. Fueled by breakthroughs in synthetic biology and metabolic engineering, Saccharomyces cerevisiae and various other yeasts (such as Yarrowia lipolytica and Pichia pastoris) have proven to be promising microbial cell factories and are frequently used in scientific studies. However, the cellular metabolism and physiological properties of most yeast species have not been characterized in detail. To address these knowledge gaps, this thesis leverages the large amounts of data available for yeast species and uses state-of-the-art machine learning techniques and comparative genomic analysis to gain deeper insight into yeast traits and metabolism. In this thesis, machine learning was applied to several unresolved biological problems in yeasts, i.e., gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality based on sequence features and evolutionary features. It was demonstrated that essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a high-quality deep learning model, DLKcat, was developed to predict kcat values by combining a graph neural network for substrates and a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast/fungi species, enzyme-constrained models were reconstructed and used to further elucidate cellular metabolism on a large scale. Lastly, a random forest algorithm was used to analyze feature importance for protein production; post-translational modifications (PTMs) were found to have a relatively higher impact on protein production than amino acid composition.
In comparative genomics, a comprehensive toolbox, HGTphyloDetect, was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on some yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolution analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms for substrate utilization were further probed across a large number of yeast species.

    The topology of the bacterial co-conserved protein network and its implications for predicting protein function

    BACKGROUND: Protein-protein interaction networks are most often generated from physical protein-protein interaction data. Co-conservation, also known as phylogenetic profiling, is an alternative source of information for generating protein interaction networks. Co-conservation methods generate interaction networks among proteins that are gained or lost together through evolution. Co-conservation is a particularly useful technique in compact bacterial genomes. Prior studies in yeast suggest that the topology of protein-protein interaction networks generated from physical interaction assays can offer important insight into protein function. Here, we hypothesize that in bacteria, the topology of protein interaction networks derived from co-conservation information could similarly improve methods for predicting protein function. Since the topology of bacterial co-conservation protein-protein interaction networks has not previously been studied in depth, we first perform such an analysis for co-conservation networks in E. coli K12. Next, we demonstrate one way in which network connectivity measures and global and local function distribution can be exploited to predict protein function for previously uncharacterized proteins. RESULTS: Our results showed that, like most biological networks, our bacterial co-conserved protein-protein interaction networks had scale-free topologies. Our results indicated that some properties of the physical yeast interaction network hold in our bacterial co-conservation networks, such as high connectivity for essential proteins. However, the high connectivity among protein complexes in the yeast physical network was not seen in the co-conservation network that uses all bacteria as the reference set. We found that the distribution of node connectivity varied by functional category and could be informative for function prediction.
By integrating functional information from different annotation sources and using the network topology, we were able to infer function for uncharacterized proteins. CONCLUSION: Interaction networks based on co-conservation can contain information distinct from networks based on physical or other interaction types. Our study has shown that co-conservation-based networks exhibit a scale-free topology, as expected for biological networks. We also revealed ways that connectivity in our networks can be informative for the functional characterization of proteins.
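
The co-conservation idea in this abstract can be illustrated with a minimal sketch. It assumes each protein is represented by a presence/absence profile across reference genomes and that two proteins are linked when their profiles are sufficiently similar. The Jaccard measure, the 0.75 threshold, the function names, and the toy profiles are all illustrative assumptions, not the authors' method or data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def co_conservation_edges(profiles, threshold=0.75):
    """Link protein pairs whose profiles are at least `threshold` similar."""
    return [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if jaccard(profiles[p], profiles[q]) >= threshold
    ]


def degrees(profiles, edges):
    """Node connectivity (number of co-conservation partners) per protein."""
    deg = {p: 0 for p in profiles}
    for p, q in edges:
        deg[p] += 1
        deg[q] += 1
    return deg


# Hypothetical toy data: 1 means an ortholog is present in that reference genome.
profiles = {
    "protA": (1, 1, 0, 1, 0, 1),
    "protB": (1, 1, 0, 1, 0, 1),
    "protC": (1, 1, 0, 1, 1, 1),
    "protD": (0, 0, 1, 0, 1, 0),
}
edges = co_conservation_edges(profiles)
node_degree = degrees(profiles, edges)
```

In this toy network protA, protB, and protC are gained or lost together and form a connected group, while protD has no partners. The per-node degree is the kind of connectivity measure the abstract relates to functional category.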

    Comparative analysis of gene expression associations between mammalian hosts and Plasmodium

    Cross-species interactions help us understand disease mechanisms and find targets for therapy. Gene co-expression, measured by mRNA abundance, can identify host-pathogen interactions. The RNA-sequencing of host and pathogen is termed "dual RNA-sequencing". Malaria is one of the most studied eukaryotic parasitic diseases, making an abundance of RNA-seq data sets publicly available. Authors either perform dual RNA-seq to study the host and parasite simultaneously or acquire contaminant sequencing reads from the non-target organism. I performed a meta-analysis using these two kinds of RNA-seq studies to infer host-parasite interactions using correlated gene expression. I included studies of Homo sapiens, Mus musculus and Macaca mulatta as hosts and their corresponding Plasmodium parasites. I used single-copy orthologous genes to generate a repertoire of interactions in human malaria and in these model systems. I found 63 malaria RNA-seq studies. I concatenated sequencing runs from Plasmodium stage-specific studies and reduced the number of interactions from a potential 56 million to a smaller, more relevant set. Centrality in the blood stage networks was able to explain Plasmodium gene essentiality. The network from the concatenated data predicted gene essentiality better than the individual studies, indicating a benefit of the meta-analysis. Immune marker genes for neutrophils and monocytes were over-represented, suggesting an abundance of phagocytic and respiratory burst-related responses. The liver stage analysis revealed linked host and parasite processes from early through late developmental stages. I found linked host and parasite processes that are common to the two stages, e.g. parasite cell gliding and invasion and host response to hypoxia and immune response. I showed that existing data can be explored for new information.
This principle can be applied to other diseases to understand mechanisms and therapeutic targets.

    Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks

    BACKGROUND: Genes and proteins are organized into functional modular networks in which the network context of a gene or protein has implications for cellular function. Highly connected hub proteins, largely responsible for maintaining network connectivity, have been found to be much more likely to be essential for yeast survival. RESULTS: Here we investigate the properties of weighted gene co-expression networks formed from multiple microarray datasets. The constructed networks approximate scale-free topology, but this is not universal across all datasets. We show strong positive correlations between gene connectivity within the whole network and gene essentiality as well as gene sequence conservation. We demonstrate that the modular structure of the networks is preserved and that, within some of these modules, connectivity correlates strongly with essentiality or with conservation, particularly in modules containing larger numbers of essential genes. CONCLUSION: Application of these techniques can allow a finer-scale prediction of relative gene importance for a particular process within a group of similarly expressed genes.
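
As a rough illustration of relating connectivity to essentiality, the sketch below computes a weighted connectivity for each gene as the sum of absolute expression correlations raised to a power, then correlates that connectivity with a 0/1 essentiality flag. The soft-threshold form, the exponent `beta=6`, the function names, and the toy expression values are assumptions for illustration. The abstract does not specify how its network weights were defined.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def weighted_connectivity(expr, beta=6):
    """Sum of |correlation|**beta between each gene and every other gene."""
    genes = list(expr)
    return {
        g: sum(abs(pearson(expr[g], expr[h])) ** beta for h in genes if h != g)
        for g in genes
    }


# Hypothetical expression profiles across five conditions.
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 4, 3, 2, 1],
    "g4": [1, -1, 0, -1, 1],
}
# Hypothetical essentiality labels (1 = essential).
essential = {"g1": 1, "g2": 1, "g3": 1, "g4": 0}

k = weighted_connectivity(expr)
genes = list(expr)
r = pearson([k[g] for g in genes], [essential[g] for g in genes])
```

Raising correlations to a power down-weights weak associations, so connectivity is dominated by strongly co-expressed partners. The same calculation restricted to the genes of one module would give the within-module connectivity the abstract refers to.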

    Yeast metabolic innovations emerged via expanded metabolic network and gene positive selection

    Yeasts are known to have versatile metabolic traits, but how these traits evolved has not been elucidated systematically. We performed integrative evolution analysis to investigate how genomic evolution determines trait generation by reconstructing genome-scale metabolic models (GEMs) for 332 yeasts. These GEMs could comprehensively characterize trait diversity and predict enzyme functionality, thereby signifying that sequence-level evolution has shaped reaction networks towards new metabolic functions. Strikingly, using GEMs, we can mechanistically map different evolutionary events, e.g. horizontal gene transfer and gene duplication, onto relevant subpathways to explain metabolic plasticity. This demonstrates that gene family expansion and enzyme promiscuity are prominent mechanisms for metabolic trait gains, while GEM simulations reveal that additional factors, such as gene loss from distant pathways, contribute to trait losses. Furthermore, our analysis could pinpoint specific genes and pathways that have been under positive selection and are relevant to the formation of complex metabolic traits, i.e. thermotolerance and the Crabtree effect. Our findings illustrate how multidimensional evolution in both metabolic network structure and individual enzymes drives phenotypic variations.

    Functional and evolutionary implications of in silico gene deletions

    Understanding how genetic modifications, individually or in combination, affect organismal fitness or other phenotypes is a challenge common to several areas of biology, including human health and genetics, metabolic engineering, and evolutionary biology. The importance of a gene can be quantified by measuring the phenotypic impact of its associated genetic perturbations "here and now", e.g. the growth rate of a mutant microbe. However, each gene also carries a historical record of its cumulative importance over millions of years of natural selection, in the form of its degree of sequence conservation along phylogenetic branches. This thesis focuses on whether and how the phenotypic and evolutionary importance of genes are related to each other. Towards this goal, I developed a new approach for characterizing the phenotypic consequences of genetic modifications in genome-scale biochemical networks using constraint-based computational models of metabolism. In particular, I investigated the impact of gene loss events on fitness in the model organism Saccharomyces cerevisiae, and found that my new metric for estimating the cost of gene deletion correlates with gene evolutionary rate. I found that previous failures to uncover this correlation using similar techniques may have been the result of an incorrect assumption about how isoenzyme deletions affect the reaction they catalyze. I next hypothesized that the improvement my metric showed in predicting the cost of isoenzyme loss could translate into an improved capacity to predict the impact of pairs of gene deletions involving isoenzymes. Studies of such pairwise genetic perturbations are important, because the extent to which a genetic perturbation modifies any given phenotype often depends on the genetic background in which it is made. This lack of independence within sets of perturbations is termed epistasis.
My results showed that, indeed, the new metric displays an increased capacity to predict epistatic interactions between pairs of genes. In addition to shedding light on the relationship between the functional and evolutionary importance of genes, further developments of this approach may lead to better prediction of gene knockout phenotypes, with applications ranging from metabolic engineering to the search for gene targets for therapeutic applications.

    Network analyses of proteome evolution and diversity

    The mapping of biomolecular interactions reveals that the function of most biological components depends on a web of interrelations with other cellular components, stressing the need for a systems-level view of biological functions. In this work, I explore ways in which the integration of network and genomic information from different organizational levels can lead to a better understanding of cellular systems and components. First, studying yeast, I show that the evolutionary properties of target genes constitute the dominant determinant of transcription factor (TF) evolutionary rate and that this evolutionary modularity is limited to activating regulatory relationships. I also show that targets of fast-evolving TFs show greater evolutionary expression changes and are enriched for niche-specific functions and other TFs. This work highlights the importance of trans-regulatory network evolution in species-specific gene expression and network adaptation. Next, I show that genes either lost or gained across fungal evolution are enriched in TFs and have very different network and genomic properties than universally conserved genes, including, in sharp contrast to other networks, a greater number of transcriptional regulators. Placing genes in the context of their evolutionary life-cycle reveals principles of network integration of gained genes and evidence for the progressive network and functional marginalization of genes as an evolutionary process preceding gene loss. In the final chapter, I study how alternative splicing (AS)-driven expansion of human proteome diversity leads to system-level complexity through the AS-mediated rewiring of the protein-protein interaction network. By overlaying different network and genomic datasets onto the first large-scale isoform-resolution interactome, I found that differentiating between splice variants is essential to capturing the full extent of the network's functional modularity. 
I also discovered that AS-mediated rewiring preferentially affects tissue-specific genes and that topologically different patterns of rewiring have distinct functional consequences. Furthermore, I found that most rewiring can be traced to the AS of evolutionarily conserved sequence modules, which promote or block interactions and tend to overlap linear motifs and disrupt known domain-domain interactions. Together, this work demonstrates that a network-level perspective and genomic data integration are essential to understanding the evolution and functional diversity of proteomes.