17 research outputs found

    Computational Analysis of Microbial Sequence Data Using Statistics and Machine Learning

    Since the discovery of the double helix of DNA in 1953, modern molecular biology has opened the door to a better understanding of how genes control chemical processes within cells, including protein synthesis. Although we are still far from claiming a complete understanding, recent advances in sequencing technologies, increased computational capacity, and more sophisticated computational methods have allowed the development of various new applications that provide further insight into DNA sequence data and how the information they encode impacts living organisms and their environment. Sequencing data can now be used to start identifying the relationships between microorganisms, where they live, and in some cases how they affect their host organisms. We introduce and compare methods used for this bioinformatics application, and develop a machine learning model that can be used to effectively predict environmental factors associated with these microorganisms. Codon Usage Bias (CUB), which refers to the highly non-uniform usage of codons that code for the same amino acid, is known to reflect the expression level of a protein-coding gene under the evolutionary theory that selection favors certain synonymous codons. Traditional methods used to estimate CUB and its relation to protein translation have proven effective on single-celled organisms such as yeast and E. coli, but their applications are limited when it comes to more complex multicellular organisms such as plants and animals. To extend our ability to understand the relations between codon usage patterns and the protein translation processes in these organisms, we develop a novel deep learning model that can discover patterns in codon usage bias between different species using only their DNA sequences.
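    The non-uniform usage of synonymous codons described in this abstract can be made concrete with a short sketch. The Python below counts the codons of a coding sequence and reports, for each amino acid, the fraction of its occurrences contributed by each synonymous codon. The function name, the toy sequence, and the deliberately partial codon table (two amino acids only) are assumptions for illustration; this is not the thesis's deep learning model, nor any of the traditional CUB estimators it compares.

```python
from collections import Counter, defaultdict

# Partial codon table for illustration only (entries follow the standard
# genetic code; a real analysis would use the full 64-codon table).
CODON_TO_AA = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
}

def synonymous_codon_fractions(cds):
    """Return {amino_acid: {codon: fraction}} for codons found in CODON_TO_AA."""
    cds = cds.upper()
    # Split into in-frame codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TO_AA)

    # Group the observed codon counts by the amino acid they encode.
    per_aa = defaultdict(dict)
    for codon, n in counts.items():
        per_aa[CODON_TO_AA[codon]][codon] = n

    # Normalise counts within each amino acid so the fractions sum to 1.
    return {
        aa: {codon: n / sum(table.values()) for codon, n in table.items()}
        for aa, table in per_aa.items()
    }

# Toy sequence: GCT GCT GCT GCC AAA AAG
fractions = synonymous_codon_fractions("GCTGCTGCTGCCAAAAAG")
```

    A uniform usage would give each synonymous codon the same fraction; in the toy sequence, alanine is skewed towards GCT while lysine is split evenly between AAA and AAG.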

    ExpressInHost: A codon tuning tool for the expression of recombinant proteins in host microorganisms

    ExpressInHost (https://gitlab.com/a.raguin/expressinhost) is a GTK/C++-based, user-friendly graphical interface for tuning the codon sequence of an mRNA for recombinant protein expression in a host microorganism. Heterologous gene expression is widely implemented in biotechnology companies and academic research laboratories. However, expression of recombinant proteins can be challenging. On the one hand, maximising translation speed is important, especially in scalable production processes relevant to biotechnology companies; on the other hand, solubility problems often arise as a consequence, since translation "pauses" might be key to allowing the nascent polypeptide chain to fold appropriately. To address this challenge, we have developed software that offers three distinct modes to tune codon sequences using the redundancy of the genetic code. The tuning strategies implemented take into account the specific tRNA resources of the host and those of the native organism. They balance rapid translation against mimicking the native translation speed to allow proper protein folding, thereby avoiding protein solubility problems.

    ExpressInHost : A codon tuning tool for the expression of recombinant proteins in host microorganisms

    Funding information: This work was performed as part of the Innovate UK project “Predictive optimisation of biocatalyst production for high-value chemical manufacturing” (Project Number TP101439). The current position of A.R. is funded by the German federal and state programme Professorinnenprogramms III for female scientists. Peer reviewed.

    Codon usage clusters correlation: Towards protein solubility prediction in heterologous expression systems in E. coli

    Production of soluble recombinant proteins is crucial to the development of industry and basic research. However, aggregation due to incorrect folding of the nascent polypeptides is still a major bottleneck. Understanding the factors governing protein solubility is important to grasp the underlying mechanisms and improve the design of recombinant proteins. Here we show a quantitative study of the expression and solubility of a set of proteins from Bizionia argentinensis. Through the analysis of different features known to modulate protein production, we defined two parameters based on the %MinMax algorithm to compare codon usage clusters between the host and the target genes. We demonstrate that the absolute difference between all %MinMax frequencies of the host and the target gene is significantly negatively correlated with protein expression levels. Most importantly, a strong positive correlation between solubility and the degree of conservation of codon usage clusters is observed for two independent datasets. Moreover, we show that this correlation is higher in codon usage clusters involved in less compact protein secondary structure regions. Our results provide important tools for protein design and support the notion that codon usage may dictate translation rate and modulate co-translational folding.
    Affiliation (Pellizza Pena, Leonardo Agustín; Smal, Clara; Rodrigo, Guido; Aran, Martin): Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina
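    For readers unfamiliar with %MinMax, the idea can be sketched as follows: within a sliding window of codons, the usage frequency of the codons actually present is compared with the highest, lowest, and average frequencies available among their synonymous alternatives, giving a value between -100 (only the rarest codons) and +100 (only the most common). The Python below is a minimal sketch under stated assumptions: the usage table holds hypothetical frequencies for two amino acids, the window of 3 codons is chosen only to keep the example short, and the function name is illustrative. It does not reproduce the two derived parameters defined in the study.

```python
# Hypothetical codon usage frequencies (per thousand) for two amino acids;
# a real analysis would load a full table for the organism of interest.
HOST_USAGE = {
    "Ala": {"GCT": 10.0, "GCC": 30.0},
    "Lys": {"AAA": 40.0, "AAG": 10.0},
}
CODON_TO_AA = {c: aa for aa, codons in HOST_USAGE.items() for c in codons}

def percent_min_max(codons, usage, window=3):
    """Return one %MinMax value per sliding window of `window` codons."""
    profile = []
    for start in range(len(codons) - window + 1):
        # Window sums are used instead of means; the ratios are unchanged.
        actual = maximum = minimum = average = 0.0
        for codon in codons[start:start + window]:
            freqs = usage[CODON_TO_AA[codon]]
            actual += freqs[codon]
            maximum += max(freqs.values())
            minimum += min(freqs.values())
            average += sum(freqs.values()) / len(freqs)
        if actual >= average:
            # Common-codon side: scale between the average and the maximum.
            span = maximum - average
            profile.append(100.0 * (actual - average) / span if span else 0.0)
        else:
            # Rare-codon side: scale between the average and the minimum.
            span = average - minimum
            profile.append(-100.0 * (average - actual) / span if span else 0.0)
    return profile

profile = percent_min_max(["GCC", "AAA", "GCC", "AAG"], HOST_USAGE)
```

    Evaluating the same gene against the usage table of its native organism and against that of the expression host yields two such profiles, which is the kind of comparison the abstract refers to.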

    Genome-wide changes in protein translation efficiency are associated with autism

    We previously proposed that changes in the efficiency of protein translation are associated with autism spectrum disorders (ASDs). This hypothesis connects environmental factors and genetic factors because each can alter translation efficiency. For genetic factors, we previously tested our hypothesis using a small set of ASD-associated genes, a small set of ASD-associated variants, and a statistic to quantify by how much a single nucleotide variant (SNV) in a protein-coding region changes translation speed. In this study, we confirm and extend our hypothesis using a published set of 1,800 autism quartets (parents, one affected child and one unaffected child) and genome-wide variants. Then, we extend the test statistic to combine translation efficiency with other possibly relevant variables: ribosome profiling data, presence/absence of CpG dinucleotides, and phylogenetic conservation. The inclusion of ribosome profiling abundances strengthens our results for male–male sibling pairs. The inclusion of CpG information strengthens our results for female–female pairs, giving an insight into the significant gender differences in autism incidence. By combining the single-variant test statistic for all variants in a gene, we obtain a single gene score to evaluate how well a gene distinguishes between affected and unaffected siblings. Using statistical methods, we compute gene sets that have some power to distinguish between affected and unaffected siblings by translation efficiency of gene variants. Pathway and enrichment analyses of those gene sets suggest the importance of Wnt signaling pathways, some other pathways related to cancer, ATP binding, and ATPase pathways in the etiology of ASDs.

    Translational control by ribosome pausing in bacteria: How a non-uniform pace of translation affects protein production and folding

    Protein homeostasis of bacterial cells is maintained by coordinated processes of protein production, folding, and degradation. Translational efficiency of a given mRNA depends on how often the ribosomes initiate synthesis of a new polypeptide and how quickly they read the coding sequence to produce a full-length protein. The pace of ribosomes along the mRNA is not uniform: periods of rapid synthesis are separated by pauses. Here, we summarize recent evidence on how ribosome pausing affects translational efficiency and protein folding. We discuss the factors that slow down translation elongation and affect the quality of the newly synthesized protein. Ribosome pausing emerges as an important factor contributing to the regulatory programs that ensure the quality of the proteome and integrate the cellular and environmental cues into regulatory circuits of the cell.

    The Evolutionary and Functional Roles of Synonymous Codon Usage in Eukaryotes

    Most amino acids are encoded by multiple synonymous codons. Although alternative usage of synonymous codons does not affect the amino acid sequences of proteins, researchers have been reporting evidence for functional synonymous codon usage at the species- and gene-specific levels for over four decades. It has been shown that variations in synonymous codon usage can affect phenotypes through diverse mechanisms such as shaping translation efficiency and mRNA stability. On the other hand, the common view that cellular and organismal phenotypes are primarily determined by proteins, whose functions are primarily determined by amino acid sequences, often drives the assumption that synonymous mutations are evolutionarily neutral. Consequently, this assumption has been used extensively in evolutionary biology, population genetics, and structural biology. One explanation of the apparent contradiction between the empirical findings, which indicate that synonymous mutations can affect related phenotypes, and the theoretical models, which stipulate that synonymous mutations are neutral, is that neutral synonymous mutations represent the general rule while non-neutral synonymous mutations represent the rare exceptions. In my thesis, I examined this explanation by applying computational and experimental approaches, which indicated that: 1) Non-neutral synonymous mutations significantly affect a considerable proportion of protein-coding genes; 2) Gene-specific codon usage patterns, such as the preference for a specific combination of rare codons, are possibly associated with specific gene functions, such as enhancing tissue-specific gene expression; 3) Some protein-coding genes include codon clusters whose codon usage patterns cannot be explained by selection-independent processes, and thus such codon clusters seem to serve as domains affecting protein functions. Together, these data suggest that synonymous mutations should not be a priori considered neutral. Furthermore, my studies suggest that the biochemical functions of at least some proteins are shaped not only by the constituent amino acid residues but also by codon usage biases at the gene-specific and sub-genic levels. In conclusion, my thesis work suggests that many of the commonly used approaches for analyzing the selection on protein-coding DNA sequences, which rely on the assumption that synonymous mutations are generally neutral, may generate biased results. Furthermore, my studies indicate that selection on gene-specific codon usage bias has evolved to serve diverse biological functions, which are still mostly uncharacterized.

    Unveiling the secrets of DNA : Improved expression and phage display efficiency of synthetic recombinant binding proteins in E. coli through modulation of codon usage

    Recombinant binding proteins are becoming increasingly important in various applications, including diagnostics, basic research, and therapeutics. Therefore, the demand for recombinant binding proteins will increase in the future, making it essential to enhance protein production methods. One approach is to improve heterologous expression, in which proteins are produced outside their native hosts. Escherichia coli has been a workhorse of heterologous expression for decades due to easy cultivation, cost efficiency, well-known genetics, and compatibility with phage display. However, the production of heterologous proteins in E. coli can sometimes be very difficult. Often the problems in heterologous expression are related to codon usage, which works as a control mechanism of protein translation, especially in bacteria. Different organisms do not use codons in the same manner, and in some cases, a heterologously expressed gene can include codons that are rarely or too frequently used in E. coli, which can disturb the codon-usage-derived control of translation. Many studies have reported improved yields of heterologous proteins produced in E. coli when the codon usage of the heterologous gene has been recoded to better suit E. coli. Most often, recombinant binding proteins used in the previously mentioned applications are not native proteins of E. coli either, and their production or phage display in E. coli can often be very cumbersome. In this thesis, we aimed to improve the expression and phage display properties of two essential recombinant binding proteins in E. coli through modulation of codon usage. One binding protein was an antibody fragment called fragment antigen binding (Fab), and the other was an artificial recombinant binding protein called Designed Ankyrin Repeat Protein (DARPin). In the first publication, the expression and phage display efficiency of a human anti-digoxigenin Fab fragment was improved by codon harmonizing selected segments of the Fab fragment gene. In the second publication, expression and secretion properties of Fab fragments were improved by selecting enhanced signal sequence variants from PelB signal sequence libraries, which included only codon changes. In the third publication, the signal sequence libraries were used to enhance Sec-dependent phage display of an anti-GFP DARPin and a DARPin library.