47 research outputs found

    Accuracy and data efficiency in deep learning models of protein expression

    Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency, and we employ explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.

    Prevalence of SOS-mediated control of integron integrase expression as an adaptive trait of chromosomal and mobile integrons

    Background: Integrons are found in hundreds of environmental bacterial species, but are mainly known as the agents responsible for the capture and spread of antibiotic-resistance determinants between Gram-negative pathogens. The SOS response is a regulatory network under control of the repressor protein LexA targeted at addressing DNA damage, thus promoting genetic variation in times of stress. We recently reported a direct link between the SOS response and the expression of integron integrases in Vibrio cholerae and a plasmid-borne class 1 mobile integron. SOS regulation enhances cassette swapping and capture in stressful conditions, while freezing the integron in steady environments. We conducted a systematic study of available integron integrase promoter sequences to analyze the extent of this relationship across the Bacteria domain. Results: Our results showed that LexA controls the expression of a large fraction of integron integrases by binding to Escherichia coli-like LexA binding sites. In addition, the results provide experimental validation of LexA control of the integrase gene for another Vibrio chromosomal integron and for a multiresistance plasmid harboring two integrons. There was a significant correlation between lack of LexA control and predicted inactivation of integrase genes, even though experimental evidence also indicates that LexA regulation may be lost to enhance expression of integron cassettes. Conclusions: Ancestral-state reconstruction on an integron integrase phylogeny led us to conclude that the ancestral integron was already regulated by LexA. The data also indicated that SOS regulation has been actively preserved in mobile integrons and large chromosomal integrons, suggesting that unregulated integrase activity is selected against. Nonetheless, additional adaptations have probably arisen to cope with unregulated integrase activity. Identifying them may be fundamental in deciphering the uneven distribution of integrons in the Bacteria domain.

    The synthetic integron: an in vivo genetic shuffling device

    As the field of synthetic biology expands, strategies and tools for the rapid construction of new biochemical pathways will become increasingly valuable. Purely rational design of complex biological pathways is inherently limited by the current state of our knowledge. Selection of optimal arrangements of genetic elements from randomized libraries may well be a useful approach for successful engineering. Here, we propose the construction and optimization of metabolic pathways using the inherent gene shuffling activity of a natural bacterial site-specific recombination system, the integron. As a proof of principle, we constructed and optimized a functional tryptophan biosynthetic operon in Escherichia coli. The trpA-E genes along with ‘regulatory’ elements were delivered as individual recombination cassettes in a synthetic integron platform. Integrase-mediated recombination generated thousands of genetic combinations overnight. We were able to isolate a large number of arrangements displaying varying fitness and tryptophan production capacities. Several assemblages required as many as six recombination events and produced as much as 11-fold more tryptophan than the natural gene order in the same context

    Recoding of synonymous genes to expand evolutionary landscapes requires control of secondary structure affecting translation

    Synthetic DNA design needs to harness the many information layers embedded in a DNA string. We previously developed the Evolutionary Landscape Painter (ELP), an algorithm that exploits the degeneracy of the code to increase protein evolvability. Here, we have used ELP to recode the integron integrase gene (intI1) in two alternative alleles. Although synonymous, both alleles yielded less IntI1 protein and were less active in recombination assays than intI1. We spliced the three alleles and mapped the activity decrease to the beginning of alternative sequences. Mfold predicted the presence of more stable secondary structures in the alternative genes. Using synonymous mutations, we decreased their stability and recovered full activity. Following a design-build-test approach, we have now updated ELP to consider such structures and provide streamlined alternative sequences. Our results support the possibility of modulating gene activity through the ad hoc design of 5′ secondary structures in synthetic genes

    Synonymous Genes Explore Different Evolutionary Landscapes

    The evolutionary potential of a gene is constrained not only by the amino acid sequence of its product, but by its DNA sequence as well. The topology of the genetic code is such that half of the amino acids exhibit synonymous codons that can reach different subsets of amino acids from each other through single mutation. Thus, synonymous DNA sequences should access different regions of the protein sequence space through a limited number of mutations, and this may deeply influence the evolution of natural proteins. Here, we demonstrate that this feature can be of value for manipulating protein evolvability. We designed an algorithm that, starting from an input gene, constructs a synonymous sequence that systematically includes the codons with the most different evolutionary perspectives; i.e., codons that maximize accessibility to amino acids previously unreachable from the template by point mutation. A synonymous version of a bacterial antibiotic resistance gene was computed and synthesized. When concurrently submitted to identical directed evolution protocols, both the wild type and the recoded sequence led to the isolation of specific, advantageous phenotypic variants. Simulations based on a mutation isolated only from the synthetic gene libraries were conducted to assess the impact of sub-functional selective constraints, such as codon usage, on natural adaptation. Our data demonstrate that rational design of synonymous synthetic genes stands as an affordable improvement to any directed evolution protocol. We show that using two synonymous DNA sequences improves the overall yield of the procedure by increasing the diversity of mutants generated. These results provide conclusive evidence that synonymous coding sequences do experience different areas of the corresponding protein adaptive landscape, and that a sequence's codon usage effectively constrains the evolution of the encoded protein
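The codon-level idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the standard genetic code, an upper-case in-frame DNA string, and a simple greedy rule that, for each codon, picks the synonymous codon giving access to the most amino acids the template codon cannot reach by a single point mutation. The function names `reachable`, `translate` and `recode` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reachable(codon):
    """Amino acids reachable from `codon` by one point mutation.

    Stop codons and the codon's own amino acid are excluded.
    """
    own = CODON_TABLE[codon]
    found = set()
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            aa = CODON_TABLE[codon[:pos] + base + codon[pos + 1:]]
            if aa != "*" and aa != own:
                found.add(aa)
    return found


def translate(dna):
    """Translate an in-frame DNA string (length assumed to be a multiple of 3)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode(dna):
    """Greedy synonymous recoding (an illustrative simplification).

    Each codon is swapped for the synonym that adds the most amino acids
    not reachable from the template codon; the template codon is kept
    when no synonym adds anything new.
    """
    out = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        baseline = reachable(codon)
        synonyms = [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[codon]]
        best = max(synonyms, key=lambda c: len(reachable(c) - baseline))
        out.append(best if reachable(best) - baseline else codon)
    return "".join(out)
```

For serine, for instance, `reachable("TCT")` and `reachable("AGT")` overlap only partly, so a template using `TCT` would be recoded to one of the `AGY` codons while the protein sequence stays unchanged. A real design tool would also have to weigh constraints this sketch ignores, such as codon usage and mRNA secondary structure.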

    Massive factorial design untangles coding sequences determinants and phenotypic consequences of translation efficacy

    Translation plays a crucial role in the cellular economy. The translation rate of a given transcript determines the eventual production rate of the corresponding protein. Being very costly, translation both contributes deeply to and is affected by the global physiological state of the cell. A better understanding of the factors governing translation efficiency is therefore of tremendous fundamental and applied importance. Although a plethora of sequence determinants of translation efficiency have been proposed, most are highly confounded properties arising from the same underlying sequence. No systematic study has purposely sought to unravel the relative contributions of such intertwined factors to eventual protein production and their effect on the cell. We have applied a novel sequence design platform and scaling DNA synthesis capacities to systematically explore combinations of various compositional biases (nucleotide, codon and amino acid) and mRNA secondary structures, thereby implementing a thorough molecular Design of Experiments in different focal regions of the sequence space. After cloning in a standard reporter system, we used multiplexed deep sequencing approaches to quantify the consequences of such sequence variations in a high-throughput manner. We thus measured mRNA abundance and decay, ribosomal densities, protein production and growth rates. Functional analysis of 244,000 precisely designed coding sequences in E. coli uncovers the overwhelming dominance of secondary structures in controlling translation initiation and elongation, as well as transcript stability. We identify a moderate role for codon usage in modulating elongation and ribosome loading when the impact of secondary structures in limiting initiation is lifted by way of translational coupling. Beyond the cost of protein biosynthesis, we observed little effect of codon usage on cellular growth rates. Instead, we find that highly structured, slow-initiating transcripts are exceedingly stabilized and can poison cells through unproductive ribosome entrapment.

    Mémoire Habilitation à Diriger des Recherches


    Reprogramming Viral Host Specificity To Control Insect Populations

    One of the most diverse and successful groups of animals, insects are an integral part of ecosystems. Yet some are serious nuisances for human health and development. Such pests have been efficiently controlled using chemical insecticides, but the rise of resistance, the broadly untargeted environmental impacts and the increasing recognition of chronic toxicity call for the urgent development of safer and cleaner alternatives. Biological control strategies that take advantage of natural antagonistic relationships between existing organisms and a target pest have been around for millennia. In spite of the inherent risks of unintended side effects, these approaches have recently gained renewed interest. Perhaps because they evoke greater fears, surprisingly few microorganisms have been used for that purpose. Densoviruses are small viruses capable, as a group, of infecting a broad range of insects with various degrees of specificity. Their minute genomes comprise a handful of genes, which lend themselves to in-depth molecular dissection using synthetic biology approaches. Our goal is to develop the tools and knowledge necessary to enable the use of densoviruses as safe, specific and efficient biocontrol agents. We focus on JcDV, which infects crop-devastating caterpillars, and AalDV, which infects disease-vector mosquitoes. Here, I present early efforts to systematically unravel the structural motifs responsible for capsid specificity. The capsids of densoviruses are small (19-24 nm) non-enveloped icosahedrons (T=1) resulting from the self-assembly of 60 identical or highly similar capsid proteins. The DNA sequences coding for these proteins represent roughly a third of the genome and are the prime determinant of specificity. I am using the genome of JcDV to set up a high-throughput, cost-effective pipeline to deconstruct the phenotypic consequences of many precise capsid mutations. This will make it possible to better understand natural variation, to map evolutionary landscapes, to discover useful properties and to learn the rules for reprogramming specificity.
