    Functional genomics, analysis of adaptation in and applications of models to the metabolism of engineered Escherichia coli

    To examine the metabolism of bacteria in the family Enterobacteriaceae, tools for gene complement comparison and stoichiometric model building have been developed to take advantage of both the number of complete bacterial genome sequences currently available and the relationship between genes and metabolism. A functional genomic approach to improving knowledge of the metabolism of Escherichia coli CFT073 (a uropathogen) has been undertaken, taking into account not only its genome sequence but also its close relationship to E. coli MG1655. A fresh comparison of E. coli CFT073 with E. coli MG1655 has been carried out to identify all genes in CFT073 that are absent from MG1655 and may have metabolic functions. These genes have been further assessed bioinformatically to determine whether they might encode enzymes for the metabolism of chemicals commonly found in human urine, and one set of such genes has been experimentally confirmed to encode an L-sorbose utilisation pathway. Little experimental work has yet been done to elucidate how bacteria respond adaptively to the introduction of heterologous metabolic genes. To investigate this response, the L-sorbose utilisation and uptake operon from CFT073 has been cloned and transformed into DH5, and a selective pressure (minimal medium with L-sorbose as sole carbon source) has been applied over 100 generations of growth of this strain in serial passage to investigate the change in its behaviour. The availability of large numbers of completely sequenced genomes, along with the development of a stoichiometric metabolic model with very high coverage of E. coli metabolism (iAF1260 [1]), has made possible the analysis of the core metabolism of large numbers of bacteria to investigate gene essentiality in these bacteria. 
A novel way of assessing gene complement has been developed using BLAST and DiagHunter to improve the reliability of gene synteny comparisons with contextual information about the genes, and to extend work by others to cover all E. coli and Shigella genomes with sequences available on GenBank (as of 1st June 2009), in order to investigate bioinformatically the essential genes in these bacteria and the heterogeneity of their metabolic networks. Further to this, metabolic models have been constructed for DH5 with an added L-sorbose pathway and for CFT073, and these models have been used to investigate behavioural changes during adaptation of bacteria to novel heterologous genes.
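
The gene complement comparison described above can be illustrated with a minimal Python sketch. It is not the BLAST and DiagHunter pipeline of the thesis and it ignores synteny; it only assumes that similarity search results are already available as a list of hits, and it reports the query-genome genes that have no sufficiently good match in the reference genome. The `Hit` record, the function name, the gene identifiers and both thresholds are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One similarity hit (hypothetical record, e.g. parsed from a hit table)."""
    query: str       # gene in the query genome
    subject: str     # matching gene in the reference genome
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the query gene covered by the alignment


def strain_specific_genes(
    query_genes: Iterable[str],
    hits: Iterable[Hit],
    min_identity: float = 80.0,  # assumed threshold, not taken from the thesis
    min_coverage: float = 80.0,  # assumed threshold, not taken from the thesis
) -> List[str]:
    """Return query genes with no hit that passes both thresholds."""
    matched = {
        h.query
        for h in hits
        if h.identity >= min_identity and h.coverage >= min_coverage
    }
    return sorted(set(query_genes) - matched)


if __name__ == "__main__":
    genes = ["g1", "g2", "g3"]
    hits = [Hit("g1", "r1", 95.0, 99.0), Hit("g2", "r7", 40.0, 30.0)]
    # g2 has only a weak hit and g3 has no hit at all.
    print(strain_specific_genes(genes, hits))
```

A real comparison would also weigh genomic context, which is the role the abstract assigns to the synteny information.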

    Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach

    The genome of Mycobacterium tuberculosis was analyzed using recently developed computational approaches to infer protein function and protein linkages. We evaluated and employed a method to infer genes likely to belong to the same operon, as judged by the nucleotide distance between genes in the same genomic orientation, and combined it with the Rosetta Stone, Phylogenetic Profile and conserved Gene Neighbor computational methods for the inference of protein function.
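
The distance-based operon inference mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' method: genes are given as 1-based inclusive coordinates on a single replicon, and consecutive genes on the same strand are grouped when the gap between them does not exceed `max_gap`. The `Gene` record, the function name and the 50 bp default are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group consecutive same-strand genes separated by at most max_gap bases."""
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if groups:
            prev = groups[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return [[g.name for g in group] for group in groups]


if __name__ == "__main__":
    toy = [
        Gene("a", 1, 100, "+"),
        Gene("b", 120, 300, "+"),     # 19 bp after a, same strand
        Gene("c", 900, 1200, "+"),    # 599 bp after b
        Gene("d", 1210, 1500, "-"),   # close to c but on the other strand
    ]
    print(group_by_distance(toy))
```

The threshold is the main design choice: a larger `max_gap` merges more genes into one putative operon at the cost of more false joins.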


    Although transcription is one of the most important biological functions of cells, our understanding of its regulation is still limited. In this dissertation, we have studied transcriptional regulation in prokaryotes from three angles. First, we investigated the extent to which cis-regulatory elements are conserved during the course of evolution, using the LexA regulons in cyanobacteria as an example. We found that in most cyanobacterial genomes analyzed, LexA appears to function as the transcriptional regulator of the key SOS response genes. The loss of lexA in some genomes might lead to the degradation of its binding sites. Second, directional RNA-seq techniques have recently become the workhorse for transcriptome profiling in prokaryotes; however, it remains challenging to accurately assemble highly labile prokaryotic transcriptomes for further analyses. To fill this gap, we have developed a hidden Markov model based transcriptome assembler which outperforms the state-of-the-art assemblers. Using our tool, we characterized alternative operon structures in E. coli K12 under various growth conditions and growth phases, and found that they are more complex and dynamic than previously anticipated. Lastly, we determined antisense and non-coding transcription patterns in E. coli K12 under various growth conditions and time points. We found that a large portion of genes have antisense transcription in a condition-dependent manner. Most antisense transcripts are initiated at and restricted to the 5′ end of the gene on the sense strand, and their expression levels are correlated with those of the genes on the sense strand, suggesting that these antisense transcripts might play an important role in transcriptional regulation.
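
To make the hidden Markov model idea concrete, here is a generic Viterbi decoding sketch in Python. It is not the assembler developed in the dissertation: the two states ("bg" for background, "tx" for transcribed), the discretized coverage symbols ("low", "high") and every probability below are assumptions chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable state path, computed in log space."""
    # Log-probability of the best path ending in each state at position 0.
    prev = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        cur, ptr = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            cur[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            ptr[s] = best
        backpointers.append(ptr)
        prev = cur
    # Trace back from the best final state.
    path = [max(states, key=lambda s: prev[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy parameters (all assumed): states tend to persist, and transcribed
# regions mostly emit "high" coverage while background mostly emits "low".
STATES = ("bg", "tx")
START = {"bg": 0.5, "tx": 0.5}
TRANS = {"bg": {"bg": 0.9, "tx": 0.1}, "tx": {"bg": 0.1, "tx": 0.9}}
EMIT = {"bg": {"low": 0.9, "high": 0.1}, "tx": {"low": 0.2, "high": 0.8}}

if __name__ == "__main__":
    coverage = ["low"] * 3 + ["high"] * 4 + ["low"] * 3
    print(viterbi(coverage, STATES, START, TRANS, EMIT))
```

With these toy parameters the decoded path marks the run of "high" positions as one transcribed segment flanked by background, which is the kind of segmentation a coverage-based assembler needs before transcript boundaries can be called.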

    Network-based identification of driver pathways in clonal systems

    Highly ethanol-tolerant bacteria for the production of biofuels, bacterial pathogens that are resistant to antibiotics, and cancer cells are examples of phenotypes that are of importance to society and are currently being studied. To better understand these phenotypes and their underlying genotype-phenotype relationships, it is now commonplace to investigate DNA and expression profiles using next generation sequencing (NGS) and microarray techniques. These techniques generate large amounts of omics data, which result in lists of genes that have mutations or expression profiles that potentially contribute to the phenotype. These lists often include a multitude of genes and are troublesome to verify manually, as performing literature studies and wet-lab experiments for a large number of genes is very time- and resource-consuming. Therefore, computational methods are required that can narrow these gene lists down by removing the generally abundant false positives and that can ideally provide additional information on the relationships between the selected genes. Other high-throughput techniques such as yeast two-hybrid (Y2H), ChIP-Seq and ChIP-chip, but also a myriad of small-scale experiments and predictive computational methods, have generated a treasure of interactomics data over the last decade, most of which is now publicly available. By combining these data into a biological interaction network, which contains all molecular pathways that an organism can utilize and is thus the equivalent of the organism's blueprint, it is possible to integrate the omics data obtained from experiments with these biological interaction networks. Biological interaction networks are key to the computational methods presented in this thesis, as they enable the methods to account for important relations between genes (and gene products). 
Doing so makes it possible not only to identify interesting genes but also to uncover molecular processes important to the phenotype. As the best way to analyze omics data from an interesting phenotype varies widely with the experimental setup and the available data, multiple methods were developed and applied in the context of this thesis. In a first approach, an existing method (PheNetic) was applied to a consortium of three bacterial species that together are able to degrade a herbicide efficiently, although none of the species can do so on its own. For each of the species, expression data (RNA-seq) were generated for the consortium and for the species in isolation. PheNetic identified molecular pathways that were differentially expressed and likely contribute to a cross-feeding mechanism between the species in the consortium. Having obtained proof of concept, PheNetic was adapted to cope with experimental evolution datasets in which genomics data were available in addition to expression data. Two publicly available datasets were analyzed: amikacin resistance in E. coli and coexisting ecotypes in E. coli. The results revealed both well-known and newly found molecular pathways involved in these phenotypes. Experimental evolution sometimes generates datasets containing mutator phenotypes, which have high mutation rates. These datasets are hard to analyze because of the large amount of noise (most mutations have no effect on the phenotype). To address this, IAMBEE was developed. IAMBEE is able to analyze genomic datasets from evolution experiments even if they contain mutator phenotypes. IAMBEE was tested using an E. coli evolution experiment in which cells were exposed to increasing concentrations of ethanol. The results were validated in the wet-lab. In addition to methods for the analysis of causal mutations and mechanisms in bacteria, a method for the identification of causal molecular pathways in cancer was developed. 
As bacteria and cancerous cells are both clonal, they can be treated similarly in this context. The big differences are the amount of data available (many more samples are available in cancer) and the fact that cancer is a complex and heterogeneous phenotype. Therefore we developed SSA-ME, which makes use of the concept that a causal molecular pathway has at most one mutation in a cancerous cell (mutual exclusivity). However, enforcing this criterion is computationally hard. SSA-ME is designed to cope with this problem and to search for mutually exclusive patterns in relatively large datasets. SSA-ME was tested on cancer data from the TCGA PAN-cancer dataset. From the results we could, in addition to already known molecular pathways and mutated genes, predict the involvement of a few rarely mutated genes.
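
The mutual exclusivity concept can be made concrete with a small Python sketch. It does not reproduce the SSA-ME scoring or search procedure; it only computes, for one candidate gene set, the fraction of samples with at least one mutated gene in the set (coverage) and, among those, the fraction with exactly one (exclusivity). The function name, both measures as defined here, and the sample and gene identifiers are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def exclusivity_stats(
    samples: Dict[str, Set[str]],  # sample id -> set of mutated genes
    gene_set: Iterable[str],       # candidate pathway (hypothetical)
) -> Tuple[float, float]:
    """Return (coverage, exclusivity) of gene_set over the samples."""
    genes = set(gene_set)
    # Number of genes from the candidate set mutated in each sample.
    counts = [len(mutated & genes) for mutated in samples.values()]
    covered = sum(1 for c in counts if c >= 1)
    exclusive = sum(1 for c in counts if c == 1)
    coverage = covered / len(samples) if samples else 0.0
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


if __name__ == "__main__":
    toy = {
        "s1": {"geneA"},
        "s2": {"geneB"},
        "s3": {"geneA", "geneB"},  # violates mutual exclusivity
        "s4": {"geneC"},
    }
    print(exclusivity_stats(toy, {"geneA", "geneB"}))
```

Scoring one gene set is cheap; the hard part noted in the abstract is searching over the very large number of possible gene sets, which this sketch does not attempt.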


    As a result of recent successes in genome-scale studies, especially genome sequencing, large amounts of new biological data are now available. This naturally challenges the computational world to develop more powerful and precise analysis tools. In this work, three computational studies have been conducted, utilizing complete microbial genome sequences: the detection of operons, the composition of protein families, and the detection of lateral gene transfer events. In the first study, two computational methods, termed the Gene Neighbor Method (GNM) and the Gene Gap Method (GGM), were developed for the detection of operons in microbial genomes. GNM utilizes the relatively high conservation of gene order in operons, compared with genes in general. GGM makes use of the relatively short gap between genes in operons compared with that otherwise found between adjacent genes. The two methods were benchmarked using biological pathway data and documented operon data. Operons were predicted for 42 microbial genomes. The predictions are used to infer possible functions for some hypothetical genes in prokaryotic genomes and have proven a useful adjunct to structure information in deriving protein function in our structural genomics project. In the second study, we have developed an automated clustering procedure to classify protein sequences in a set of microbial genomes into protein families. Benchmarking shows that the clustering method is sensitive at detecting remote family members and has a low level of false positives. The aim of constructing this comprehensive protein family set is to address several questions key to structural genomics. First, our study indicates that approximately 20% of known families with three or more members currently have a representative structure. 
Second, the number of apparent protein families will be considerably larger than previously thought: we estimate that, by the criteria of this work, there will be about 250,000 protein families when 1000 microbial genomes are sequenced. However, the vast majority of these families will be small. Third, it will be possible to obtain structural templates for 70-80% of protein domains with an achievable number of representative structures, by systematically sampling the larger families. The third study is the detection of lateral gene transfer events in microbial genomes. Two new high-throughput methods have been developed and applied to a set of 66 fully sequenced genomes. Both make use of a protein family framework. In the High Apparent Gene Loss (HAGL) method, the number and nature of gene loss events implied by classical evolutionary descent are analyzed. The higher the number of apparent losses, and the smaller the evolutionary distance over which they must have occurred, the more likely it is that one or more genes have been transferred into the family. The Evolutionary Rate Anomaly (ERA) method associates transfer events with proteins that appear to have an anomalously low rate of sequence change compared with the rest of that protein family. The methods are complementary in that the HAGL method works best with small families and the ERA method best with larger ones. The methods have been parameterized against each other, such that they have high specificity (less than 10% false positives) and can detect about half of the test events. Application to the full set of genomes shows widely varying amounts of lateral gene transfer.
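
The idea behind a gene-neighbor approach, that gene order is conserved more often within operons than elsewhere, can be sketched in Python as follows. This is not the GNM implementation from the work above: it assumes each genome is already reduced to an ordered list of protein family identifiers, ignores strand and chromosome circularity, and simply reports family pairs that are adjacent in at least `min_genomes` genomes. All names, identifiers and the threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def conserved_neighbor_pairs(
    genomes: Dict[str, List[str]],  # genome id -> family ids in genome order
    min_genomes: int = 2,           # assumed threshold
) -> Set[Tuple[str, str]]:
    """Return family pairs that are adjacent in at least min_genomes genomes."""
    counts: Counter = Counter()
    for order in genomes.values():
        # Unordered adjacent pairs, counted at most once per genome.
        seen = {frozenset(pair) for pair in zip(order, order[1:]) if pair[0] != pair[1]}
        counts.update(seen)
    return {tuple(sorted(pair)) for pair, n in counts.items() if n >= min_genomes}


if __name__ == "__main__":
    toy = {
        "genome1": ["f1", "f2", "f3", "f4"],
        "genome2": ["f9", "f2", "f1", "f7"],  # f1 and f2 adjacent, order flipped
        "genome3": ["f3", "f8", "f1", "f2"],
    }
    print(conserved_neighbor_pairs(toy))
```

Pairs that pass the threshold are candidates for co-membership in an operon; a gap-based criterion such as the one GGM uses could then be applied as an independent line of evidence.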

    Graph-based modeling and evolutionary analysis of microbial metabolism

    Microbial organisms are responsible for most of the metabolic innovations on Earth. Understanding microbial metabolism helps shed light on questions that are central to biology, biomedicine, energy and the environment. Graph-based modeling is a powerful tool that has been used extensively for elucidating the organising principles of microbial metabolism and the underlying evolutionary forces that act upon it. Nevertheless, various graph-theoretic representations and techniques have been applied to metabolic networks, rendering the modeling aspect ad hoc and highlighting the conflicting conclusions based on the different representations. The contribution of this dissertation is two-fold. In the first half, I revisit the modeling aspect of metabolic networks and present novel techniques for their representation and analysis. In particular, I explore the limitations of standard graph representations and the utility of a more appropriate model, the hypergraph, for capturing metabolic network properties. Further, I address the task of metabolic pathway inference and the necessity to account for chemical symmetries and alternative tracings in this crucial task. In the second part of the dissertation, I focus on two evolutionary questions. First, I investigate the evolutionary underpinnings of the formation of communities in metabolic networks, a phenomenon that has been reported in the literature and implicated in an organism's adaptation to its environment. I find that the metabolome size better explains the observed community structures. Second, I correlate evolution at the genome level with emergent properties at the metabolic network level. In particular, I quantify the various evolutionary events (e.g., gene duplication, loss, transfer, fusion, and fission) in a group of proteobacteria, and analyze their role in shaping the metabolic networks and determining organismal fitness. 
As metabolism gains an increasingly prominent role in biomedical, energy, and environmental research, understanding how to model this process and how it came about during evolution becomes more crucial. My dissertation provides important insights in both directions.
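
One limitation of plain graph representations can be shown with a short Python sketch. It is an illustration, not an analysis from the dissertation: each reaction is a hyperedge given as a pair (substrates, products), the "graph" view treats any single substrate as sufficient to reach the products, and the "hypergraph" view requires all substrates to be present. The metabolite names and function names are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Reaction = Tuple[FrozenSet[str], FrozenSet[str]]  # (substrates, products)


def _expand(
    reactions: List[Reaction],
    seeds: Iterable[str],
    can_fire: Callable[[FrozenSet[str], Set[str]], bool],
) -> Set[str]:
    """Add products of every reaction that can fire until nothing changes."""
    reached = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if can_fire(substrates, reached) and not products <= reached:
                reached |= products
                changed = True
    return reached


def graph_reachable(reactions: List[Reaction], seeds: Iterable[str]) -> Set[str]:
    """Graph view: one available substrate is enough to follow an edge."""
    return _expand(reactions, seeds, lambda subs, reached: bool(subs & reached))


def hypergraph_reachable(reactions: List[Reaction], seeds: Iterable[str]) -> Set[str]:
    """Hypergraph view: a reaction fires only if all substrates are available."""
    return _expand(reactions, seeds, lambda subs, reached: subs <= reached)


if __name__ == "__main__":
    toy = [
        (frozenset({"A", "B"}), frozenset({"C"})),  # A + B -> C
        (frozenset({"C"}), frozenset({"D"})),       # C -> D
    ]
    print(sorted(graph_reachable(toy, {"A"})))       # graph view reaches C and D
    print(sorted(hypergraph_reachable(toy, {"A"})))  # hypergraph view: B is missing
```

Starting from A alone, the graph view reports C and D as producible even though the first reaction also needs B, whereas the hypergraph view does not; this is the kind of discrepancy that can lead different representations to different conclusions.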

    Automatically exploiting genomic and metabolic contexts to aid the functional annotation of prokaryote genomes

    This thesis concerns the development of bioinformatic strategies exploiting genomic and metabolic contextual information in order to generate functional annotations for prokaryote genes. The work comprised two main projects. The first focuses on sequence-orphan enzymatic activities. Today, roughly 27% of the activities defined by the International Union of Biochemistry and Molecular Biology are sequence-orphans. For these, traditional bioinformatic approaches cannot propose candidate genes; it is thus imperative to use alternative, context-based approaches in such cases. The CanOE strategy (fishing Candidate genes for Orphan Enzymes) was developed and added to the MicroScope bioinformatics platform for this purpose. It integrates genomic and metabolic information across thousands of prokaryote genomes in order to locate promising gene candidates for orphan activities. The mirror project focuses on protein families of unknown function. A collaborative project has been set up at the Genoscope in the hope of formalising functional exploration strategies for prokaryote protein families. A pilot version was created on the DUF849 Pfam family, for which a single activity had recently been elucidated. Strategies for proposing novel functions and activities and for creating isofunctional sub-families were developed, so as to guide biochemical experiments and to analyse their results.

    The Proximon: Representation, Evaluation, and Applications of Metagenomic Functional Interactions

    The effective use of metagenomic functional interactions represents a key prospect for a variety of applications in the field of functional metagenomics. By definition, metagenomic operons represent such interactions, but many operon prediction protocols rely on information about orthology and/or gene function that is frequently unavailable for metagenomic genes. In this thesis, I introduce the proximon as a unit of functional interaction that is intended for use in metagenomic scenarios where supplemental information is sparse. The proximon is defined as a series of co-directional genes where minimal intergenic distance exists between any two consecutive member genes within the same proximon. In particular, the proximon is presented here as a biological abstraction aimed at facilitating bioinformatics and computational goals. In this thesis, proximons are constructed as information-theoretic entities and employed in a variety of contexts related to functional metagenomics. I begin by implementing a computational representation for proximon data and demonstrate its utility through the deployment of a public database. Next, I perform a formal validation in which proximons are contrasted against known operons, using the Escherichia coli K-12 model organism as a gold standard to measure the extent to which proximons emulate actual operons. This is followed by a demonstration of how proximon data can be applied to infer potential functional networks and depict potential functional modules. I conclude by enumerating the limitations of the research performed here and presenting objectives and goals for future work.
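
The validation step, comparing predicted gene groupings against known operons, can be sketched in Python. The measure used here is an assumption and not necessarily the one used in the thesis: both the predicted proximons and the reference operons are reduced to sets of adjacent gene pairs, and precision and recall are computed over those pairs. Function names and gene identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def adjacent_pairs(groups: List[List[str]]) -> Set[Tuple[str, str]]:
    """Collect the consecutive gene pairs inside each group."""
    pairs: Set[Tuple[str, str]] = set()
    for group in groups:
        pairs.update(zip(group, group[1:]))
    return pairs


def pair_precision_recall(
    predicted: List[List[str]],  # e.g. predicted proximons, genes in genome order
    reference: List[List[str]],  # e.g. curated operons, genes in genome order
) -> Tuple[float, float]:
    """Precision and recall of predicted adjacent pairs against the reference."""
    pred = adjacent_pairs(predicted)
    ref = adjacent_pairs(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = [["a", "b", "c"], ["d"]]
    reference = [["a", "b"], ["c", "d"]]
    # Predicted pairs: (a, b), (b, c); reference pairs: (a, b), (c, d).
    print(pair_precision_recall(predicted, reference))
```

Working on adjacent pairs avoids having to decide how to score a prediction that only partly overlaps an operon, at the cost of treating every pair as equally important.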

    Bacterial inter-species communication mediated by the autoinducer-2 signal

    Dissertation presented to obtain the Ph.D. degree in Biology by Universidade Nova de Lisboa, Instituto de Tecnologia Química e Biológica, Instituto Gulbenkian de Ciência. During the last few decades, scientists have come to appreciate the immense complexity of the bacterial signaling interactions that sustain microbial communities. Quorum sensing (QS) is a cell-cell communication process whereby single-celled bacteria regulate gene expression synchronously in a population in response to self-produced extracellular signal molecules, called autoinducers. Autoinducer-2 (AI-2), the synthase of which, LuxS, is present in both Gram-negative and Gram-positive bacteria, was proposed to represent a non-species-specific signal that mediates inter-species communication. In enteric bacteria, extracellular AI-2 levels peak in late exponential phase and rapidly decline as bacteria continue to grow. This depletion occurs because AI-2 activates the expression of an operon, lsr (for LuxS Regulated), encoding the Lsr transporter and enzymes that degrade the signal. As the Lsr system imports self and non-self AI-2, lsr-containing bacteria can interfere with AI-2 signaling of other species and shut off group behaviors regulated by this molecule: this system represents the first example of interference with a bacterial inter-species QS signal. (...) Fundação para a Ciência e Tecnologia financial support with the grant SFRH/BD/28543/2006