510 research outputs found

    Whole-Genome-Based Phylogeny and Taxonomy for Prokaryotes

    A faithful prokaryotic phylogeny should be inferred from genomic data, and phylogeny determines taxonomy. The ever-growing number of sequenced genomes makes this approach feasible and practical. Whole-genome phylogeny must be based on alignment-free methodology and should be verified by direct comparison with taxonomy at all ranks from domains down to species. When the number of genomes reaches the tens of thousands, realizing this program also presents technical challenges. The power of a long-tested web server named Composition Vector Tree (CVTree) will be demonstrated on examples ranging from mega-classification of bacteria to high resolution at and below the species level.
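
The alignment-free idea mentioned in the abstract above can be illustrated with a minimal sketch: each sequence is reduced to a vector of k-mer frequencies, and two sequences are compared through the cosine of the angle between their vectors. This is a simplified illustration, not the CVTree implementation. As far as we understand, the published method works on protein sequences and also corrects k-mer counts against a background expectation, and both of those steps are omitted here. The function names and the default k are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(sequence, k):
    """Return the relative frequency of every k-mer in the sequence."""
    seq = sequence.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def composition_distance(seq_a, seq_b, k=3):
    """Alignment-free distance (1 - cosine) / 2 between k-mer frequency vectors.

    Simplifying assumption: raw frequencies are used, with no background
    correction of the kind a full composition-vector method would apply.
    """
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    dot = sum(value * fb.get(kmer, 0.0) for kmer, value in fa.items())
    norm_a = sqrt(sum(v * v for v in fa.values()))
    norm_b = sqrt(sum(v * v for v in fb.values()))
    cosine = dot / (norm_a * norm_b)
    return (1.0 - cosine) / 2.0
```

A matrix of such pairwise distances could then be passed to any distance-based tree-building method. No alignment is needed at any step, which is what makes the approach scale to very large genome collections.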

    Analysis of genomic diversity and evolution of Escherichia coli through large-scale genome analysis

    Thesis (Ph.D.) -- Seoul National University Graduate School: School of Biological Sciences, August 2016. Jongsik Chun (천종식). Bacterial evolution is driven by the enormous genomic diversity present in populations. The genomic diversity of a bacterial population is generated and maintained by the compounded influences of several microevolutionary mechanisms. The uniqueness of bacterial genome evolution originates from the mixture of vertical and horizontal heredity. As a result, the dynamics of bacterial genomes within a species exhibit characteristics of both clonal and sexual genetics and, impressively, a seemingly unlimited genomic repertoire can be achieved by a single species. The course and consequences of genomic diversification within bacterial species are not fully understood. Because of the extensive genomic diversity within a species, understanding genomic evolution within bacterial species requires large-scale exploratory and descriptive studies as well as explanatory studies based on working hypotheses about how genomes evolve. Escherichia coli, a well-known laboratory model organism, has been shown to exploit highly diverse ecological niches in its natural populations. Genomic studies of E. coli have indeed revealed significant genome dynamism accompanied by ecological diversification. E. coli includes several types of pathogens that impose a severe global burden of enteric disease, and for that reason whole-genome surveys have been active for this species. At present, more than four thousand E. coli genome sequences from genetically diverse strains are available. E. coli therefore constitutes an ideal model for studies of intra-specific genomic evolution in bacteria. In this thesis, multiple aspects of the genomic diversity of E. coli were explored and described by comparative analysis of 3,945 genome sequences of strains belonging to the genus Escherichia. In addition, the roles played by distinct microevolutionary mechanisms in shaping the current structure of genomic diversity were assessed. 
Lastly, a broader perspective on the evolution of E. coli genomes was obtained by analyzing the evolutionary history of E. coli and its closest relatives. The genomic diversity of E. coli was explored in four aspects: pan-genome size, sequence diversity, structural diversity and phylogenetic diversity. Analysis of 3,909 E. coli strains indicated that the E. coli pan-genome is open. Comparison between the phylogenetic diversity and the pan-genome size estimated for randomly selected subsets of the strains showed a linear relationship between the two values. Counter-intuitively, the ratio of pan-genome size growth to the increment of phylogenetic diversity was higher within the phylogenetic groups of E. coli than for the entire species. Seeking the reason behind this trend was a major theoretical motivation of this thesis. The sequence diversity of E. coli core genes had a unimodal distribution with a modal value of 1.3%. The core gene order was unexpectedly well conserved among E. coli genomes, and the presence of a clonal frame was supported by the linkage analysis, both indicating that the core-genome of E. coli is highly stable. An emerging conclusion from the analysis of genomic diversity was that the paces of gene content diversification and gene sequence diversification can be uncoupled. Whole-genome phylogenetic analysis showed that phylogenetic structure was clearly present among the strains of E. coli. The nature of this phylogenetic structuring of the E. coli population was another major theoretical motivation of this thesis. Increased inter-SNP linkage within the phylogenetic groups provided a clue that each phylogenetic group has relatively elevated clonality, while recombination rates in the ancestral population of E. coli were higher than the current rates. 
The assumption of clonality within phylogenetic groups could explain the observed higher within-group rate of pan-genome growth per unit of phylogenetic diversity expansion. Increased clonality is expected to increase the efficiency of selective sweeps caused by positive selection, thus resulting in the destruction and delay of sequence diversification. Inference of recombination history in the core-genome of E. coli showed that 0.78% to 4.1% of the DNA segments in the core-genome have been replaced by homologous recombination. Among the extant lineages of E. coli, the relative impact of recombination over mutation on the changes introduced to DNA sequences was distributed around 0.6 to 0.8. Relatively recent branches showed lower R/Theta than the ancestral branches, implying a historical decline in the influence of recombination. This direct observation of a temporal decline of recombination supported the hypothesis that E. coli is shifting toward clonality. In the pan-genome of E. coli, the singleton genes, which occur in just a single strain, could have originated from recent horizontal gene transfer or recent duplication. About half of the singleton genes could not be matched to any other gene in the current prokaryotic genome database. For about 10% of the E. coli singleton genes, highly similar proteins were found in diverse taxonomic divisions. Most frequently, the best hits resided in close relatives of E. coli in the family Enterobacteriaceae. However, distant taxa in other phyla, especially the Firmicutes, contributed a significant number of best hits, implying that those microbes share a common environmental gene pool with the natural E. coli population. The predominant direction of natural selection on E. coli genes was shown to be negative selection, which suppresses the diversification of sequences. Negative selection was stronger in the core-genome than in genes with lower gene frequency. 
Although negative selection was dominant across the entire gene frequency spectrum, some genes exhibited dN/dS larger than 1 and appeared to be positively selected. Transposases comprised the largest proportion of positively selected genes. Multiple genes involved in flagellar biosynthesis were found to be positively selected or under relaxed negative selection. Based on the phylogenetic analysis of 21 genera in Enterobacteriaceae using their core-genome, diversification within Enterobacteriaceae was characterized by a pattern of radiation and extensively conflicting phylogenetic signals in the basal region. Such ambiguity at deep branches was also observed in phylogenetic networks within the genus Escherichia. This observation may support temporally fragmented speciation. In an attempt to resolve the divergence order among the species of Escherichia, Bayesian multispecies coalescent analysis was carried out using 3 gene sets, each composed of 60 core genes. The reconciled species tree and the collective graph of the coalescences estimated from the gene sets reconfirmed that the divergence order among Escherichia spp. is genuinely ambiguous. To add geological time-scale information to the knowledge about E. coli evolution, a time-tree analysis was performed using the core-genome and the previously estimated divergence time of E. coli. By extending the previously known divergence time between E. coli and Salmonella enterica, the age of Escherichia was shown to be between 37.9 and 40 MYA. The age of E. coli was estimated to be between 16.6 and 17.7 MYA if clade I was excluded from E. coli and between 25.9 and 26.9 MYA if clade I was included. The obscurity of the phylogenetic scenario for the origin of Shigella pathogens within E. coli was addressed by comparing the multigene phylogeny of Shigella virulence plasmids with the chromosomal phylogeny. 
At least five independent plasmid acquisition events had to be assumed to explain the incongruence between the two phylogenies. According to the results obtained in this study, the population genetics of E. coli went through a transition from a relatively sexual global population to relatively clonal sub-populations. Such a transition can provide the basis for the presence of phylogenetic structure, which is not common in bacterial species. Strong clonality was shown to be negatively associated with the genetic diversity of a species, and the slowed sequence diversification due to reduced recombination might be the reason for the increased pan-genome growth rate per unit of phylogenetic diversity in the phylogenetic groups of E. coli. As the example of E. coli shows, bacterial genome evolution is affected by a complex interplay between evolutionary mechanisms and, moreover, can shift in the course of intra-specific evolution. Therefore, the nature and concept of species and speciation in bacteria may vary from species to species and from time to time.

CHAPTER 1. General introduction
  1.1. Bacterial genome evolution
  1.2. Escherichia coli
  1.3. Purposes and organization of this study
CHAPTER 2. Analysis of intra-specific genomic diversity of E. coli represented in the genome dataset
  2.1. Introduction
  2.2. Materials and methods
    2.2.1. Newly sequenced E. coli genomes and the genome data obtained from public databases
    2.2.2. Taxonomic identification, annotation of protein-coding genes and clustering of orthologous proteins
    2.2.3. Pan-genome statistics
    2.2.4. Phylogenetic analysis
    2.2.5. Population structure inference using core single nucleotide polymorphisms
    2.2.6. Analysis of gene contents variation, gene order conservation and genome-wide linkage between SNP sites
  2.3. Results
    2.3.1. Basic characterization of the genomes data
    2.3.2. Open pan-genome of E. coli
    2.3.3. Statistical analysis of pan-genome gene frequency distribution
    2.3.4. Evolutionary rate of pan-genome growth
    2.3.5. Phylogenetic and population genetic structure inferred from genome data
    2.3.6. Intra-specific sequence diversity in the pan-genome of E. coli
    2.3.7. Analysis of gene content variation
    2.3.8. Conservation of synteny and linkage over long distance
    2.3.9. Comparison of E. coli pan-genome properties and phylogenetic structure with those of other bacterial species
  2.4. Discussion
CHAPTER 3. Characterization of microevolutionary processes that mediated genomic diversification of E. coli
  3.1. Introduction
  3.2. Materials and methods
    3.2.1. Genome dataset
    3.2.2. Analysis of homologous recombination events
    3.2.3. Analysis of gene gain and loss history and tracking the origins of the singleton genes in E. coli pan-genome
    3.2.4. Analysis of dN/dS ratio
  3.3. Results
    3.3.1. Impact of homologous recombination in genomic evolution of E. coli
    3.3.2. Impact of gene gain and loss in the genomic evolution of E. coli and the origins of recently gained genes
    3.3.3. Analysis of the signs of natural selection in the pan-genome of E. coli
  3.4. Discussion
CHAPTER 4. Systematics study of E. coli and related taxa
  4.1. Introduction
    4.1.1. Timed history of bacterial evolution
    4.1.2. Obscurities in the systematics of E. coli
  4.2. Materials and methods
    4.2.1. Reconstruction of Enterobacteriaceae phylogeny
    4.2.2. Molecular clock analysis and species tree analysis of Escherichia
    4.2.3. Reconstruction of Shigella virulence plasmid phylogeny
    4.2.4. Reconstruction of rut and phn operon phylogenies
  4.3. Results
    4.3.1. Phylogenomic analysis of the evolutionary relationships of Enterobacteriaceae species
    4.3.2. Molecular chronology of E. coli
    4.3.3. Phylogenetic scenario for Shigella spp.
    4.3.4. Genes that distinguished E. coli from other Escherichia spp.
  4.4. Discussion
CHAPTER 5. Conclusions
REFERENCES
Abstract in Korean

    Molecular and ecological characterisation of Escherichia coli from plants

    Escherichia coli is routinely isolated from vegetables, and there is increasing evidence that plants are a secondary reservoir for commensal and pathogenic strains, but the ecological factors involved in the persistence of E. coli on plants are not clear. In this thesis, a comparative study was undertaken combining phenotypic and phylogenetic analyses of E. coli isolates from salads grown in the UK and from the faeces of mammalian hosts. In vitro phenotypic profiling revealed significant differences according to the source of isolation: strains from plants belonged mostly to phylogroup B1 and displayed lower siderophore production, greater motility, higher biofilm production, and better growth on aromatic compounds and sucrose. However, plant-associated isolates reached lower growth yields on many carbon sources, including several amino acids and common carbohydrates such as glucose and mannitol. The data obtained indicate that, in addition to lateral gene transfer, variation (regulation or uptake) in core metabolic functions plays an important role in E. coli ecological adaptation. When the discriminating phenotypes were combined to generate a plant association index (PAi) to rank strains according to their potential to persist on plants, a strong association between PAi and phylogeny was found, notably high levels in phylogroup B1 and low levels in phylogroup B2, which could potentially constitute a good predictor for host specialisation and generalisation in E. coli. As a more applied and preliminary investigation, the question of how a strain with a medium level of PAi (GMB30) can influence the resident microflora of field- and laboratory-grown spinach was also addressed. Overall, this study shows that despite frequent acquisition and loss of traits associated with nonhost environments, the E. coli phylogroups differ substantially in their transmission ecology and in their levels of adaptation to their host.

    Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes

    BACKGROUND: Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determination of molecular clocks and for improved typing techniques. RESULTS: We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree, based on alignment, and a pan-genome tree, based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence and divides the E. coli strains into the observed MLST type clades and also separates defined phylotypes. CONCLUSION: The results of comparing a large and diverse E. coli dataset support the theory that reliable and good resolution phylogenies can be inferred from the core-genome. The results further suggest that the resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.
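
The three gene-set sizes reported in the abstract above (strict core at 100%, 'soft core' at 95% or more, and pan-genome) can all be derived from a gene presence/absence table. The following is a minimal sketch of that bookkeeping, assuming the table is available as a mapping from gene cluster to the set of genomes that carry it. The function name, the input layout and the toy data are hypothetical and are not taken from the paper.

```python
def pan_genome_summary(presence, soft_core_percent=95):
    """Count pan-genome, soft-core and strict-core gene clusters.

    presence: dict mapping a gene cluster id to the set of genome ids
              in which that cluster was found (hypothetical input layout).
    soft_core_percent: minimum percentage of genomes that must carry a
              cluster for it to count as soft core.
    """
    genomes = set().union(*presence.values())
    n_genomes = len(genomes)
    # Strict core: clusters found in every genome.
    core = {c for c, found in presence.items() if len(found) == n_genomes}
    # Soft core: clusters found in at least the given percentage of genomes.
    # Integer arithmetic avoids floating-point surprises at the threshold.
    soft_core = {
        c for c, found in presence.items()
        if len(found) * 100 >= soft_core_percent * n_genomes
    }
    return {
        "genomes": n_genomes,
        "pan": len(presence),
        "soft_core": len(soft_core),
        "core": len(core),
    }


# Toy data: 20 genomes and 4 gene clusters.
all_genomes = {f"g{i}" for i in range(20)}
toy_presence = {
    "clusterA": set(all_genomes),                  # in all 20 genomes
    "clusterB": all_genomes - {"g0"},              # in 19 of 20 (95%)
    "clusterC": {f"g{i}" for i in range(10)},      # in half of the genomes
    "clusterD": {"g3"},                            # singleton
}
summary = pan_genome_summary(toy_presence)
```

The soft-core threshold matters most for draft-quality assemblies, where a gene may be missing from a genome simply because it fell into an assembly gap. This is the motivation the abstract gives for preferring the 95% set.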

    What Can We Learn from a Metagenomic Analysis of a Georgian Bacteriophage Cocktail?

    Phage therapy, a practice widespread in Eastern Europe, has untapped potential in the combat against antibiotic-resistant bacterial infections. However, technology transfer to Western medicine is proving challenging. Bioinformatics analysis could help to facilitate this endeavor. In the present study, the Intesti phage cocktail, a key commercial product of the Eliava Institute, Georgia, has been tested on a selection of bacterial strains, sequenced as a metagenomic sample, de novo assembled and analyzed by bioinformatics methods. Furthermore, eight bacterial host strains were infected with the cocktail, and the resulting lysates were sequenced and compared to the unamplified cocktail. The analysis identified 23 major phage clusters in the cocktail, among them clusters related to the ICTV genera T4likevirus, T5likevirus, T7likevirus, Chilikevirus and Twortlikevirus, as well as a cluster that was quite distant from the database sequences and a novel Proteus phage cluster. Examination of the depth of coverage showed the clusters to have different abundances within the cocktail. The cocktail was found to be composed primarily of Myoviridae (35%) and Siphoviridae (32%), with Podoviridae being a minority (15%). No undesirable genes were found.

    Generalizations of the genomic rank distance to indels

    MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with the Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relatives of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
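
To make the "genomes as matrices" idea in the abstract above concrete, here is a minimal sketch of the basic rank distance for two genomes with the same gene content, that is, the starting point before the indel generalizations the paper introduces. It rests on our understanding of the rank distance model rather than on the paper's code: a genome over n gene extremities is encoded as an n x n 0/1 matrix with a 1 at (i, j) and (j, i) for every adjacency and a 1 at (i, i) for every telomere, and the distance is rank(A - B). The function names and the extremity numbering are hypothetical, and the repository linked in the abstract should be treated as the authoritative implementation.

```python
from fractions import Fraction


def genome_matrix(n_extremities, adjacencies):
    """Build the 0/1 genome matrix: 1 at (i, j) and (j, i) for each adjacency,
    1 at (i, i) for each extremity left unpaired (a telomere)."""
    m = [[0] * n_extremities for _ in range(n_extremities)]
    paired = set()
    for i, j in adjacencies:
        m[i][j] = m[j][i] = 1
        paired.update((i, j))
    for i in range(n_extremities):
        if i not in paired:
            m[i][i] = 1
    return m


def matrix_rank(matrix):
    """Rank over the rationals by Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def rank_distance(a, b):
    """Rank of the difference of two genome matrices of equal size."""
    diff = [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    return matrix_rank(diff)


# Extremities of genes 1..3 numbered 0:1t 1:1h 2:2t 3:2h 4:3t 5:3h.
linear = genome_matrix(6, [(1, 2), (3, 4)])     # one chromosome: 1 2 3
inverted = genome_matrix(6, [(1, 3), (2, 4)])   # gene 2 inverted: 1 -2 3
cut = genome_matrix(6, [(1, 2)])                # chromosome cut between 2 and 3
```

On this toy data the inversion differs from the original chromosome by rank 2 and the single cut by rank 1. That matches the usual reading of the model, in which a cut or join has weight 1 and a double swap has weight 2.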

    GO4genome: A Prokaryotic Phylogeny Based on Genome Organization

    Determining the phylogeny of closely related prokaryotes may fail in an analysis of rRNA or a small set of sequences. Whole-genome phylogeny utilizes the maximally available sample space. For a precise determination of genome similarity, two aspects have to be considered when developing an algorithm for whole-genome phylogeny: (1) gene order conservation is a more precise signal than gene content; and (2) when using sequence similarity, failures in identifying orthologues or the in situ replacement of genes via horizontal gene transfer may give misleading results. GO4genome is a new paradigm based on a detailed analysis of gene function and the location of the respective genes. To characterize genes, the algorithm uses gene ontology, enabling a comparison of function independent of evolutionary relationship. After the identification of locally optimal series of gene functions, their length distribution is used to compute a phylogenetic distance. The outcome is a classification of genomes based on metabolic capabilities and their organization. Thus, the impact of effects on genome organization that are not covered by methods of molecular phylogeny can be studied. Genomes of strains belonging to Escherichia coli, Shigella, Streptococcus, Methanosarcina, and Yersinia were analyzed. Differences from the findings of classical methods are discussed.

    The diversity and structure of Escherichia coli populations in fresh water environments

    Escherichia coli is a well-known commensal inhabitant of the gastrointestinal tract of both humans and animals and a highly diverse species. The physiology, biochemistry and genetics of E. coli have been studied extensively over many decades. However, these studies have focussed predominantly on pathogenic and commensal isolates. E. coli has been described as typically existing in two environments: the primary environment is the gastrointestinal tract of the host, and the secondary environment is the environment outside of the host (water, soil and sediments). Upon introduction into the environment outside of the host, the numbers of E. coli steadily decline. Generally, where E. coli is present in the external environment and its numbers are maintained, this is due to a constant direct faecal input from the host. This short lifespan in the environment outside of the host forms the basis for the use of E. coli as an indicator organism for faecal contamination in water systems. In contrast, multiple studies have shown that some E. coli strains have the ability to survive and persist in the external environment in the absence of faecal input from the host. With a large pan-genome and the possibility of horizontal gene transfer (HGT) of desirable traits, E. coli has the potential to adapt to a variety of different niches, overcoming drastic changes in conditions in its new environment. In addition, adaptation to the secondary environment is facilitated by the presence of soils and sediments, which in an aquatic environment provide a source of nutrients and protection from the drastic change in conditions. Here, E. coli has the ability to occupy a new niche and become naturalised within an aquatic environment. The aim of this master's project was to examine and characterise the diversity of E. coli isolates collected from two South African freshwater environments, namely the Roodeplaat and Rietvlei Dams, Pretoria. 
Specific research questions addressed in this study include: (1) Are there unique and genetically differentiated sub-populations within the aquatic environments sampled? (2) Is there a link between the unique sub-populations and their sample site? (3) Finally, what is the relationship between sub-populations in terms of gene flow and population structure? Understanding the population structure and ecology of E. coli may shed some light on its evolution and potential to adapt to new environments. Following phylogrouping, AFLP and phylogenetic analysis of the rpoS and uidA genes, the results indicated that the population was highly diverse, with the majority of strains grouping together with the sewage isolates. Furthermore, population structure analyses concentrating on gene flow and genetic differentiation revealed that possible environmental groups exist within the population. In particular, two groups of E. coli isolates associated with aquatic plants showed restricted gene flow and definite genetic differentiation. These two groups can also be observed in the rpoS and uidA phylogenetic analyses, where they consistently group together in the absence of sewage isolates. These findings demonstrate that some E. coli are not only able to survive outside of their host but have undergone some level of niche separation within the secondary environment. These results raise important questions about the accuracy of using E. coli as an indicator organism. In the long term, this study may aid in understanding the population dynamics of E. coli and the implications of environmental strains for the use of E. coli in assessing water quality. Dissertation (MSc)--University of Pretoria, 2013. NRF. Microbiology and Plant Pathology. MSc.