860 research outputs found

    A high-quality bonobo genome refines the analysis of hominid evolution

    Get PDF
    The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes1,3,4,5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome

    A High-Quality Bonobo Genome Refines The Analysis Of Hominid Evolution

    Get PDF
    The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation(1,2). Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes(1,3-5) and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome

    Reference genome and comparative genome analysis for the WHO reference strain for Mycobacterium bovis BCG Danish, the present tuberculosis vaccine

    Get PDF
    Background: Mycobacterium bovis bacillus Calmette-Guerin (M. bovis BCG) is the only vaccine available against tuberculosis (TB). In an effort to standardize the vaccine production, three substrains, i.e. BCG Danish 1331, Tokyo 172-1 and Russia BCG-1 were established as the WHO reference strains. Both for BCG Tokyo 172-1 as Russia BCG-1, reference genomes exist, not for BCG Danish. In this study, we set out to determine the completely assembled genome sequence for BCG Danish and to establish a workflow for genome characterization of engineering-derived vaccine candidate strains.ResultsBy combining second (Illumina) and third (PacBio) generation sequencing in an integrated genome analysis workflow for BCG, we could construct the completely assembled genome sequence of BCG Danish 1331 (07/270) (and an engineered derivative that is studied as an improved vaccine candidate, a SapM KO), including the resolution of the analytically challenging long duplication regions. We report the presence of a DU1-like duplication in BCG Danish 1331, while this tandem duplication was previously thought to be exclusively restricted to BCG Pasteur. Furthermore, comparative genome analyses of publicly available data for BCG substrains showed the absence of a DU1 in certain BCG Pasteur substrains and the presence of a DU1-like duplication in some BCG China substrains. By integrating publicly available data, we provide an update to the genome features of the commonly used BCG strains. Conclusions: We demonstrate how this analysis workflow enables the resolution of genome duplications and of the genome of engineered derivatives of the BCG Danish vaccine strain. The BCG Danish WHO reference genome will serve as a reference for future engineered strains and the established workflow can be used to enhance BCG vaccine standardization

    Computational methods to improve genome assembly and gene prediction

    Get PDF
    DNA sequencing is used to read the nucleotides composing the genetic material that forms individual organisms. As 2nd generation sequencing technologies offering high throughput at a feasible cost have matured, sequencing has permeated nearly all areas of biological research. By a combination of large-scale projects led by consortiums and smaller endeavors led by individual labs, the flood of sequencing data will continue, which should provide major insights into how genomes produce physical characteristics, including disease, and evolve. To realize this potential, computer science is required to develop the bioinformatics pipelines to efficiently and accurately process and analyze the data from large and noisy datasets. Here, I focus on two crucial bioinformatics applications: the assembly of a genome from sequencing reads and protein-coding gene prediction. In genome assembly, we form large contiguous genomic sequences from the short sequence fragments generated by current machines. Starting from the raw sequences, we developed software called Quake that corrects sequencing errors more accurately than previous programs by using coverage of k-mers and probabilistic modeling of sequencing errors. My experiments show correcting errors with Quake improves genome assembly and leads to the detection of more polymorphisms in re-sequencing studies. For post-assembly analysis, we designed a method to detect a particular type of mis-assembly where the two copies of each chromosome in diploid genomes diverge. We found thousands of examples in each of the chimpanzee, cow, and chicken public genome assemblies that created false segmental duplications. Shotgun sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to both discover unknown microbes and explore complex environments. We developed software called Scimm that clusters metagenomic sequences based on composition in an unsupervised fashion more accurately than previous approaches. Finally, we extended an approach for predicting protein-coding genes on whole genomes to metagenomic sequences by adding new discriminative features and augmenting the task with taxonomic classification and clustering of the sequences. The program, called Glimmer-MG, predicts genes more accurately than all previous methods. By adding a model for sequencing errors that also allows the program to predict insertions and deletions, accuracy significantly improves on error-prone sequences

    Characterisation of pathogen-specific regions and novel effector candidates in Fusarium oxysporum f. sp. cepae

    Get PDF
    A reference-quality assembly of Fusarium oxysporum f. sp. cepae (Foc), the causative agent of onion basal rot has been generated along with genomes of additional pathogenic and non-pathogenic isolates of onion. Phylogenetic analysis confirmed a single origin of the Foc pathogenic lineage. Genome alignments with other F. oxysporum ff. spp. and non pathogens revealed high levels of syntenic conservation of core chromosomes but little synteny between lineage specific (LS) chromosomes. Four LS contigs in Foc totaling 3.9 Mb were designated as pathogen-specific (PS). A two-fold increase in segmental duplication events was observed between LS regions of the genome compared to within core regions or from LS regions to the core. RNA-seq expression studies identified candidate effectors expressed in planta, consisting of both known effector homologs and novel candidates. FTF1 and a subset of other transcription factors implicated in regulation of effector expression were found to be expressed in planta

    Multi-platform​ ​ discovery​ ​ of​ ​ haplotype-resolved structural​ ​ variation​ ​ in​ ​ human​ ​ genomes

    Get PDF
    The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long- and short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent-child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,181 indel variants (<50 bp) and 31,599 structural variants (≥50 bp) per human genome, a seven fold increase in structural variation compared to previous reports, including from the 1000 Genomes Project. We also discovered 156 inversions per genome, most of which previously escaped detection, as well as large unbalanced chromosomal rearrangements. We provide near-complete, haplotype-resolved structural variation for three genomes that can now be used as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies
    • …
    corecore