49 research outputs found

    Synteny Paths for Assembly Graphs Comparison

    Get PDF
    Despite the recent developments of long-read sequencing technologies, it is still difficult to produce complete assemblies of eukaryotic genomes in an automated fashion. Genome assembly software typically output assembled fragments (contigs) along with assembly graphs, that encode all possible layouts of these contigs. Graph representation of the assembled genome can be useful for gene discovery, haplotyping, structural variations analysis and other applications. To facilitate the development of new graph-based approaches, it is important to develop algorithms for comparison and evaluation of assembly graphs produced by different software. In this work, we introduce synteny paths: maximal paths of homologous sequence between the compared assembly graphs. We describe Asgan - an algorithm for efficient synteny paths decomposition, and use it to evaluate assembly graphs of various bacterial assemblies produced by different approaches. We then apply Asgan to discover structural variations between the assemblies of 15 Drosophila genomes, and show that synteny paths are robust to contig fragmentation. The Asgan tool is freely available at: https://github.com/epolevikov/Asgan

    SpectroGene: A Tool for Proteogenomic Annotations Using Top-Down Spectra

    Get PDF
    In the past decade, proteogenomics has emerged as a valuable technique that contributes to the state-of-the-art in genome annotation; however, previous proteogenomic studies were limited to bottom-up mass spectrometry and did not take advantage of top-down approaches. We show that top-down proteogenomics allows one to address the problems that remained beyond the reach of traditional bottom-up proteogenomics. In particular, we show that top-down proteogenomics leads to the discovery of previously unannotated genes even in extensively studied bacterial genomes and present SpectroGene, a software tool for genome annotation using top-down tandem mass spectra. We further show that top-down proteogenomics searches (against the six-frame translation of a genome) identify nearly all proteoforms found in traditional top-down proteomics searches (against the annotated proteome). SpectroGene is freely available at http://github.com/fenderglass/SpectroGene

    Assembly of long error-prone reads using de Bruijn graphs

    Get PDF
    The recent breakthroughs in assembling long error-prone reads were based on the overlap-layout-consensus (OLC) approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the OLC approach is the only practical paradigm for assembling long error-prone reads. We show how to generalize de Bruijn graphs for assembling long error-prone reads and describe the ABruijn assembler, which combines the de Bruijn graph and the OLC approaches and results in accurate genome reconstructions

    Some aspects of the mm-adic analysis and its applications to mm-adic stochastic processes

    Full text link
    In this paper we consider a generalization of analysis on pp-adic numbers field to the mm case of mm-adic numbers ring. The basic statements, theorems and formulas of pp-adic analysis can be used for the case of mm-adic analysis without changing. We discuss basic properties of mm-adic numbers and consider some properties of mm-adic integration and mm-adic Fourier analysis. The class of infinitely divisible mm-adic distributions and the class of mm-adic stochastic Levi processes were introduced. The special class of mm-adic CTRW process and fractional-time mm-adic random walk as the diffusive limit of it is considered. We found the asymptotic behavior of the probability measure of initial distribution support for fractional-time mm-adic random walk.Comment: 18 page

    Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes

    Get PDF
    Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology

    Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes.

    Get PDF
    Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology

    Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.

    Get PDF
    We report full-length draft de novo genome assemblies for 16 widely used inbred mouse strains and find extensive strain-specific haplotype variation. We identify and characterize 2,567 regions on the current mouse reference genome exhibiting the greatest sequence diversity. These regions are enriched for genes involved in pathogen defence and immunity and exhibit enrichment of transposable elements and signatures of recent retrotransposition events. Combinations of alleles and genes unique to an individual strain are commonly observed at these loci, reflecting distinct strain phenotypes. We used these genomes to improve the mouse reference genome, resulting in the completion of 10 new gene structures. Also, 62 new coding loci were added to the reference genome annotation. These genomes identified a large, previously unannotated, gene (Efcab3-like) encoding 5,874 amino acids. Mutant Efcab3-like mice display anomalies in multiple brain regions, suggesting a possible role for this gene in the regulation of brain development
    corecore