5,866 research outputs found

    Genome-wide signatures of complex introgression and adaptive evolution in the big cats.

    The great cats of the genus Panthera comprise a recent radiation whose evolutionary history is poorly understood. Their rapid diversification poses challenges to resolving their phylogeny while offering opportunities to investigate the historical dynamics of adaptive divergence. We report the sequence, de novo assembly, and annotation of the jaguar (Panthera onca) genome, a novel genome sequence for the leopard (Panthera pardus), and comparative analyses encompassing all living Panthera species. Demographic reconstructions indicated that all of these species have experienced variable episodes of population decline during the Pleistocene, ultimately leading to small effective sizes in present-day genomes. We observed pervasive genealogical discordance across Panthera genomes, caused by both incomplete lineage sorting and complex patterns of historical interspecific hybridization. We identified multiple signatures of species-specific positive selection, affecting genes involved in craniofacial and limb development, protein metabolism, hypoxia, reproduction, pigmentation, and sensory perception. There was remarkable concordance in pathways enriched in genomic segments implicated in interspecies introgression and in positive selection, suggesting that these processes were connected. We tested this hypothesis by developing exome capture probes targeting ~19,000 Panthera genes and applying them to 30 wild-caught jaguars. We found at least two genes (DOCK3 and COL4A5, both related to optic nerve development) bearing significant signatures of interspecies introgression and within-species positive selection. These findings indicate that post-speciation admixture has contributed genetic material that facilitated the adaptive evolution of big cat lineages.

    Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding

    We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping, and shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and a scaffold N50 size over 25 Mb that spans ~90% of the genome. The population structure of the L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades, with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
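    The N50 statistic quoted in this abstract is the length L such that sequences of length L or longer together make up at least half of the total assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy lengths (hypothetical, not the seabass data).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

    Contig N50 and scaffold N50 use the same calculation; only the input lengths differ.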

    Multi-platform discovery of haplotype-resolved structural variation in human genomes

    Identifying Structural Variation in Haploid Microbial Genomes from Short-Read Resequencing Data Using Breseq

    Mutations that alter chromosomal structure play critical roles in evolution and disease, including in the origin of new lifestyles and pathogenic traits in microbes. Large-scale rearrangements in genomes are often mediated by recombination events involving new or existing copies of mobile genetic elements, recently duplicated genes, or other repetitive sequences. Most current software programs for predicting structural variation from short-read DNA resequencing data are intended primarily for use on human genomes. They typically disregard information in reads mapping to repeat sequences, and significant post-processing and manual examination of their output is often required to rule out false-positive predictions and precisely describe mutational events. Results: We have implemented an algorithm for identifying structural variation from DNA resequencing data as part of the breseq computational pipeline for predicting mutations in haploid microbial genomes. Our method evaluates the support for new sequence junctions present in a clonal sample from split-read alignments to a reference genome, including matches to repeat sequences. Then, it uses a statistical model of read coverage evenness to accept or reject these predictions. Finally, breseq combines predictions of new junctions and deleted chromosomal regions to output biologically relevant descriptions of mutations and their effects on genes. We demonstrate the performance of breseq on simulated Escherichia coli genomes with deletions generating unique breakpoint sequences, new insertions of mobile genetic elements, and deletions mediated by mobile elements. Then, we reanalyze data from an E. coli K-12 mutation accumulation evolution experiment in which structural variation was not previously identified. Transposon insertions and large-scale chromosomal changes detected by breseq account for ~25% of spontaneous mutations in this strain. In all cases, we find that breseq is able to reliably predict structural variation with modest read-depth coverage of the reference genome (>40-fold). Conclusions: Using breseq to predict structural variation should be useful for studies of microbial epidemiology, experimental evolution, synthetic biology, and genetics when a reference genome for a closely related strain is available. In these cases, breseq can discover mutations that may be responsible for important or unintended changes in genomes that might otherwise go undetected.
    Funding: U.S. National Institutes of Health R00-GM087550; U.S. National Science Foundation (NSF) DEB-0515729; NSF BEACON Center for the Study of Evolution in Action DBI-0939454; Cancer Prevention & Research Institute of Texas (CPRIT) RP130124; University of Texas at Austin startup funds; University of Texas at Austin; CPRIT Cancer Research Traineeship; Molecular Bioscience

    SVIM: Structural Variant Identification using Mapped Long Reads

    Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
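    The clustering step named in this abstract can be pictured with a small sketch: signatures of the same type that start close together on the same contig are grouped, and each group is summarized by a consensus position. This is a simplified stand-in and not SVIM's actual implementation; the `Signature` record, the `cluster_signatures` and `summarize` functions, and the `max_distance` threshold are all hypothetical names introduced for illustration.

```python
from collections import namedtuple
from statistics import median_low

# Hypothetical record for one structural variant signature seen in one read.
Signature = namedtuple("Signature", "contig start end sv_type read")


def cluster_signatures(signatures, max_distance=100):
    """Group same-type signatures whose start positions chain within max_distance."""
    clusters = []
    ordered = sorted(signatures, key=lambda s: (s.contig, s.sv_type, s.start))
    for sig in ordered:
        last = clusters[-1][-1] if clusters else None
        if (last is not None
                and last.contig == sig.contig
                and last.sv_type == sig.sv_type
                and sig.start - last.start <= max_distance):
            clusters[-1].append(sig)  # close enough: extend the current cluster
        else:
            clusters.append([sig])    # otherwise open a new cluster
    return clusters


def summarize(cluster):
    """Collapse a cluster into (contig, start, end, type, supporting signatures)."""
    return (
        cluster[0].contig,
        median_low(s.start for s in cluster),
        median_low(s.end for s in cluster),
        cluster[0].sv_type,
        len(cluster),
    )


# Toy signatures (hypothetical coordinates).
signatures = [
    Signature("chr1", 1000, 1500, "DEL", "read1"),
    Signature("chr1", 1010, 1495, "DEL", "read2"),
    Signature("chr1", 1005, 1510, "DEL", "read3"),
    Signature("chr1", 5000, 5300, "INS", "read4"),
    Signature("chr2", 1000, 1500, "DEL", "read5"),
]
calls = [summarize(c) for c in cluster_signatures(signatures)]
```

    Taking the median of the supporting positions keeps a single noisy read from shifting the reported breakpoints.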

    PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

    Because the reads produced by high-throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to give priority to the more reliable reads with long overlaps and mix them with reads that have relatively short overlaps. Moreover, these approaches have not been specifically designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and different read overlap sizes ranging from Omax to Omin to resolve gaps and branches. By constructing a decision model using a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct extension using a look-ahead approach. Many branches that are difficult to resolve are due to tandem repeats that lie close together in the genome. PERGA detects the different copies of such repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets, ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
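    The idea of preferring long overlaps and falling back to shorter ones (from Omax down to Omin) can be illustrated with a toy greedy extender. This is only a sketch under simplifying assumptions: it uses error-free single-end reads, it has no SVM decision model, paired-end guidance, or look-ahead, and it simply stops at any branch. The names `next_base_votes` and `extend_contig` and the parameters `o_max`, `o_min`, and `max_steps` are hypothetical and are not PERGA's API.

```python
from collections import Counter


def next_base_votes(contig, reads, overlap):
    """Count the bases that follow the contig's last `overlap` bases in the reads."""
    suffix = contig[-overlap:]
    votes = Counter()
    for read in reads:
        pos = read.find(suffix)
        while pos != -1:
            nxt = pos + overlap
            if nxt < len(read):
                votes[read[nxt]] += 1
            pos = read.find(suffix, pos + 1)
    return votes


def extend_contig(contig, reads, o_max, o_min, max_steps=10_000):
    """Extend a contig to the right one base at a time, longest overlap first."""
    if not 1 <= o_min <= o_max:
        raise ValueError("require 1 <= o_min <= o_max")
    for _ in range(max_steps):
        votes = Counter()
        # Use the longest overlap that has any read support.
        for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
            votes = next_base_votes(contig, reads, overlap)
            if votes:
                break
        if len(votes) != 1:
            break  # no support, or a branch this sketch does not try to resolve
        contig += next(iter(votes))
    return contig


# Toy data: error-free 6-base reads tiled across a short hypothetical sequence.
genome = "ACGTTGCAAGGCTA"
reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
assembled = extend_contig("ACGTTG", reads, o_max=5, o_min=3)
print(assembled)
```

    In this sketch a branch simply ends the extension; according to the abstract, that is the point at which PERGA applies its decision model, paired-end information, and look-ahead.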

    New Approaches to Long-Read Assembly under High Error Rates

    The field of genome assembly is concerned with developing algorithms that reconstruct genomes computationally from sequencing data. It first came to public attention in the 1990s with the Human Genome Project. Because only short stretches of the human genome could be read, longer genomic sequences had to be reconstructed afterwards by computer from the fragments that had been read. Almost 20 years after the publication of the human genome sequences, genome assembly remains an essential processing step for sequencing data. Only the throughput, length, and error profile of the reads have changed, and with them the algorithmic requirements. Genome assembly research thus complements sequencing technologies, which have continued to develop at enormous speed. Together they allow the genomes of a rapidly growing number of organisms to be deciphered and thereby form the basis for a large part of the research in many areas of biology and medicine. Despite the impressive technological and algorithmic developments of recent decades, complete genome sequences have so far been reconstructed only for bacterial genomes. Assembling the much larger eukaryotic genomes involves several unsolved algorithmic problems. These problems are related to various repetitive structures that occur in almost all genomes of higher organisms. As a result, eukaryotic genomes are always published as considerably more disconnected sequences than the respective organisms have chromosomes. The repetitive structures responsible for the gaps in genome sequences can be roughly divided into three classes. Microsatellites and minisatellites are very short sequences that can repeat thousands or tens of thousands of times in direct succession. This pattern is typical of centromeres and telomeres, which are located in the middle and at the ends of many chromosomes. Interspersed repeats, often also called transposons, are longer sequences that frequently occur in nearly identical form at different locations in the genome. Tandem repeats, by contrast, are longer sequences that can occur several times in direct succession in a genome. Tandem repeats are often gene complexes, that is, clusters of nearly identical protein-coding segments that allow the cell to produce the encoded proteins particularly quickly. Each of these repetitive structures places specific demands on assembly algorithms. In this doctoral thesis we make several contributions to solving the latter two problems, the assembly of interspersed repeats and of tandem repeats. In Part 1 we present several data processing procedures that prepare sequencing data in order to identify the rare differences between genomic sequences that occur multiple times. These include software for computing and optimizing multiple sequence alignments (MSAs) by dynamic programming and for statistically modeling and analyzing the differences as the MSA presents them. In Part 2 we build on this analysis and present a software program for assembling interspersed repeats. This program builds on several algorithmic innovations and can effectively assemble transposon families with very long sequences and very many distinct copies. It is the first program of its kind able to assemble transposon families with dozens of copies. We show that, for smaller transposon families as well, it is more accurate and faster than the only existing competing program specialized in this assembly problem. In Part 3 we describe an analysis pipeline that enables us to assemble gene complexes consisting of dozens of tandem repeats. This pipeline includes clustering and graph drawing algorithms. Its core is an error correction algorithm based on neural networks. We demonstrate the practical utility of this pipeline by assembling the Drosophila histone complex. In conclusion, we discuss the possibility of assembling micro- and minisatellites and propose research directions for further improvements in interspersed repeat and gene complex assembly.