19 research outputs found

    Computational Methods for Structural Variation Analysis in Populations

    Recent advances in long-read sequencing have given us an unprecedented view of structural variants (SVs). However, much of their role in disease and evolution remains unknown due to a number of technical and biological challenges, including the high error rate of most long-read sequencing data, the additional complexity of aligning around large variants, and biological differences in how the same SV can manifest in different individuals. In this thesis we introduce novel methods for structural variant analysis and demonstrate how they overcome many of these obstacles. First, we apply recent advances in data structures to the substring search problem and show how learned index structures can enable accelerated alignment of genomic reads. Next, we present an optimized SV calling pipeline that integrates improvements to existing software alongside two novel SV-processing methods, Iris and Jasmine, which improve the accuracy of SV breakpoints and sequences in individual samples and compare and integrate SV calls from multiple samples. Finally, we show how the introduction of CHM13, the first gap-free telomere-to-telomere human reference genome, enables for the first time variant calling in over 100 Mbp of newly resolved sequence and mitigates long-standing issues in variant calling that were attributed to gaps, errors, and minor alleles in the prior GRCh38 reference. We demonstrate the broad applicability of our advancements in SV inference by uncovering novel associations with gene expression in 444 human individuals from the 1000 Genomes Project, by detecting SVs in the tomato genome which affect fruit size and yield, and by comparing SVs between tumor and normal cells in organoids derived from the SKBR3 breast cancer cell line

    Jasmine: Population-scale structural variant comparison and analysis

    The increasing availability of long reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine (https://github.com/mkirsche/Jasmine), a fast and accurate method for SV refinement, comparison, and population analysis. Using an SV proximity graph, Jasmine outperforms five widely used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than five-fold, and reveals a set of high-confidence de novo SVs confirmed by multiple long-read technologies. We also present a harmonized callset of 205,192 SVs from 31 samples of diverse ancestry sequenced with long reads. We genotype these SVs in 444 short-read samples from the 1000 Genomes Project with both DNA and RNA sequencing data and assess their widespread impact on gene expression, including within several medically relevant genes

    Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing

    Advancing crop genomics requires efficient genetic systems enabled by high-quality personalized genome assemblies. Here, we introduce RagTag, a toolset for automating assembly scaffolding and patching, and we establish chromosome-scale reference genomes for the widely used tomato genotype M82 along with Sweet-100, a new rapid-cycling genotype that we developed to accelerate functional genomics and genome editing in tomato. This work outlines strategies to rapidly expand genetic systems and genomic resources in other plant species

    Genomic diversity of SARS-CoV-2 during early introduction into the Baltimore-Washington metropolitan area.

    The early COVID-19 pandemic was characterized by rapid global spread. In Maryland and Washington, DC, United States, more than 2500 cases were reported within 3 weeks of the first COVID-19 detection in March 2020. We aimed to use genomic sequencing to understand the initial spread of SARS-CoV-2, the virus that causes COVID-19, in the region. We analyzed 620 samples collected from the Johns Hopkins Health System during March 11-31, 2020, comprising 28.6% of the total cases in Maryland and Washington, DC. From these samples, we generated 114 complete viral genomes. Analysis of these genomes alongside a subsampling of over 1000 previously published sequences showed that the diversity in this region rivaled global SARS-CoV-2 genetic diversity at that time and that the sequences belong to all of the major globally circulating lineages, suggesting multiple introductions into the region. We also analyzed these regional SARS-CoV-2 genomes alongside detailed clinical metadata and found that clinically severe cases had viral genomes belonging to all major viral lineages. We conclude that efforts to control local spread of the virus were likely confounded by the number of introductions into the region early in the epidemic and the interconnectedness of the region as a whole

    Multi-tissue integrative analysis of personal epigenomes

    Evaluating the impact of genetic variants on transcriptional regulation is a central goal in biological science that has been constrained by reliance on a single reference genome. To address this, we constructed phased, diploid genomes for four cadaveric donors (using long-read sequencing) and systematically charted noncoding regulatory elements and transcriptional activity across more than 25 tissues from these donors. Integrative analysis revealed over a million variants with allele-specific activity, coordinated, locus-scale allelic imbalances, and structural variants impacting proximal chromatin structure. We relate the personal genome analysis to the ENCODE encyclopedia, annotating allele- and tissue-specific elements that are strongly enriched for variants impacting expression and disease phenotypes. These experimental and statistical approaches, and the corresponding EN-TEx resource, provide a framework for personalized functional genomics

    Democratizing long-read genome assembly.

    De novo assembled genomes serve as the backbone for modern genomics. In an article in this issue of Cell Systems, Ekim et al. present the mdBG assembler that can assemble genomes 100-fold faster than previous methods, including a human genome in under 10 min, which unlocks pan-genomics for many species

    Sapling: accelerating suffix array queries with learned data models.

    MOTIVATION: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query it requires widespread memory accesses, causing a large number of cache misses on large datasets. RESULTS: Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array's memory footprint. AVAILABILITY AND IMPLEMENTATION: The source code and tutorial are available open-source at https://github.com/mkirsche/sapling. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
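
The general idea in the Sapling abstract, a learned model that predicts where a query falls in the suffix array so that only a small window needs to be searched, can be illustrated with a toy sketch. This is not the Sapling implementation: the class name `LearnedSuffixIndex`, the base-5 k-mer encoding, the equal-width bins, and the parameters `k` and `num_bins` are all assumptions chosen for illustration, and the suffix array is built naively.

```python
import bisect

# Hypothetical encoding: 0 is reserved for padding past the end of the text.
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def kmer_value(s, k):
    """Encode the first k characters of s as a base-5 integer (short strings padded with 0)."""
    value = 0
    for i in range(k):
        value = value * 5 + (CODE[s[i]] if i < len(s) else 0)
    return value


class LearnedSuffixIndex:
    """Toy suffix array plus a piecewise linear position model (illustrative only)."""

    def __init__(self, text, k=4, num_bins=16):
        self.text, self.k = text, k
        # Naive suffix array construction; adequate only for small demonstrations.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])
        # k-mer values in suffix array order are non-decreasing.
        values = [kmer_value(text[i:i + k], k) for i in self.sa]
        # Split the k-mer value space into equal-width bins (ceiling division).
        self.width = -(-(5 ** k) // num_bins)
        # Each knot stores the suffix array position where its bin starts.
        self.knots = [bisect.bisect_left(values, b * self.width)
                      for b in range(num_bins + 1)]
        # Largest prediction error over all suffixes bounds the search window.
        self.max_err = max((abs(self.predict(v) - i) for i, v in enumerate(values)),
                           default=0)

    def predict(self, value):
        """Linearly interpolate a suffix array position between two knots."""
        b = value // self.width
        left, right = self.knots[b], self.knots[b + 1]
        return left + (right - left) * (value - b * self.width) // self.width

    def prefix(self, rank):
        """First k characters of the suffix at the given suffix array rank."""
        start = self.sa[rank]
        return self.text[start:start + self.k]

    def find(self, query):
        """Return sorted text positions where the length-k query occurs."""
        if len(query) != self.k:
            raise ValueError("this sketch only supports queries of length k")
        guess = self.predict(kmer_value(query, self.k))
        lo = max(0, guess - self.max_err)
        hi = min(len(self.sa), guess + self.max_err + 1)
        # Binary search restricted to the predicted window.
        while lo < hi:
            mid = (lo + hi) // 2
            if self.prefix(mid) < query:
                lo = mid + 1
            else:
                hi = mid
        hits = []
        while lo < len(self.sa) and self.prefix(lo) == query:
            hits.append(self.sa[lo])
            lo += 1
        return sorted(hits)


index = LearnedSuffixIndex("ACGTACGTTAGCACGTGGA", k=3, num_bins=8)
print(index.find("ACG"))
```

In this sketch the model only narrows the range that the binary search has to cover; correctness still comes from comparing the query against the text, so a poor model costs time rather than accuracy.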