23 research outputs found

    Discovery and genotyping of structural variation from long-read haploid genome sequence data

    In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels, resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.

    Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

    The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes
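The NG50 statistic quoted in the abstract above can be illustrated with a short sketch. It uses the standard definition (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of an assumed genome size). The function name `ng50` and the toy contig lengths are illustrative assumptions, not values or code from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the length of the contig at which the running total of
    contig lengths, sorted longest first, reaches half of genome_size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the contigs cover less than half of the genome size


# Toy contig lengths in kbp (made up for illustration, not from the paper).
toy_contigs = [500, 300, 200, 100, 50]
print(ng50(toy_contigs, genome_size=1600))
```

Unlike N50, which is measured against the total assembly length, NG50 is measured against a fixed genome (or region) size, so values from two assemblies of the same genome can be compared directly.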

    A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar.

    Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. The VCF format can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of variant files. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by the VCF format.

    Genomic Patterns of De Novo Mutation in Simplex Autism.

    To further our understanding of the genetic etiology of autism, we generated and analyzed genome sequence data from 516 idiopathic autism families (2,064 individuals). This resource includes >59 million single-nucleotide variants (SNVs) and 9,212 private copy number variants (CNVs), of which 133,992 and 88 are de novo mutations (DNMs), respectively. We estimate a mutation rate of ∼1.5 × 10⁻⁸ SNVs per site per generation with a significantly higher mutation rate in repetitive DNA. Comparing probands and unaffected siblings, we observe several DNM trends. Probands carry more gene-disruptive CNVs and SNVs, resulting in severe missense mutations and mapping to predicted fetal brain promoters and embryonic stem cell enhancers. These differences become more pronounced for autism genes (p = 1.8 × 10⁻³, OR = 2.2). Patients are more likely to carry multiple coding and noncoding DNMs in different genes, which are enriched for expression in striatal neurons (p = 3 × 10⁻³), suggesting a path forward for genetically characterizing more complex cases of autism.

    Wham: Identifying Structural Variants of Biological Consequence

    Existing methods for identifying structural variants (SVs) from short read datasets are inaccurate. This complicates disease-gene identification and efforts to understand the consequences of genetic variation. In response, we have created Wham (Whole-genome Alignment Metrics) to provide a single, integrated framework for both structural variant calling and association testing, thereby bypassing many of the difficulties that currently frustrate attempts to employ SVs in association testing. Here we describe Wham, benchmark it against three other widely used SV identification tools (Lumpy, Delly and SoftSearch), and demonstrate Wham's ability to identify and associate SVs with phenotypes using data from humans, domestic pigeons, and vaccinia virus. Wham and all associated software are covered under the MIT License and can be freely downloaded from GitHub (https://github.com/zeeev/wham), with documentation on a wiki (http://zeeev.github.io/wham/). For community support please post questions to https://www.biostars.org/.

    Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

    Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without having parental data and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97% compared to 80–91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

    Sensitivity and false discovery rates (FDR) for simulated data.

    The sensitivity and FDR of Delly, Lumpy, SoftSearch and Wham for simulated deletions, duplications, insertions and inversions. The sensitivity is measured for each category at depths of 10x and 50x. SVs ranging from 50 bp to 1 Mb are grouped into four left-closed size intervals. A) The sensitivity of the three tools is faceted on size, depth and SV type. At 10x Wham has noticeably better sensitivity for deletions and duplications in the smallest size class. Wham's sensitivity is higher than Delly and Lumpy for insertions at 10x and gains sensitivity at 50x. B) The FDR for each type of SV faceted by depth and the amount of slop added to each confidence interval. In the 25 bp slop category, each confidence interval was extended in both directions by 25 bp. At 10x depth Wham has the highest FDR across all SV classes and Lumpy has the lowest. At 50x Delly has heightened FDR for deletions and Lumpy has a much higher FDR for insertions. Shrinking the confidence intervals increases the FDR for Delly and Lumpy, but not Wham. C) Breakpoint sensitivity for deletions. The confidence intervals provided by the three tools are ignored and slop is incrementally added to the predicted breakpoints. Wham has the highest sensitivity when 1–10 bp of slop is added. D) Genotype sensitivity for the homozygous non-reference simulated SVs. Delly and Wham have similar sensitivity for deletions and duplications while both tools fail to correctly genotype duplications.
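As a rough illustration of how sensitivity and FDR depend on the slop added to each call, here is a minimal sketch. It assumes that a call matches a truth interval when the two overlap after the call is extended by `slop` bp in both directions; this matching rule, the function names, and the toy intervals are assumptions for illustration and are not taken from the Wham benchmark.

```python
def overlaps(call, truth, slop=0):
    """True if two (chrom, start, end) intervals overlap once the call is
    extended by `slop` bp in both directions."""
    return (call[0] == truth[0]
            and call[1] - slop <= truth[2]
            and truth[1] <= call[2] + slop)


def sensitivity_and_fdr(calls, truths, slop=0):
    """Sensitivity = truth intervals recovered / all truth intervals.
    FDR = calls matching no truth interval / all calls."""
    found = sum(any(overlaps(c, t, slop) for c in calls) for t in truths)
    false_pos = sum(not any(overlaps(c, t, slop) for t in truths) for c in calls)
    sensitivity = found / len(truths) if truths else 0.0
    fdr = false_pos / len(calls) if calls else 0.0
    return sensitivity, fdr


# Toy data (made up): two true deletions and three calls.
truths = [("chr1", 100, 200), ("chr1", 1000, 1500)]
calls = [("chr1", 210, 260), ("chr1", 990, 1490), ("chr2", 5, 50)]

for slop in (0, 25):
    print(slop, sensitivity_and_fdr(calls, truths, slop))
```

In this toy case the first call sits 10 bp past the end of a true deletion, so it counts as a false positive with no slop and as a true positive with 25 bp of slop, which lowers the FDR and raises the sensitivity.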

    Wham detects structural variation in vaccinia virus populations.

    A) Read depth normalized within each sample is plotted across the ~200 kb vaccinia genome (excluding inverted terminal repeats) for either the parental strain (top panel) or an adapted strain (middle and bottom panels, called by Wham or Lumpy, respectively). Arrows highlight the positions of K3L CNV and E3L deletion. The black lines represent the breakpoints of every SV call after filtering (see Supporting Information: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004572#sec016). B) Wham calls in the adapted strain near the K3L duplication breakpoint are shown as black triangles above the viral genes in colored boxes. The height of the triangle represents split-read (SR) count supporting the call. Sanger sequencing positions relative to the reference sequence are listed below. Asterisks (*) indicate Wham calls that match the exact breakpoint determined by Sanger sequencing (see S3 Table for Wham and Lumpy breakpoints: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004572#pcbi.1004572.s004). C) Wham calls in the adapted strain near the E3L deletion are shown above the genes, and Sanger sequence confirmed positions below, as in B. The arrow indicates the position of the 11K promoter driving β-gal expression. For breakpoints in grey, the height of the triangle indicates the relative mate-pair count from Wham, as these positions do not have SR support.

    Benchmarking Delly, Lumpy, SoftSearch and Wham against NA12878 and CHM1 datasets.

    A) The sensitivity and FDR for filtered NA12878 Phase III deletion calls across four size intervals. The number of true positives and the number of NA12878 calls are listed above sensitivity, while the total number of false positives and total calls for each tool are listed above FDR. Most true positives and false positives are within the 150–1,000 bp interval. B) The sensitivity and FDR for CHM1 deletions. C) The size distribution of the true positive calls that overlap the CHM1 deletions. One thousand true positives were randomly sampled from each tool and the truth set (CHM1-DEL).