37 research outputs found

    Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

    <div><p>Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired-end data from two Illumina libraries with small (<i>e.g.</i>, 180 bp) and large (<i>e.g.</i>, 3–5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.</p></div>

    Venn diagram of the overlap in false negative (A) and false positive (B) calls by the three variant detection tools, Pilon, GATK UnifiedGenotyper and SAMtools.

    <p>False negative calls are the number of unique events from the curation set that were missed by each tool. Overlaps in the Venn diagram show the number of variants that were missed by multiple tools. False positive calls are the number of predictions from <i>M. tuberculosis</i> F11 that were not supported by the curation set. Overlaps indicate predictions that were shared among tools.</p>
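
The false negative and false positive overlaps in this caption amount to set differences and intersections. The following is a minimal Python sketch, assuming each variant can be represented as a hashable `(position, ref, alt)` tuple; the function names and the toy call sets are hypothetical and are not data from the study.

```python
def false_negatives(curated, called):
    """Curated events that the tool did not call."""
    return curated - called


def false_positives(curated, called):
    """Tool predictions that the curation set does not support."""
    return called - curated


def shared_by_all(per_tool):
    """Items present in every tool's set (the centre of the Venn diagram)."""
    sets = list(per_tool.values())
    return set.intersection(*sets) if sets else set()


# Toy data (hypothetical, for illustration only).
curated = {(100, "A", "G"), (250, "C", "T"), (400, "G", "GA")}
calls = {
    "pilon": {(100, "A", "G"), (250, "C", "T"), (900, "T", "C")},
    "gatk": {(100, "A", "G"), (900, "T", "C")},
    "samtools": {(100, "A", "G"), (250, "C", "T"), (900, "T", "C"), (950, "A", "C")},
}

fn = {tool: false_negatives(curated, c) for tool, c in calls.items()}
fp = {tool: false_positives(curated, c) for tool, c in calls.items()}
missed_by_all = shared_by_all(fn)      # events no tool recovered
predicted_by_all = shared_by_all(fp)   # unsupported calls every tool made
```

Pairwise overlaps can be read off the same dictionaries with `fn["pilon"] & fn["gatk"]` and so on.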

    Example Pilon-generated genome browser tracks.

    <p>This region was flagged by Pilon as containing a possible local mis-assembly, but Pilon was unable to determine a fix due to a tandem repeat sequence. The tracks shown here include: the <i>Pilon Features</i> track, indicating the extent of the region flagged by Pilon as containing a potential mis-assembly; the <i>Valid Coverage</i> track, indicating the sequence coverage of valid read pair alignments, excluding the clipped portions of the alignments; the <i>Clipped Alignments</i> track, indicating the number of reads soft-clipped at each location; and the <i>Pct Bad Alignments</i> track, indicating the percentage of the total reads aligned to each location which are not part of <i>Valid Coverage</i>. These tracks are created with the '--tracks' command-line option. Together, these tracks reveal the true bounds of the mis-assembly, and indicate that there are likely missing copies of the tandem repeat in the draft assembly. In this case, manual analysis revealed the draft assembly was missing two of three full copies of a 57-base tandem repeat.</p>
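
The <i>Pct Bad Alignments</i> definition in this caption can be written as a per-position calculation. This is a small Python sketch, assuming two equal-length lists of per-position depths are already available; the function name and the choice to report 0.0 where no reads align are assumptions for illustration, not Pilon's implementation.

```python
def pct_bad_alignments(total_depth, valid_depth):
    """Per-position percentage of aligned reads that are not valid coverage."""
    if len(total_depth) != len(valid_depth):
        raise ValueError("depth lists must have the same length")
    pct = []
    for total, valid in zip(total_depth, valid_depth):
        # Positions with no aligned reads are reported as 0.0 (an assumption).
        pct.append(0.0 if total == 0 else 100.0 * (total - valid) / total)
    return pct
```

A sustained run of high values next to a drop in valid coverage is the kind of signal the caption describes as marking the bounds of a mis-assembly.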

    Summary assembly statistics before and after Pilon improvement.

    <p>In all cases, the assemblies were more contiguous, contained more bases, and had fewer gaps and errors after Pilon improvement.</p>

    Pilon's performance in calling variants in <i>M. tuberculosis</i> F11 that were larger than 50 nt.

    <p>Variants are divided by type across the rows. Missed variants are those that were annotated in the curation, but were not identified by Pilon. The called variants are those that were annotated in the curation that Pilon accurately identified.</p>

    Comparative view of a transposase-rich region of the <i>M. tuberculosis</i> F11 genome (coordinates 1,991,000 to 2,006,300) obtained from the draft (A) and Pilon-improved (B) assemblies.

    <p>In the draft assembly, three regions containing transposases (shown in blue) remained unassembled, resulting in gaps. In the Pilon-improved assembly, all three sets of transposases were successfully assembled. The Pilon-improved assembly also contained a hypothetical gene, <i>TBFG_11790</i> (shown in red), missing from the draft assembly. Though <i>TBFG_11790</i> was not fully closed in the Pilon-improved version, closer inspection revealed that there was a 42 bp overlap in assembled sequence at this site. By default, Pilon will not close a gap unless there is at least 95 bp of overlapping sequence, in order to minimize spurious joins.</p>
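
The minimum-overlap rule in this caption explains why a 42 bp overlap was left open. The idea can be illustrated with a short Python sketch, assuming the sequences extended from the left and right flanks of a gap are plain strings and that only exact suffix-to-prefix overlaps count. The function names are hypothetical, and this is not Pilon's actual joining logic; only the 95 bp default is taken from the caption.

```python
def longest_overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def try_close_gap(left, right, min_overlap=95):
    """Return the joined sequence, or None if the overlap is too short to trust."""
    k = longest_overlap(left, right)
    if k < min_overlap:
        return None  # leave the gap open rather than risk a spurious join
    return left + right[k:]
```

Under this rule a 42-base overlap is rejected at the default threshold and would only be joined if the caller lowered `min_overlap` to 42 or less.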

    Simplified overview of the Pilon workflow for assembly improvement and variant detection.

    <p>The left column depicts the conceptual steps of the Pilon process, and the center and right columns describe what Pilon does at each step while in assembly improvement and variant detection modes, respectively. During the first step (top row), Pilon scans the read alignments for evidence where the sequencing data disagree with the input genome and makes corrections to small errors and detects small variants. During the second step (second row), Pilon looks for coverage and alignment discrepancies to identify potential mis-assemblies and larger variants. Finally (bottom row), Pilon uses reads and mate pairs which are anchored to the flanks of discrepant regions and gaps in the input genome to reassemble the area, attempting to fill in the true sequence including large insertions. The resulting output is an improved assembly and/or a VCF file of variants.</p>
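
The first step of this workflow, scanning alignments for positions where the reads disagree with the input genome, can be illustrated with a toy pileup. This Python sketch assumes ungapped alignments given as `(start, sequence)` pairs and a simple majority vote; the function name and the `min_depth` and `min_fraction` thresholds are hypothetical, and Pilon's real evidence model is not described in this caption.

```python
from collections import Counter


def pileup_disagreements(reference, reads, min_depth=3, min_fraction=0.75):
    """Return (position, ref_base, alt_base) where aligned reads outvote the reference.

    `reads` is a list of (start, sequence) ungapped alignments, 0-based start.
    """
    # One base counter per reference position.
    columns = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    changes = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little evidence to judge this position
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            changes.append((pos, reference[pos], base))
    return changes
```

The same list of disagreements could be applied to the reference as corrections or reported as candidate variants, which mirrors the two modes in the figure.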

    Recall and precision metrics for <i>M. tuberculosis</i> F11 variants called against <i>M. tuberculosis</i> H37Rv by Pilon (with and without long insert library data), GATK UnifiedGenotyper and SAMtools.

    <p>The three rows marked with 'Single' indicate single nucleotide variants. The three rows marked with 'Multi' indicate variants involving two or more nucleotides, which also include very large events that span several Kb. Recall (R) is the fraction of curated events that were called by the program. Precision (P) is the fraction of calls that the program made that were also described in the curation. The F-measure is the harmonic mean of recall and precision and provides a measure of the trade-off between recall and precision. “N/A” indicates that all events of this type were captured in another variant category.</p>
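
The three metrics defined in this caption can be computed directly from the curated and called variant sets. This is a minimal Python sketch, assuming variants are hashable and that a call counts as correct only when it matches a curated event exactly; the function name is hypothetical, and returning 0.0 for empty sets is a convention chosen here.

```python
def recall_precision_f(curated, called):
    """Return (recall, precision, F-measure) for a set of calls against a curation."""
    true_pos = len(curated & called)
    recall = true_pos / len(curated) if curated else 0.0
    precision = true_pos / len(called) if called else 0.0
    if recall + precision == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of recall and precision.
        f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```

For example, with four curated events and five calls of which three are correct, recall is 3/4, precision is 3/5, and the F-measure is 2/3.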

    Genome statistics of <i>L. monocytogenes</i> genome sequences used in this study.

    <p>Contig N50 values are given for draft (unfinished) genomes assembled here, while the percent Q40 bases is given for all genomes assembled here.</p><p><sup>1</sup>N/A = not available, because the genome sequence is closed.</p><p><sup>2</sup>Percentage Q40 bases is only given for genome sequences newly presented in this publication.</p>
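
For readers unfamiliar with the contig N50 statistic mentioned in this table note, the following Python sketch uses the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases. The function name is hypothetical, and the listing does not say which tool the authors used to compute it.

```python
def contig_n50(lengths):
    """Contig N50 for a list of contig lengths; returns 0 for an empty assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    return 0
```

A closed genome consists of a single finished sequence, which is why the note reports N/A rather than an N50 for those entries.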