18 research outputs found

    Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

    <div><p>Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired end data from two Illumina libraries with small (<i>e.g.</i>, 180 bp) and large (<i>e.g.</i>, 3–5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.</p></div>

    Example Pilon generated genome browser tracks.

    <p>This region was flagged by Pilon as containing a possible local mis-assembly, but Pilon was unable to determine a fix due to a tandem repeat sequence. The tracks shown here include: the <i>Pilon Features</i> track, indicating the extent of the region flagged by Pilon as containing a potential mis-assembly; the <i>Valid Coverage</i> track, indicating the sequence coverage of valid read pair alignments excluding the clipped portions of the alignments; the <i>Clipped Alignments</i> track, indicating the number of reads soft-clipped at each location; and the <i>Pct Bad Alignments</i> track, indicating the percentage of the total reads aligned to each location which are not part of <i>Valid Coverage</i>. These tracks are created with the '--tracks' command-line option. Together, these tracks reveal the true bounds of the mis-assembly, and indicate that there are likely missing copies of the tandem repeat in the draft assembly. In this case, manual analysis revealed the draft assembly was missing two of three full copies of a 57-base tandem repeat.</p>

    Pilon's performance in calling variants in <i>M. tuberculosis</i> F11 that were larger than 50 nt.

    <p>Variants are divided by type across the rows. Missed variants are those that were annotated in the curation but were not identified by Pilon. Called variants are those that were annotated in the curation and that Pilon accurately identified.</p>

    Venn diagram of the overlap in false negative (A) and false positive (B) calls by the three variant detection tools, Pilon, GATK UnifiedGenotyper and SAMtools.

    <p>False negative calls are the unique events from the curation set that were missed by each tool. Overlaps in the Venn diagram show the number of variants that were missed by multiple tools. False positive calls are the predictions from <i>M. tuberculosis</i> F11 that were not supported by the curation set. Overlaps indicate predictions that were shared among tools.</p>
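The overlaps described in this caption are plain set operations. The following is a minimal sketch, assuming each variant can be reduced to a hashable identifier; the function names, the variant identifiers and the call sets are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def false_negatives(curated: Set[str], calls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Curated events that each tool failed to call."""
    return {tool: curated - called for tool, called in calls.items()}


def false_positives(curated: Set[str], calls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Calls from each tool that the curation set does not support."""
    return {tool: called - curated for tool, called in calls.items()}


def shared_by_all(per_tool: Dict[str, Set[str]]) -> Set[str]:
    """Events present in every tool's set (the centre of the Venn diagram)."""
    sets = list(per_tool.values())
    return set.intersection(*sets) if sets else set()


# Hypothetical identifiers: "v*" are curated events, "x*" are unsupported calls.
curated = {"v1", "v2", "v3", "v4", "v5"}
calls = {
    "Pilon": {"v1", "v2", "v3", "v4", "x1"},
    "GATK UnifiedGenotyper": {"v1", "v2", "x1", "x2"},
    "SAMtools": {"v1", "v3", "x2"},
}

fn = false_negatives(curated, calls)
fp = false_positives(curated, calls)
print("missed by all tools:", sorted(shared_by_all(fn)))
print("false positives shared by all tools:", sorted(shared_by_all(fp)))
```

Pairwise overlaps follow the same pattern, for example `fp["Pilon"] & fp["GATK UnifiedGenotyper"]`.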

    Summary assembly statistics before and after Pilon improvement.

    <p>In all cases the assemblies were more contiguous, contained more bases, and had fewer gaps and errors after Pilon improvement.</p>

    Whole Genome Sequencing of <i>Mycobacterium africanum</i> Strains from Mali Provides Insights into the Mechanisms of Geographic Restriction

    <div><p>Background</p><p><i>Mycobacterium africanum</i>, made up of lineages 5 and 6 within the <i>Mycobacterium tuberculosis</i> complex (MTC), causes up to half of all tuberculosis cases in West Africa, but is rarely found outside of this region. The reasons for this geographical restriction remain unknown. Possible reasons include a geographically restricted animal reservoir, a unique preference for hosts of West African ethnicity, and an inability to compete with other lineages outside of West Africa. These latter two hypotheses could be caused by loss of fitness or altered interactions with the host immune system.</p><p>Methodology/Principal Findings</p><p>We sequenced 92 MTC clinical isolates from Mali, including two lineage 5 and 24 lineage 6 strains. Our genome sequencing assembly, alignment, phylogeny and average nucleotide identity analyses enabled us to identify features that typify lineages 5 and 6 and made clear that these lineages do not constitute a distinct species within the MTC. We found that in Mali, lineage 6 and lineage 4 strains have similar levels of diversity and evolve drug resistance through similar mechanisms. In the process, we identified a putative novel streptomycin resistance mutation. In addition, we found evidence of person-to-person transmission of lineage 6 isolates and showed that lineage 6 is not enriched for mutations in virulence-associated genes.</p><p>Conclusions</p><p>This is the largest collection of lineage 5 and 6 whole genome sequences to date, and our assembly and alignment data provide valuable insights into what distinguishes these lineages from other MTC lineages. Lineages 5 and 6 do not appear to be geographically restricted due to an inability to transmit between West African hosts or to an elevated number of mutations in virulence-associated genes. 
However, lineage-specific mutations, such as mutations in cell wall structure, secretion systems and cofactor biosynthesis, provide alternative mechanisms that may lead to host specificity.</p></div>

    Comparative view of a transposase-rich region of the <i>M. tuberculosis</i> F11 genome (coordinates 1,991,000 to 2,006,300) obtained from the draft (A) and Pilon-improved (B) assemblies.

    <p>In the draft assembly, three regions containing transposases (shown in blue) remained unassembled, resulting in gaps. In the Pilon-improved assembly, all three sets of transposases were successfully assembled. The Pilon-improved assembly also contained a hypothetical gene, <i>TBFG_11790</i> (shown in red), missing from the draft assembly. Though <i>TBFG_11790</i> was not fully closed in the Pilon-improved version, closer inspection revealed that there was a 42 bp overlap in assembled sequence at this site. By default, to minimize spurious joins, Pilon will not close gaps unless there is at least 95 bp of overlapping sequence.</p>

    Simplified overview of the Pilon workflow for assembly improvement and variant detection.

    <p>The left column depicts the conceptual steps of the Pilon process, and the center and right columns describe what Pilon does at each step while in assembly improvement and variant detection modes, respectively. During the first step (top row), Pilon scans the read alignments for evidence where the sequencing data disagree with the input genome and makes corrections to small errors and detects small variants. During the second step (second row), Pilon looks for coverage and alignment discrepancies to identify potential mis-assemblies and larger variants. Finally (bottom row), Pilon uses reads and mate pairs which are anchored to the flanks of discrepant regions and gaps in the input genome to reassemble the area, attempting to fill in the true sequence including large insertions. The resulting output is an improved assembly and/or a VCF file of variants.</p>

    Recall and precision metrics for <i>M. tuberculosis</i> F11 variants called against <i>M. tuberculosis</i> H37Rv by Pilon (with and without long insert library data), GATK UnifiedGenotyper and SAMtools.

    <p>The three rows marked with 'Single' indicate single nucleotide variants. The three rows marked with 'Multi' indicate variants involving two or more nucleotides, which also include very large events that span several Kb. Recall (R) is the fraction of curated events that were called by the program. Precision (P) is the fraction of calls that the program made that were also described in the curation. The F-measure is the harmonic mean of recall and precision and provides a measure of the trade-off between recall and precision. “N/A” indicates that all events of this type were captured in another variant category.</p>
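The three metrics defined in this caption can be computed from simple counts. The sketch below assumes the standard definitions (recall = TP / curated events, precision = TP / calls made, F-measure = harmonic mean of the two); the function name and the example counts are hypothetical and are not taken from the paper's tables.

```python
from typing import Tuple


def recall_precision_f(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) from variant call counts.

    true_pos  : curated events that the program called
    false_pos : calls not described in the curation
    false_neg : curated events the program missed
    """
    curated = true_pos + false_neg   # all events in the curation set
    called = true_pos + false_pos    # all calls made by the program
    recall = true_pos / curated if curated else 0.0
    precision = true_pos / called if called else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # Harmonic mean of recall and precision
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Hypothetical counts for illustration only.
r, p, f = recall_precision_f(true_pos=90, false_pos=10, false_neg=30)
print(f"R={r:.3f} P={p:.3f} F={f:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a tool cannot reach a high F-measure by maximizing recall at the expense of precision, or the reverse.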

    Percentage of lineage-specific mutations in virulence associated genes.

    <p>A) Percentage of lineage-specific mutations in coding sequences of the genes in each category. Sassetti virulence genes are genes that were identified in [<a href="http://www.plosntds.org/article/info:doi/10.1371/journal.pntd.0004332#pntd.0004332.ref066" target="_blank">66</a>] as being required for virulence in mice. Sassetti essential and slow growth genes were identified by Sassetti et al. under <i>in vitro</i> conditions using TraSH [<a href="http://www.plosntds.org/article/info:doi/10.1371/journal.pntd.0004332#pntd.0004332.ref065" target="_blank">65</a>]. Rengarajan macrophage genes were identified by Rengarajan et al. as being required for growth in macrophages [<a href="http://www.plosntds.org/article/info:doi/10.1371/journal.pntd.0004332#pntd.0004332.ref067" target="_blank">67</a>]. Comas antigen genes were identified by Comas et al. as containing T cell epitopes [<a href="http://www.plosntds.org/article/info:doi/10.1371/journal.pntd.0004332#pntd.0004332.ref004" target="_blank">4</a>]. The color of the bar indicates the type of mutation. B) Percentage of lineage-specific pseudogenes falling into the categories defined above. Missing categories had no pseudogenes in any lineage. Lineage is indicated by the number below each bar, while ‘af’ indicates mutations found in both lineages 5 and 6 (both <i>M</i>. <i>africanum</i> lineages).</p>