1,834 research outputs found

    Flexible taxonomic assignment of ambiguous sequencing reads

    Background: To characterize the diversity of bacterial populations in metagenomic studies, sequencing reads need to be accurately assigned to taxonomic units in a given reference taxonomy. Reads that cannot be reliably assigned to a unique leaf in the taxonomy (ambiguous reads) are typically assigned to the lowest common ancestor of the set of species that match them. This introduces a potentially severe error in the estimation of bacteria present in the sample due to false positives, since all species in the subtree rooted at the ancestor are implicitly assigned to the read even though many of them may not match it. Results: We present a method that maps each read to a node in the taxonomy that minimizes a penalty score while balancing the relevance of precision and recall in the assignment through a parameter q. This mapping can be obtained in time linear in the number of matching sequences, because LCA queries to the reference taxonomy take constant time. When applied to six different metagenomic datasets, our algorithm produces different taxonomic distributions depending on whether coverage or precision is maximized. Including information on the quality of the reads reduces the number of unassigned reads but increases the number of ambiguous reads, stressing the relevance of our method. Finally, two measures of performance are described and results with a set of artificially generated datasets are discussed. Conclusions: The assignment strategy of sequencing reads introduced in this paper is a versatile and quick method to study bacterial communities. The bacterial composition of the analyzed samples can vary significantly depending on how ambiguous reads are assigned, which in turn depends on the value of the q parameter. Validation of our results in an artificial dataset confirms that a combination of values of q produces the most accurate results.
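The abstract above does not spell out the penalty score, so the following is only a minimal sketch of the idea under stated assumptions. It assumes a penalty of the form q * FN/TP + (1 - q) * FP/TP, where TP, FP and FN are counted over the leaves under a candidate node relative to the leaves a read matched. The function names, the dictionary-based taxonomy, and the direction in which q weights recall versus precision are all hypothetical, and this brute-force version does not attempt the linear-time bound mentioned in the abstract.

```python
from collections import defaultdict


def leaves_under(children, node):
    """Return the set of leaves in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_under(children, kid)
    return found


def assign_read(parent, matches, q):
    """Pick the taxonomy node with the smallest assumed penalty for one read.

    parent:  dict mapping each node to its parent (the root maps to None)
    matches: non-empty set of leaves that the read matched
    q:       weight in [0, 1]; here q weights false negatives (an assumption)
    """
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)

    # Candidate nodes: every matched leaf and all of its ancestors.
    candidates = set()
    for leaf in matches:
        node = leaf
        while node is not None:
            candidates.add(node)
            node = parent[node]

    best, best_score = None, float("inf")
    for node in sorted(candidates):  # sorted for deterministic tie-breaking
        covered = leaves_under(children, node)
        tp = len(covered & matches)  # matched leaves under the node
        fp = len(covered - matches)  # unmatched leaves under the node
        fn = len(matches - covered)  # matched leaves outside the node
        score = q * fn / tp + (1 - q) * fp / tp  # assumed penalty form
        if score < best_score:
            best, best_score = node, score
    return best, best_score
```

In a toy taxonomy where a read matches two of three species in a genus, this sketch assigns the read to a single species when q = 0 (no false positives tolerated) and to the genus when q = 1 (no false negatives tolerated), which illustrates how one parameter can shift the assignment between precision and coverage.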

    Unbiased taxonomic annotation of metagenomic samples

    The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this article, we show that the Rand index is a better indicator of classification error than the often used area under the receiver operating characteristic (ROC) curve and F-measure for both balanced and imbalanced reference taxonomies, and we also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time and an exact solution can be obtained by integer linear programming. Experimental results with a proof-of-concept implementation of the set cover approach to taxonomic annotation in a next release of the TANGO software show that the set cover approach further reduces ambiguity in the taxonomic annotation obtained with TANGO without distorting the relative abundance profile of the metagenomic sample. Peer Reviewed. Postprint (published version).
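The set cover reduction mentioned above can be illustrated with the textbook greedy heuristic, which is known to give a logarithmic approximation. This is a sketch, not the TANGO implementation: the input format (a hypothetical dictionary from taxon to the set of read identifiers that taxon could explain) is an assumption, and this simple version rescans all taxa on every round rather than achieving the linear running time claimed in the abstract.

```python
def greedy_set_cover(candidates):
    """Greedily choose taxa until every read is explained.

    candidates: dict mapping a taxon name to the set of read ids
                that could be annotated with that taxon (assumed format)
    Returns the chosen taxa in the order they were picked.
    """
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Pick the taxon explaining the most still-uncovered reads;
        # iterating in sorted order makes ties deterministic.
        taxon = max(sorted(candidates),
                    key=lambda t: len(candidates[t] & uncovered))
        chosen.append(taxon)
        uncovered -= candidates[taxon]
    return chosen
```

For example, with taxa A = {1, 2, 3}, B = {3, 4}, C = {4, 5} and D = {1}, the heuristic picks A first and then C, covering all five reads with two taxa.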

    Unbiased taxonomic annotation of metagenomic samples

    The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this paper, we show that the Rand index is a better indicator of classification error than the often used area under the ROC curve and F-measure for both balanced and imbalanced reference taxonomies, and we also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time. Peer Reviewed. Postprint (author's final draft).
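To make the comparison between the Rand index and the F-measure concrete, the sketch below computes both from confusion counts. It assumes the Rand index is taken in its binary-classification form, (TP + TN) / (TP + FP + FN + TN); the abstract does not state the exact definition used, so this form and the function names are assumptions for illustration only.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; ignores true negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def rand_index(tp, fp, fn, tn):
    """Fraction of correct decisions; true negatives count (assumed form)."""
    return (tp + tn) / (tp + fp + fn + tn)
```

Because true negatives enter only the Rand index, the two measures can respond differently when the number of leaves outside a candidate node changes, for instance between a balanced and an imbalanced taxonomy.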

    Genomics clarifies taxonomic boundaries in a difficult species complex.

    Efforts to taxonomically delineate species are often confounded by conflicting information and subjective interpretation. Advances in genomic methods have resulted in a new approach to taxonomic identification that stands to greatly reduce much of this conflict. This approach is ideal for species complexes, where divergence times are recent (evolutionarily) and lineages less well defined. The California Roach/Hitch fish species complex is an excellent example, experiencing a convoluted geologic history, diverse habitats, conflicting species designations and potential admixture between species. Here we use this fish complex to illustrate how genomics can be used to better clarify and assign taxonomic categories. We performed restriction-site associated DNA (RAD) sequencing on 255 Roach and Hitch samples collected throughout California to discover and genotype thousands of single nucleotide polymorphisms (SNPs). Data were then used in hierarchical principal component, admixture, and FST analyses to provide results that consistently resolved a number of ambiguities and provided novel insights across a range of taxonomic levels. At the highest level, our results show that the CA Roach/Hitch complex should be considered five species split into two genera (4 + 1) as opposed to two species from distinct genera (1 + 1). Subsequent levels revealed multiple subspecies and distinct population segments within identified species. At the lowest level, our results indicate Roach from a large coastal river are not native but instead introduced from a nearby river. Overall, this study provides a clear demonstration of the power of genomic methods for informing taxonomy and serves as a model for future studies wishing to decipher difficult species questions. By allowing for systematic identification across multiple scales, taxonomic structure can then be tied to historical and contemporary ecological, geographic or anthropogenic factors.

    Further Steps in TANGO: improved taxonomic assignment in metagenomics

    Motivation: TANGO is one of the most accurate tools for the taxonomic assignment of sequence reads. However, because of differences in taxonomy structures, performing a taxonomic assignment on different reference taxonomies will produce divergent results. Results: We have improved the TANGO pipeline so that it can perform the taxonomic assignment of a metagenomic sample using alternative reference taxonomies coming from different sources. We highlight the novel pre-processing step necessary to accomplish this task, and describe the improvements in the assignment process. We present the new TANGO pipeline in detail and, finally, we show its performance on four real metagenomic datasets and also on synthetic datasets. Availability: The new version of TANGO, including implementation improvements and novel developments to perform the assignment on different reference taxonomies, is freely available at http://sourceforge.net/projects/taxoassignment/. Contact: [email protected]

    Taxonomic assignment in metagenomics with TANGO

    One of the main computational challenges facing metagenomic analysis is the taxonomic identification of short DNA fragments. The combination of sequence alignment methods with taxonomic assignment based on consensus can provide an accurate estimate of the microbial diversity in a sample. In this note, we show how recent improvements to these consensus methods, as implemented in the latest release of the TANGO tool, can provide an improved estimate of diversity in simulated datasets. Peer Reviewed. Postprint (published version).