7 research outputs found

    The impact of transitive annotation on the training of taxonomic classifiers

    Get PDF
    IntroductionA common task in the analysis of microbial communities involves assigning taxonomic labels to the sequences derived from organisms found in the communities. Frequently, such labels are assigned using machine learning algorithms that are trained to recognize individual taxonomic groups based on training data sets that comprise sequences with known taxonomic labels. Ideally, the training data should rely on labels that are experimentally verified—formal taxonomic labels require knowledge of physical and biochemical properties of organisms that cannot be directly inferred from sequence alone. However, the labels associated with sequences in biological databases are most commonly computational predictions which themselves may rely on computationally-generated data—a process commonly referred to as “transitive annotation.”MethodsIn this manuscript we explore the implications of training a machine learning classifier (the Ribosomal Database Project’s Bayesian classifier in our case) on data that itself has been computationally generated. We generate new training examples based on 16S rRNA data from a metagenomic experiment, and evaluate the extent to which the taxonomic labels predicted by the classifier change after re-training.ResultsWe demonstrate that even a few computationally-generated training data points can significantly skew the output of the classifier to the point where entire regions of the taxonomic space can be disturbed.Discussion and conclusionsWe conclude with a discussion of key factors that affect the resilience of classifiers to transitively-annotated training data, and propose best practices to avoid the artifacts described in our paper

    NCBI’s virus discovery codeathon: building “FIVE” —the Federated Index of Viral Experiments API index

    Get PDF
    Viruses represent important test cases for data federation due to their genome size and the rapid increase in sequence data in publicly available databases. However, some consequences of previously decentralized (unfederated) data are lack of consensus or comparisons between feature annotations. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was elaborated. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomy annotations and virus–host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof-of-concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As per the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for optimal access of large quantities of data and is publicly accessible. Many projects of database or index federation fail to provide easier alternatives to access or query information. To this end, a Python API query system was developed to enhance the accessibility of FIVE

    Enchancing learning analytics platform for secondary schools: design and development of new Indicators

    No full text
    [CATALÀ] L'Objectiu principal de l'anĂ lisi de l'aprenentatge Ă©s proporcionar conclusions interessants sobre el comportament d'aprenentatge dels Estudiants mitjançant l'Ășs de les Dades en una plataforma d'aprenentatge. Es construeix una parcel·la de registres d'accĂ©s i un conjunt d'indicadors.[ANGLÈS] The main goal of Learning analytics is to provide conclusion and interesting observation on the learning behavior of students' by using the data from a learning platform. It constructs a plot of access logs and a set of indicators for motivation

    DATA-DRIVEN ALGORITHMS FOR CHARACTERIZING STRUCTURAL VARIATION IN METAGENOMIC DATA

    No full text
    Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on just single nucleotide poly-morphisms, or relying on reference-based analyses that miss complex structural variants. In this thesis, we describe data-driven methods to characterize variation in metagenomic data. In the first part of the thesis, I present our analysis of the structural variants identified from metagenomic whole genome shotgun sequencing data. I begin by describing the power of assembly graph analysis to detect over 9 million structural variants such as insertion/deletion, repeat copy-number changes, and mobile elements, in nearly 1,000 metagenomes generated as a part of the Human Microbiome Project. Next, I describe Binnacle, which is a structural variant aware binning algorithm. To improve upon the fragmented nature of assemblies, metagenomic binning is performed to cluster contigs that is likely to have originated from the same genome. We show that binning “graph-based” scaffolds, rather than contigs, improves the quality of the bins, and captures a broader set of the genes of the genomes being reconstructed. Finally, we present a case study of the microbial mats from the Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ∌ 50oC to ∌ 65oC and plays a key role in fixing carbon and nitrogen. Previous studies have isolated and generated good quality reference sequences of two major Synechococcus spp. that share a very high genomic content; OS-A and OS-B’. Despite the high abundance of the Synechococcus spp., metagenomic assembly of these organisms is challenging due to the large number of rearrangements between them. We explore the genomic diversity of the Synechococcus spp. using a reference genome, reliant assembly and scaffolding. We also highlight that the variants we detect can be used to fingerprint the local biogeography of the hot spring. In the second part of the thesis, I present our analysis of amplicon sequencing data, specifically the 16S rRNA gene sequences. I begin by describing SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate), which is a fast iterative algorithm for clustering large 16S rRNA gene datasets. We also show that SCRAPT is able to produce operational taxonomic units that are less fragmented than popular tools: UCLUST, CD-HIT and DNACLUST and runs orders of magnitude faster than existing methods. Finally, we study the impact of transitive annotation on taxonomic classifiers. Taxonomic labels are assigned using machine learning algorithms trained to recognize individual taxonomic groups based on training sequences with known taxonomic labels. Ideally, the training data should rely on experimentally verified-formal taxonomic labels however, the labels associated with sequences in biological databases are most commonly the result of computational predictions– “transitive annotation.” We demonstrate that even a few computationally-generated training data points can significantly skew the output of the classifier to the point where entire regions of the taxonomic space can be disturbed. We also discuss key factors that affect the resilience of classifiers to transitively annotated training data and propose best practices to avoid the artifacts described in this thesis

    Enchancing learning analytics platform for secondary schools: design and development of new Indicators

    No full text
    [CATALÀ] L'Objectiu principal de l'anĂ lisi de l'aprenentatge Ă©s proporcionar conclusions interessants sobre el comportament d'aprenentatge dels Estudiants mitjançant l'Ășs de les Dades en una plataforma d'aprenentatge. Es construeix una parcel·la de registres d'accĂ©s i un conjunt d'indicadors.[ANGLÈS] The main goal of Learning analytics is to provide conclusion and interesting observation on the learning behavior of students' by using the data from a learning platform. It constructs a plot of access logs and a set of indicators for motivation
    corecore