15 research outputs found

    Classifying short genomic fragments from novel lineages using composition and homology

    Abstract. Background: The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species or strain, or rank-flexible classifiers, which choose an appropriate taxonomic rank for each sequence in a data set. The choice of rank typically depends on the optimal model for a given sequence and on the breadth of taxonomic groups seen in a set of close-to-optimal models. Homology-based (e.g., LCA) and composition-based (e.g., PhyloPythia, TACOA) rank-flexible classifiers have been proposed, but there is at present no hybrid approach that utilizes both homology and composition. Results: We first develop a hybrid, rank-specific classifier based on BLAST and Naïve Bayes (NB) that has comparable accuracy and a faster running time than the current best approach, PhymmBL. By substituting LCA for BLAST or allowing the inclusion of suboptimal NB models, we obtain a rank-flexible classifier. This hybrid classifier outperforms established rank-flexible approaches on simulated metagenomic fragments of length 200 bp to 1000 bp and is able to assign taxonomic attributions to a subset of sequences with few misclassifications. We then demonstrate the performance of different classifiers on an enhanced biological phosphorus removal metagenome, illustrating the advantages of rank-flexible classifiers when representative genomes are absent from the set of reference genomes. Application to a glacier ice metagenome demonstrates that similar taxonomic profiles are obtained across a set of classifiers which are increasingly conservative in their classification. Conclusions: Our NB-based classification scheme is faster than the current best composition-based algorithm, Phymm, while providing equally accurate predictions. The rank-flexible variant of NB, which we term Δ-NB, is complementary to LCA and can be combined with it to yield conservative prediction sets of very high confidence. The simple parameterization of LCA and Δ-NB allows for tuning of the balance between more predictions and increased precision, allowing the user to account for the sensitivity of downstream analyses to misclassified or unclassified sequences.
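The rank-flexible idea described in this abstract (pick the rank that covers all close-to-optimal models) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `near_optimal` and `lowest_common_ancestor`, the score values, the `delta` threshold, and the example lineages are all hypothetical and chosen only for illustration.

```python
def near_optimal(scores, delta):
    """Return the labels whose score is within `delta` of the best score.

    `scores` maps a model label to a score where higher is better
    (for example a log-likelihood). The threshold is an assumption.
    """
    best = max(scores.values())
    return [label for label, score in scores.items() if best - score <= delta]


def lowest_common_ancestor(lineages):
    """Return the deepest taxonomic prefix shared by every lineage.

    Each lineage is a list of taxon names ordered from root to tip.
    """
    if not lineages:
        return []
    shared = []
    for names in zip(*lineages):
        if all(name == names[0] for name in names):
            shared.append(names[0])
        else:
            break
    return shared


# Hypothetical lineages and scores for the candidate models of one read.
lineages = {
    "model_a": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
    "model_b": ["Bacteria", "Proteobacteria", "Betaproteobacteria"],
    "model_c": ["Bacteria", "Firmicutes", "Bacilli"],
}
scores = {"model_a": -100.0, "model_b": -101.5, "model_c": -140.0}

kept = near_optimal(scores, delta=5.0)
assignment = lowest_common_ancestor([lineages[label] for label in kept])
print(kept, assignment)
```

Raising `delta` keeps more candidate models and therefore tends to push the assignment toward a higher (more conservative) rank, which mirrors the trade-off between more predictions and increased precision mentioned in the abstract.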

    Simultaneous genome sequencing of symbionts and their hosts

    Second-generation sequencing has made possible the sequencing of genomes of interest for even small research groups. However, obtaining separate clean cultures and clonal or inbred samples of metazoan hosts and their bacterial symbionts is often difficult. We present a computational pipeline for separating metazoan and bacterial DNA in silico rather than at the bench. The method relies on the generation of deep coverage of all the genomes in a mixed sample using Illumina short-read sequencing technology, and using aggregate properties of the different genomes to identify read sets belonging to each. This inexpensive and rapid approach has been used to sequence several nematode genomes and their bacterial endosymbionts in the last year in our laboratory and can also be used to visualize and identify unexpected contaminants (or possible symbionts) in genomic DNA samples. We hope that this method will enable researchers studying symbiotic systems to move from gene-centric to genome-centric approaches
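As a rough illustration of separating contigs by aggregate properties, the following Python sketch partitions contigs using two such properties. The abstract does not say which properties the pipeline uses; GC content and read coverage are assumed here, and the function names, cut-off values, group labels, and example contigs are hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / called


def partition_contigs(contigs, gc_cut, coverage_cut):
    """Split contigs into two groups using simple cut-offs.

    `contigs` maps a contig name to a (sequence, coverage) pair.
    Both cut-offs are assumptions that would be chosen per sample.
    """
    groups = {"group_1": [], "group_2": []}
    for name, (sequence, coverage) in contigs.items():
        if gc_content(sequence) >= gc_cut and coverage >= coverage_cut:
            groups["group_1"].append(name)
        else:
            groups["group_2"].append(name)
    return groups


# Hypothetical toy contigs: (sequence, mean read coverage).
contigs = {
    "contig_a": ("ATATATGCAT", 35.0),
    "contig_b": ("GCGCGGCCAT", 410.0),
    "contig_c": ("GGCCGCGCTA", 380.0),
}
groups = partition_contigs(contigs, gc_cut=0.5, coverage_cut=100.0)
print(groups)
```

In practice the cut-offs would be chosen after plotting the two properties against each other for all contigs, since the groups are usually identified visually or by clustering rather than by fixed thresholds.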

    Metagenome – Processing and Analysis

    Metagenome means “multiple genomes”, and the culture-independent study of the genomic content of an environment is called metagenomics. With the advent of powerful and economical next-generation sequencing technology, sequencing has become cheaper and faster, and the study of genes and phenotypes is shifting from single organisms to whole communities present in natural environmental samples. Once sequence data are obtained from an environmental sample, the challenge is to process, assemble and bin the metagenome data in order to obtain as accurate and complete a representation as possible of the populations present in the community, or a high-confidence draft assembly. In this paper we describe the existing bioinformatics workflow for processing metagenomic data. Next, we examine one way of parallelizing a sequence similarity program on a High Performance Computing (HPC) cluster, since sequence similarity search is the most frequently used technique throughout metagenome data processing and analysis. To address the challenges of analyzing the result files produced by a sequence similarity program, we developed a web application called the Contig Analysis Tool (CAT). We then applied these tools and techniques to real-world virome metagenomic data, i.e., the genomes of all the viruses present in an environmental sample obtained from microbial mats in hot springs in Yellowstone National Park. The assembly and binning of virome data are challenging for three reasons: (1) existing databases contain few viral sequences for similarity searches; (2) there is no reference genome; and (3) viruses lack phylogenetic marker genes like those present in bacteria and archaea. We show how we overcame these problems by performing sequence similarity searches using CRISPR data and sequence composition analysis using tetranucleotide frequencies.
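The sequence-composition step mentioned at the end of this abstract can be sketched in Python as follows. This is a minimal illustration of computing tetranucleotide frequency profiles and comparing two of them, not the paper's code; the function names, the choice of Euclidean distance, and the example sequences are assumptions.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence, k=4):
    """Return the relative frequency of every k-mer over A, C, G, T.

    Overlapping windows are counted; windows containing ambiguous
    bases (anything other than A, C, G, T) are skipped.
    """
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    all_kmers = ["".join(letters) for letters in product("ACGT", repeat=k)]
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def euclidean_distance(profile_a, profile_b):
    """Distance between two frequency profiles with identical keys."""
    return sum((profile_a[key] - profile_b[key]) ** 2 for key in profile_a) ** 0.5


# Hypothetical toy contigs; real contigs would be thousands of bases long.
profile_1 = tetranucleotide_frequencies("ACGTACGT")
profile_2 = tetranucleotide_frequencies("ACGTACGTACGT")
profile_3 = tetranucleotide_frequencies("GGGGCCCCGGGG")
print(euclidean_distance(profile_1, profile_2))
print(euclidean_distance(profile_1, profile_3))
```

Contigs with similar profiles have a small distance and are candidates for the same bin. Published binning tools often also merge reverse-complement k-mers and normalize the profiles, which this sketch omits.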

    Computational meta'omics for microbial community studies

    Complex microbial communities are an integral part of the Earth's ecosystem and of our bodies in health and disease. In the last two decades, culture-independent approaches have provided new insights into their structure and function, with the exponentially decreasing cost of high-throughput sequencing resulting in broadly available tools for microbial surveys. However, the field remains far from reaching a technological plateau, as both computational techniques and nucleotide sequencing platforms for microbial genomic and transcriptional content continue to improve. Current microbiome analyses are thus starting to adopt multiple and complementary meta'omic approaches, leading to unprecedented opportunities to comprehensively and accurately characterize microbial communities and their interactions with their environments and hosts. This diversity of available assays, analysis methods, and public data is in turn beginning to enable microbiome-based predictive and modeling tools. We thus review here the technological and computational meta'omics approaches that are already available, those that are under active development, their success in biological discovery, and several outstanding challenges

    Dietary Energy Level Promotes Rumen Microbial Protein Synthesis by Improving the Energy Productivity of the Ruminal Microbiome

    Improving the yield of rumen microbial protein (MCP) is important for promoting animal performance and reducing the waste of protein feed. The amount of energy supplied to rumen microorganisms is an important factor affecting the amount of protein nitrogen incorporated into rumen MCP. Substrate-level phosphorylation (SLP) and electron transport phosphorylation (ETP) are two major mechanisms of energy generation within microbial cells. However, it is not yet known how dietary energy and protein levels affect the energy productivity of the ruminal microbiome and, in turn, rumen MCP yield. In the present study, we used animal experiments and shotgun metagenome sequencing to investigate the effects of energy-rich and protein-rich diets on rumen MCP yield and on the SLP-coupled and ETP-coupled energy productivity of the ruminal microbiome. We found that an energy-rich diet induces a significant increase in rumen MCP yield, whereas a protein-rich diet has no significant impact on it. Based on 10 reconstructed pathways related to the energy metabolism of the ruminal microbiome, we determined that the energy-rich diet induces significant increases in the total abundance of SLP enzymes coupled to nicotinamide adenine dinucleotide (NADH) oxidation in glucose fermentation and of the F-type ATPase of the electron transport chain, whereas the protein-rich diet has no significant impact on the abundance of these enzymes. At the species level, the energy-rich diet induces significant increases in the total abundance of 15 ETP-related genera and 40 genera that have SLP-coupled fermentation pathways, whereas the protein-rich diet has no significant impact on the total abundance of these genera. Our results suggest that an increase in dietary energy level promotes rumen energy productivity and MCP yield by improving levels of ETP and SLP coupled to glucose fermentation in the ruminal microbiome, whereas an increase in dietary protein level has no such effect.

    Expansion of Microbial Virology by Impetus of the Reduction of Viral Dark Matter

    Modern metagenomic methods have rapidly accelerated the rate of viral discovery. Currently, to discover a novel virus, deep sequencing reads must align to a known reference virus. While alignment is effective at identifying closely related viruses, highly divergent viruses can often share no discernable sequence alignment with known viruses. Therefore, the accurate classification of viral dark matter – metagenomic sequences that originate from viruses but do not align to any reference virus sequences – is one of the major obstacles in not only discovering novel viruses, but also by extension, comprehensively defining the virome. As viral dark matter results fundamentally from a failure to align sequence reads, two major contributors to viral dark matter include 1) the lack of diversity in specific viral families and 2) the reliance on alignment as a metric to define viral taxonomy. In this dissertation, I address each of these issues. These projects resulted in a massive expansion in understanding of microbial virus diversity, which led me to further interrogate the biology of microbial viruses. Specifically, I attempted to identify novel antiviral mechanisms against RNA bacteriophages and possibly identify a novel family of RNA bacteriophages. First, I addressed the underrepresentation of viral sequences in databases by identifying a specific underrepresented class of virus, bacteriophages with RNA genomes, and systematically discovered highly divergent novel RNA bacteriophages in previously sequenced data. I identified 161 partial genome sequences from at least 122 RNA bacteriophage phylotypes that are highly divergent from each other and from previously described RNA bacteriophages. 
These partial genome sequences displayed multiple novel genome organizations previously unknown for RNA bacteriophages and, in aggregate, encoded 91 open reading frames (ORFs) that did not align to any known protein; sequences related to these ORFs would be described as viral dark matter in the absence of this systematic discovery effort. This new level of RNA bacteriophage diversity suggested that RNA bacteriophages might be major predators of bacteria in the environment. In turn, this would suggest that there might be active resistance mechanisms in bacteria that specifically antagonize RNA bacteriophages; as of now, however, no such mechanisms are known. Therefore, one goal was to identify bacterial genes that can restrict RNA bacteriophage infection. I performed a functional metagenomic screen to identify RNA phage resistance genes. From this, I identified four genes that conferred resistance to the RNA phages Qβ and MS2 but not to the RNA phage C1. Additionally, this expansion of RNA bacteriophage diversity suggests that there might be new families of RNA bacteriophages that are unrelated to the previously discovered RNA bacteriophages. One candidate eukaryotic viral family that might in fact consist of RNA bacteriophages is the Picobirnaviridae. Picobirnaviruses are bisegmented RNA viruses that are highly prevalent in stool. By analyzing previously sequenced datasets, I discovered multiple new picobirnavirus segments. From analyzing the upstream regions of the ORFs on these segments, I found that almost all of the ORFs are preceded by a bacterial ribosomal binding sequence. This conservation of bacterial ribosomal binding sequences suggests that these viruses might infect bacteria. I then tried, unsuccessfully, to show that Human Picobirnavirus can replicate in bacterial cells. Second, I addressed the reliance on alignment-based algorithms by developing a novel alignment-independent algorithm to identify viral sequences. 
This algorithm, DiscoVir, is a support vector machine (SVM) model that relies on nucleotide k-mer frequencies to discriminate sequences of novel, highly disparate eukaryotic viruses from prokaryotic and fungal sequences. I validated in silico that DiscoVir can identify viruses from novel viral taxa and that it outperforms BLASTx for almost all viral families. When applied to an authentic metagenomic dataset, DiscoVir identified two additional contigs that corresponded to two undetected segments of a novel bunya-like virus. By selectively culturing fungi from this serum sample, I identified an isolate of Penicillium atramentosum that contained all three viral RNA segments, thus suggesting that this fungal isolate was in fact the host of this novel virus. I sequenced the whole genome of this novel virus and demonstrated that the terminal nucleotide sequences were conserved between the three segments, and these sequences were consistent with the termini of bunyaviruses in the genera Phlebovirus and Tenuivirus. Thus, application of DiscoVir played a critical role in the identification of the first segmented negative stranded RNA virus infection of a fungus. Taken together, I have contributed to the systematic reduction of viral dark matter using two different approaches, both of which enable future researchers to identify a much more diverse repertoire of viruses than previously possible. This increased ability to identify highly divergent viruses will better enable the metagenomics community to accurately identify the role of viruses in larger biological processes, including but not limited to, human disease