25 research outputs found

    Binning sequences using very sparse labels within a metagenome

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, <it>k</it>-mer and PhyloPythia, involve assigning sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes that are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.</p> <p>Results</p> <p>The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds.</p> <p>The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, <it>k</it>-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all <it>k</it>-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings, where S-GSOM performed better than PhyloPythia in the ā‰„ 10 reads datasets and comparable in the ā‰„ 8 kb benchmark tests.</p> <p>Conclusion</p> <p>In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.</p

    cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data

    Get PDF
    Summary: Huge amount of metagenomic sequence data have been produced as a result of the rapidly increasing efforts worldwide in studying microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for prediction of genomic elements such as genes are significantly different for chromosomes and plasmids, hence raising the need for separation of chromosomal from plasmid sequences in a metagenome. We present a program for classification of a metagenome set into chromosomal and plasmid sequences, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves āˆ¼92% in classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, the program has classification accuracy from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy

    Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data

    Full text link
    Background: Identification of biological specimens is a major requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances. Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy for species level. Discussion: CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).Comment: 47 pages, 13 figure

    Metagenomic Sequencing of an In Vitro-Simulated Microbial Community

    Get PDF
    Background: Microbial life dominates the earth, but many species are difficult or even impossible to study under laboratory conditions. Sequencing DNA directly from the environment, a technique commonly referred to as metagenomics, is an important tool for cataloging microbial life. This culture-independent approach involves collecting samples that include microbes in them, extracting DNA from the samples, and sequencing the DNA. A sample may contain many different microorganisms, macroorganisms, and even free-floating environmental DNA. A fundamental challenge in metagenomics has been estimating the abundance of organisms in a sample based on the frequency with which the organism's DNA was observed in reads generated via DNA sequencing. Methodology/Principal Findings: We created mixtures of ten microbial species for which genome sequences are known. Each mixture contained an equal number of cells of each species. We then extracted DNA from the mixtures, sequenced the DNA, and measured the frequency with which genomic regions from each organism was observed in the sequenced DNA. We found that the observed frequency of reads mapping to each organism did not reflect the equal numbers of cells that were known to be included in each mixture. The relative organism abundances varied significantly depending on the DNA extraction and sequencing protocol utilized. Conclusions/Significance: We describe a new data resource for measuring the accuracy of metagenomic binning methods, created by in vitro-simulation of a metagenomic community. Our in vitro simulation can be used to complement previous in silico benchmark studies. In constructing a synthetic community and sequencing its metagenome, we encountered several sources of observation bias that likely affect most metagenomic experiments to date and present challenges for comparative metagenomic studies. DNA preparation methods have a particularly profound effect in our study, implying that samples prepared with different protocols are not suitable for comparative metagenomics

    The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods.</p> <p>Results</p> <p>We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using both G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency.</p> <p>Conclusion</p> <p>We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further anlaysis and investigation. Unsupervised clustering revealed OFDEG related features performed better than standard tetranucleotide frequency in representing a relevant organism specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG. The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms.</p

    RAIphy: Phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles

    Get PDF
    Background: Computational analysis of metagenomes requires the taxonomical assignment of the genome contigs assembled from DNA reads of environmental samples. Because of the diverse nature of microbiomes, the length of the assemblies obtained can vary between a few hundred bp to a few hundred Kbp. Current taxonomic classification algorithms provide accurate classification for long contigs or for short fragments from organisms that have close relatives with annotated genomes. These are significant limitations for metagenome analysis because of the complexity of microbiomes and the paucity of existing annotated genomes. Results: We propose a robust taxonomic classification method, RAIphy, that uses a novel sequence similarity metric with iterative refinement of taxonomic models and functions effectively without these limitations. We have tested RAIphy with synthetic metagenomics data ranging between 100 bp to 50 Kbp. Within a sequence read range of 100 bp-1000 bp, the sensitivity of RAIphy ranges between 38%-81% outperforming the currently popular composition-based methods for reads in this range. Comparison with computationally more intensive sequence similarity methods shows that RAIphy performs competitively while being significantly faster. The sensitivityspecificity characteristics for relatively longer contigs were compared with the PhyloPythia and TACOA algorithms. RAIphy performs better than these algorithms at varying clade-levels. For an acid mine drainage (AMD) metagenome, RAIphy was able to taxonomically bin the sequence read set more accurately than the currently available methods, Phymm and MEGAN, and more accurately in two out of three tests than the much more computationally intensive method, PhymmBL. Conclusions: With the introduction of the relative abundance index metric and an iterative classification method, we propose a taxonomic classification algorithm that performs competitively for a large range of DNA contig lengths assembled from metagenome data. Because of its speed, simplicity, and accuracy RAIphy can be successfully used in the binning process for a broad range of metagenomic data obtained from environmental samples

    Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes

    Get PDF
    BACKGROUND: Investigating and understanding drug-drug interactions (DDIs) is important in improving the effectiveness of clinical care. DDIs can occur when two or more drugs are administered together. Experimentally based DDI detection methods require a large cost and time. Hence, there is a great interest in developing efficient and useful computational methods for inferring potential DDIs. Standard binary classifiers require both positives and negatives for training. In a DDI context, drug pairs that are known to interact can serve as positives for predictive methods. But, the negatives or drug pairs that have been confirmed to have no interaction are scarce. To address this lack of negatives, we introduce a Positive-Unlabeled Learning method for inferring potential DDIs. RESULTS: The proposed method consists of three steps: i) application of Growing Self Organizing Maps to infer negatives from the unlabeled dataset; ii) using a pairwise similarity function to quantify the overlap between individual features of drugs and iii) using support vector machine classifier for inferring DDIs. We obtained 6036 DDIs from DrugBank database. Using the proposed approach, we inferred 589 drug pairs that are likely to not interact with each other; these drug pairs are used as representative data for the negative class in binary classification for DDI prediction. Moreover, we classify the predicted DDIs as Cytochrome P450 (CYP) enzyme-Dependent and CYP-Independent interactions invoking their locations on the Growing Self Organizing Map, due to the particular importance of these enzymes in clinically significant interaction effects. Further, we provide a case study on three predicted CYP-Dependent DDIs to evaluate the clinical relevance of this study. CONCLUSION: Our proposed approach showed an absolute improvement in F1-score of 14 and 38% in comparison to the method that randomly selects unlabeled data points as likely negatives, depending on the choice of similarity function. We inferred 5300 possible CYP-Dependent DDIs and 592 CYP-Independent DDIs with the highest posterior probabilities. Our discoveries can be used to improve clinical care as well as the research outcomes of drug development

    Metagenomics - a guide from sampling to data analysis

    Get PDF
    Metagenomics applies a suite of genomic technologies and bioinformatics tools to directly access the genetic content of entire communities of organisms. The field of metagenomics has been responsible for substantial advances in microbial ecology, evolution, and diversity over the past 5 to 10 years, and many research laboratories are actively engaged in it now. With the growing numbers of activities also comes a plethora of methodological knowledge and expertise that should guide future developments in the field. This review summarizes the current opinions in metagenomics, and provides practical guidance and advice on sample processing, sequencing technology, assembly, binning, annotation, experimental design, statistical analysis, data storage, and data sharing. As more metagenomic datasets are generated, the availability of standardized procedures and shared data storage and analysis becomes increasingly important to ensure that output of individual projects can be assessed and compared
    corecore