
    Dinucleotide distance histograms for fast detection of rRNA in metatranscriptomic sequences

    With the advent of metatranscriptomics it has now become possible to study the dynamics of microbial communities. The analysis of environmental RNA-Seq data poses several challenges for the development of efficient bioinformatics tools. One of the first steps in the computational analysis of metatranscriptomic sequencing reads is the separation of rRNA and mRNA fragments, to ensure that only protein-coding sequences are used in a subsequent functional analysis. For the rRNA filtering task it is desirable to have a broad spectrum of methods available in order to find a suitable trade-off between speed and accuracy for a particular dataset. We introduce a machine learning approach for the detection of rRNA in metatranscriptomic sequencing reads that is based on support vector machines in combination with dinucleotide distance histograms for feature representation. The results show that our SVM-based approach is at least one order of magnitude faster than any of the existing tools, with only a slight degradation of the detection performance when compared to state-of-the-art alignment-based methods.
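To illustrate the kind of feature representation described above, here is a minimal sketch in Python. The abstract does not define the histogram precisely, so the definition used here is an assumption: for each of the 16 dinucleotides, count how often two consecutive occurrences lie d positions apart, for d up to a cutoff. The function name `dinucleotide_distance_histogram` and the parameter `max_distance` are hypothetical, and the SVM training step is omitted.

```python
from itertools import product

# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_distance_histogram(read, max_distance=8):
    """Return a flat feature vector of length 16 * max_distance.

    Assumed definition (not taken from the paper): for each dinucleotide,
    bin d-1 counts how often two consecutive occurrences of that
    dinucleotide start d positions apart, for 1 <= d <= max_distance.
    The vector is normalized to sum to 1 if any distance was counted.
    """
    read = read.upper()
    last_seen = {}  # dinucleotide -> start position of its latest occurrence
    counts = {dn: [0] * max_distance for dn in DINUCLEOTIDES}

    for i in range(len(read) - 1):
        dn = read[i:i + 2]
        if dn not in counts:
            continue  # skip windows containing ambiguous bases such as N
        if dn in last_seen:
            d = i - last_seen[dn]
            if d <= max_distance:
                counts[dn][d - 1] += 1
        last_seen[dn] = i

    features = [c for dn in DINUCLEOTIDES for c in counts[dn]]
    total = sum(features)
    if total:
        features = [c / total for c in features]
    return features


example_features = dinucleotide_distance_histogram("ACACAC")
```

A fixed-length vector like this could then be passed to any standard SVM implementation; because it requires no alignment against a reference database, a speed advantage of the kind reported in the abstract is plausible.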

    Critical Assessment of Metagenome Interpretation: A benchmark of metagenomics software

    In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.

    How to normalize metatranscriptomic count data for differential expression analysis

    Background: Differential expression analysis on the basis of RNA-Seq count data has become a standard tool in transcriptomics. Several studies have shown that prior normalization of the data is crucial for a reliable detection of transcriptional differences. Until now it has not been clear whether and how the transcriptomic approach can be used for differential expression analysis in metatranscriptomics. Methods: We propose a model for differential expression in metatranscriptomics that explicitly accounts for variations in the taxonomic composition of transcripts across different samples. As a main consequence the correct normalization of metatranscriptomic count data under this model requires the taxonomic separation of the data into organism-specific bins. Then the taxon-specific scaling of organism profiles yields a valid normalization and allows us to recombine the scaled profiles into a metatranscriptomic count matrix. This matrix can then be analyzed with statistical tools for transcriptomic count data. For taxon-specific scaling and recombination of scaled counts we provide a simple R script. Results: When applying transcriptomic tools for differential expression analysis directly to metatranscriptomic data with an organism-independent (global) scaling of counts the resulting differences may be difficult to interpret. The differences may correspond to changing functional profiles of the contributing organisms but may also result from a variation of taxonomic abundances. Taxon-specific scaling eliminates this variation and therefore the resulting differences actually reflect a different behavior of organisms under changing conditions. In simulation studies we show that the divergence between results from global and taxon-specific scaling can be drastic. In particular, the variation of organism abundances can imply a considerable increase of significant differences with global scaling. Also, on real metatranscriptomic data, the predictions from taxon-specific and global scaling can differ widely. Our studies indicate that in real data applications performed with global scaling it might be impossible to distinguish between differential expression in terms of transcriptomic changes and differential composition in terms of changing taxonomic proportions. Conclusions: As in transcriptomics, a proper normalization of count data is also essential for differential expression analysis in metatranscriptomics. Our model implies a taxon-specific scaling of counts for normalization of the data. The application of taxon-specific scaling consequently removes taxonomic composition variations from functional profiles and therefore provides a clear interpretation of the observed functional differences.
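The scale-then-recombine procedure can be sketched in a few lines of Python. This is not the authors' R script. The scaling rule used here (rescaling each sample of a taxon to that taxon's mean library size) is an assumed stand-in for whatever normalization method one would actually apply per taxon, and the names `scale_taxon` and `taxon_specific_scaling` are hypothetical.

```python
def scale_taxon(counts):
    """Scale one taxon's count matrix (rows = features, columns = samples).

    Assumed rule for illustration: every sample column is rescaled so that
    its total equals the mean of the nonzero column totals for this taxon.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    nonzero = [t for t in totals if t > 0]
    target = sum(nonzero) / len(nonzero) if nonzero else 0.0
    factors = [target / t if t > 0 else 0.0 for t in totals]
    return [[row[j] * factors[j] for j in range(n_samples)] for row in counts]


def taxon_specific_scaling(counts_by_taxon):
    """Scale each taxon separately, then sum the scaled matrices.

    counts_by_taxon maps a taxon name to a feature-by-sample matrix; all
    matrices must share the same feature rows and sample columns.
    """
    scaled = [scale_taxon(m) for m in counts_by_taxon.values()]
    n_features = len(scaled[0])
    n_samples = len(scaled[0][0])
    return [
        [sum(m[i][j] for m in scaled) for j in range(n_samples)]
        for i in range(n_features)
    ]


# Taxon A is twice as abundant in sample 2 but has the same functional
# profile in both samples; taxon B is unchanged.
example_counts = {
    "taxon_A": [[10, 20], [30, 60]],
    "taxon_B": [[5, 5], [5, 5]],
}
combined = taxon_specific_scaling(example_counts)
```

In this toy input the only change between samples is the abundance of taxon A, so the recombined matrix has identical columns. A single global scaling factor per sample would instead leave apparent functional differences that stem purely from composition.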

    Exploring Neighborhoods in the Metagenome Universe

    The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata are still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat, on average nine out of ten nearest neighbors represent the same habitat category, independent of the profiling method or distance measure used. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples will in the future need to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.
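The "nine out of ten nearest neighbors" figure is a neighborhood accuracy, which can be sketched as follows. This is a minimal illustration, not the CoMet-Universe implementation: it assumes each metagenome is already represented as a numeric profile vector, uses Euclidean distance as one example of a vector-based distance measure, and the function name `neighborhood_accuracy` is hypothetical.

```python
import math


def neighborhood_accuracy(profiles, labels, k=10):
    """Fraction of k-nearest neighbors that share the query's habitat label.

    Every profile serves as the query once; the query itself is excluded
    from its own neighbor list. Distance is Euclidean (an assumption; any
    vector-based distance could be substituted).
    """
    n = len(profiles)
    hits = 0
    total = 0
    for q in range(n):
        # Sort all other profiles by distance to the query.
        neighbors = sorted(
            (math.dist(profiles[q], profiles[i]), i) for i in range(n) if i != q
        )
        for _, i in neighbors[:k]:
            hits += labels[i] == labels[q]
            total += 1
    return hits / total if total else 0.0


# Toy data: two tight clusters with made-up habitat labels.
toy_profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_labels = ["soil", "soil", "marine", "marine"]
toy_accuracy = neighborhood_accuracy(toy_profiles, toy_labels, k=1)
```

The brute-force search above compares every pair of profiles, which is adequate for a few thousand metagenomes; a larger database would call for an index structure.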

    Cysteine-Rich Angiogenic Inducer 61 Improves Prognostic Accuracy of GRACE (Global Registry of Acute Coronary Events) 2.0 Risk Score in Patients With Acute Coronary Syndromes

    Background: It remains unclear whether the novel biomarker cysteine-rich angiogenic inducer 61 (CCN1) adds incremental prognostic value to the GRACE 2.0 (Global Registry of Acute Coronary Events) risk score and biomarkers high-sensitivity Troponin T, hsCRP (high-sensitivity C-reactive protein), and NT-proBNP (N-terminal pro-B-type natriuretic peptide) in patients with acute coronary syndromes. Methods and Results: Patients referred for coronary angiography with a primary diagnosis of acute coronary syndromes were enrolled in the Special Program University Medicine - Acute Coronary Syndromes and Inflammation cohort. The primary/secondary end points were 30-day/1-year all-cause mortality and the composite of all-cause mortality or myocardial infarction as used in the GRACE risk score. Associations between biomarkers and outcome were assessed using log-transformed biomarker values and the GRACE risk score (versions 1.0 and 2.0). The incremental value of CCN1 beyond a reference model was assessed using Harrell's C-statistics calculated from a Cox proportional-hazard model. The P value of the C-statistics was derived from a likelihood ratio test. Among 2168 patients recruited, 1732 could be analyzed. CCN1 was the strongest single predictor of all-cause mortality at 30 days (hazard ratio [HR], 1.77 [1.31, 2.40]) and 1 year (HR, 1.81 [1.47, 2.22]). Adding CCN1 alone to the GRACE 2.0 risk score improved C-statistics for prognostic accuracy of all-cause mortality at 30 days (0.87-0.88) and 1 year (0.81-0.82) and when combined with high-sensitivity Troponin T, hsCRP, NT-proBNP for 30 days (0.87-0.91), and for 1-year follow-up (0.81-0.84). CCN1 also increased the prognostic value for the composite of all-cause mortality or myocardial infarction. Conclusions: CCN1 predicts adverse outcomes in patients with acute coronary syndromes adding incremental information to the GRACE risk score, suggesting distinct underlying molecular mechanisms.
    Registration URL: https://www.clinicaltrials.gov. Unique identifier: NCT01000701.
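For readers unfamiliar with Harrell's C-statistic, the following Python sketch shows how it is computed from follow-up times, event indicators, and model risk scores. It is a simplified illustration on made-up numbers, not the study's analysis: the Cox model fit and the likelihood ratio test are omitted, pairs with tied follow-up times are ignored, and the function name `harrell_c` is hypothetical.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had the event and a shorter
    follow-up time than subject j. The pair is concordant when the subject
    with the earlier event also has the higher predicted risk; tied risk
    scores count as half concordant. Pairs with tied times are skipped
    (a simplification).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Made-up example: four patients, the third one censored.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_c = harrell_c(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading changes such as 0.87 to 0.91 in the abstract.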

    Critical assessment of metagenome interpretation − a benchmark of computational metagenomics software

    In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions