12 research outputs found

    Comparison of normalization methods for the analysis of metagenomic gene abundance data

    Background: In shotgun metagenomics, microbial communities are studied through direct sequencing of DNA without any prior cultivation. By comparing gene abundances estimated from the generated sequencing reads, functional differences between the communities can be identified. However, gene abundance data is affected by high levels of systematic variability, which can greatly reduce the statistical power and introduce false positives. Normalization, which is the process where systematic variability is identified and removed, is therefore a vital part of the data analysis. A wide range of normalization methods for high-dimensional count data has been proposed, but their performance on the analysis of shotgun metagenomic data has not been evaluated. Results: Here, we present a systematic evaluation of nine normalization methods for gene abundance data. The methods were evaluated through resampling of three comprehensive datasets, creating a realistic setting that preserved the unique characteristics of metagenomic data. Performance was measured in terms of the methods' ability to identify differentially abundant genes (DAGs), correctly calculate unbiased p-values and control the false discovery rate (FDR). Our results showed that the choice of normalization method has a large impact on the end results. When the DAGs were asymmetrically present between the experimental conditions, many normalization methods had a reduced true positive rate (TPR) and a high false positive rate (FPR). The methods trimmed mean of M-values (TMM) and relative log expression (RLE) had the overall highest performance and are therefore recommended for the analysis of gene abundance data. For larger sample sizes, cumulative sum scaling (CSS) also showed satisfactory performance. Conclusions: This study emphasizes the importance of selecting a suitable normalization method in the analysis of data from shotgun metagenomics. Our results also demonstrate that improper methods may result in unacceptably high levels of false positives, which in turn may lead to incorrect or obfuscated biological interpretation.
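To make the contrast between scaling methods concrete, the following is a minimal sketch of two of the normalization factors named in the abstract: total count (TC) and relative log expression (RLE), the latter written as the commonly described median-of-ratios estimator. It is not the study's implementation. The function names, the toy count matrix, the choice to skip genes with a zero count in any sample, and the rescaling of TC factors to average 1 are all assumptions made for illustration.

```python
import math
from statistics import median


def total_count_factors(counts):
    """Total count (TC) factors: each sample's library size over the mean library size.

    `counts` is a list of gene rows, each row holding one count per sample.
    Rescaling by the mean library size is an assumption of this sketch.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    mean_total = sum(totals) / n_samples
    return [t / mean_total for t in totals]


def rle_size_factors(counts):
    """RLE factors: per-sample median ratio of counts to the gene-wise geometric mean.

    Genes with a zero count in any sample are skipped (an assumption of this
    sketch) so that the log geometric mean stays finite.
    """
    n_samples = len(counts[0])
    log_rows = [[math.log(c) for c in row] for row in counts if all(c > 0 for c in row)]
    if not log_rows:
        raise ValueError("no gene has a nonzero count in every sample")
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical toy data: three unchanged genes and one gene that is
# 100-fold more abundant in the second sample only (an asymmetric effect).
counts = [[10, 10], [20, 20], [30, 30], [5, 500]]
print("TC :", total_count_factors(counts))
print("RLE:", rle_size_factors(counts))
```

In this toy case the single affected gene inflates the TC factor of the second sample, while the median-based RLE factors stay equal for both samples. That is the kind of behaviour under asymmetric effects that the abstract describes.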

    Evaluation of Normalization Methods for Metagenomic Data

    No full text

    HattCI: Fast and Accurate attC site Identification Using Hidden Markov Models.

    No full text
    Integrons are genetic elements that facilitate horizontal gene transfer in bacteria and are known to harbor genes associated with antibiotic resistance. The gene mobility in the integrons is governed by the presence of attC sites, which are 55 to 141-nucleotide-long imperfect inverted repeats. Here we present HattCI, a new method for fast and accurate identification of attC sites in large DNA data sets. The method is based on a generalized hidden Markov model that describes each core component of an attC site individually. Using twofold cross-validation experiments on a manually curated reference data set of 231 attC sites from class 1 and 2 integrons, HattCI showed high sensitivities of up to 91.9% while maintaining satisfactory false-positive rates. When applied to a metagenomic data set of 35 microbial communities from different environments, HattCI found a substantially higher number of attC sites in the samples that are known to contain more horizontally transferred elements. HattCI will significantly increase the ability to identify attC sites and thus integron-mediated genes in genomic and metagenomic data. HattCI is implemented in C and is freely available at http://bioinformatics.math.chalmers.se/HattCI

    Additional file 9 of Comparison of normalization methods for the analysis of metagenomic gene abundance data

    Figure S6. True false discovery rate for p-values adjusted using the Storey q-value method at an estimated false discovery rate of 0.05 (y-axis) for different distributions of effects between groups (x-axis): balanced (‘B’) with 10% of effects divided equally between the two groups, lightly-unbalanced (‘LU’) with effects added 75–25% in each group, unbalanced (‘U’) with all effects added to only one group, and heavily-unbalanced (‘HU’) with 20% of effects added to only one group. The results were based on resampled data consisting of two groups with 10 samples in each, and an average fold-change of 3. Three metagenomic datasets were used: Human gut I, Human gut II and Marine. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare). (PDF 132 kb)
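When the affected genes are known, as in resampled data with added effects, the true false discovery rate at a given cutoff is the fraction of called genes that are not truly affected. The following sketch shows that calculation. The function name, the `<=` comparison against the cutoff, and the convention of returning 0 when nothing is called are assumptions of this example, not details taken from the study.

```python
def true_fdr(adjusted_pvalues, is_true_effect, threshold=0.05):
    """Fraction of genes called significant that are not truly affected.

    `adjusted_pvalues` holds one adjusted p-value (or q-value) per gene and
    `is_true_effect` the matching ground-truth labels. Returns 0.0 when no
    gene is called, a convention assumed for this sketch.
    """
    called = [truth for p, truth in zip(adjusted_pvalues, is_true_effect) if p <= threshold]
    if not called:
        return 0.0
    false_positives = sum(1 for truth in called if not truth)
    return false_positives / len(called)


# Hypothetical example: three genes pass the 0.05 cutoff, one of them falsely.
adjusted = [0.01, 0.20, 0.03, 0.04]
truth = [True, False, False, True]
print(true_fdr(adjusted, truth))
```

A well-calibrated procedure should give a value close to the estimated rate (0.05 here) when averaged over many resampling iterations.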

    Additional file 8 of Comparison of normalization methods for the analysis of metagenomic gene abundance data

    No full text
    Figure S5. True false discovery rate for p-values adjusted using the Benjamini-Yekutieli method at an estimated false discovery rate of 0.05 (y-axis) for different distributions of effects between groups (x-axis): balanced (‘B’) with 10% of effects divided equally between the two groups, lightly-unbalanced (‘LU’) with effects added 75%-25% in each group, unbalanced (‘U’) with all effects added to only one group, and heavily-unbalanced (‘HU’) with 20% of effects added to only one group. The results were based on resampled data consisting of two groups with 10 samples in each, and an average fold-change of 3. Three metagenomic datasets were used: Human gut I, Human gut II and Marine. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), quantile-quantile (Quant), upper quartile (UQ), median (Med), total count (TC) and rarefying (Rare). (PDF 40 kb)
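For readers unfamiliar with the adjustment named in this caption, the following is a minimal sketch of the standard Benjamini-Yekutieli step-up procedure. With m tests, the p-value of rank k is multiplied by m * c(m) / k, where c(m) = 1 + 1/2 + ... + 1/m, and monotonicity is then enforced from the largest p-value downwards. The function name and the example p-values are hypothetical, and the study's own code is not reproduced here.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic correction that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so that adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes.
print(benjamini_yekutieli([0.01, 0.04, 0.03, 0.005]))
```

Genes whose adjusted value falls at or below 0.05 would be called differentially abundant at the estimated false discovery rate used in the figure.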

    Additional file 2 of Comparison of normalization methods for the analysis of metagenomic gene abundance data

    No full text
    Figure S2. Scatterplot of normalization factors for each pair of scaling methods. Normalization factors were estimated per sample in the Human gut I dataset, for group size 10+10, with 10% of effects divided equally between the two groups, and fold-change 3. Affected genes were randomly selected in 100 iterations. The number on the top-left of each plot indicates the Spearman correlation for the normalization factors presented in the plot. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), upper quartile (UQ), median (Med) and total count (TC). (PDF 316 kb)

    Additional file 1 of Comparison of normalization methods for the analysis of metagenomic gene abundance data

    No full text
    Figure S1. Histograms of Spearman correlations between normalization factors and raw counts of non-differentially abundant genes (non-DAGs). Spearman correlations were computed per gene in the Human gut I dataset, for group size 10+10, with 10% of effects divided equally between the two groups, and fold-change 3. Affected genes were randomly selected in 100 iterations. The following methods are included in the figure: trimmed mean of M-values (TMM), relative log expression (RLE), cumulative sum scaling (CSS), reversed cumulative sum scaling (RCSS), upper quartile (UQ), median (Med) and total count (TC). (PDF 76 kb)

    Additional file 7 of Comparison of normalization methods for the analysis of metagenomic gene abundance data

    No full text
    Table S3. True false discovery rate at an estimated false discovery rate of 0.05 for a group size of 10+10. (PDF 16 kb)