88 research outputs found

    Optimal Scaling of Digital Transcriptomes

    Deep sequencing of transcriptomes has become an indispensable tool for biology, enabling expression levels for thousands of genes to be compared across multiple samples. Since transcript counts scale with sequencing depth, counts from different samples must be normalized to a common scale prior to comparison. We analyzed fifteen existing and novel algorithms for normalizing transcript counts, and evaluated the effectiveness of the resulting normalizations. For this purpose we defined two novel and mutually independent metrics: (1) the number of “uniform” genes (genes whose normalized expression levels have a sufficiently low coefficient of variation), and (2) low Spearman correlation between normalized expression profiles of gene pairs. We also defined four novel algorithms, one of which explicitly maximizes the number of uniform genes, and compared the performance of all fifteen algorithms. The two most commonly used methods (scaling to a fixed total value, or equalizing the expression of certain ‘housekeeping’ genes) yielded particularly poor results, surpassed even by normalization based on randomly selected gene sets. Conversely, seven of the algorithms approached what appears to be optimal normalization. Three of these algorithms rely on the identification of “ubiquitous” genes: genes expressed in all the samples studied, but never at very high or very low levels. We demonstrate that these include a “core” of genes expressed in many tissues in a mutually consistent pattern, which is suitable for use as an internal normalization guide. The new methods yield robustly normalized expression values, which is a prerequisite for the identification of differentially expressed and tissue-specific genes as potential biomarkers.
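The first metric in the abstract can be illustrated with a minimal sketch. It scales each sample to a fixed total (one of the commonly used methods the abstract mentions) and then counts genes whose coefficient of variation across samples falls below a cutoff. The function names, the target total, and the 0.25 cutoff are assumptions made for illustration; the abstract does not state the threshold the authors used.

```python
from statistics import mean, pstdev


def scale_to_total(counts, target=1_000_000):
    """Scale one sample's counts so that they sum to `target`."""
    total = sum(counts)
    return [c * target / total for c in counts]


def count_uniform_genes(samples, cv_threshold=0.25):
    """Count genes whose levels across samples have a low coefficient of variation.

    `samples` is a list of per-sample lists, all in the same gene order.
    `cv_threshold` is a hypothetical cutoff, not a value from the paper.
    """
    n_uniform = 0
    for levels in zip(*samples):  # one tuple of levels per gene
        m = mean(levels)
        if m > 0 and pstdev(levels) / m < cv_threshold:
            n_uniform += 1
    return n_uniform


# Toy data: three samples, three genes, with different sequencing depths.
raw = [[10, 20, 70], [20, 40, 140], [5, 10, 85]]
scaled = [scale_to_total(sample, target=100) for sample in raw]

print(count_uniform_genes(raw), count_uniform_genes(scaled))
```

On this toy data, scaling raises the count of uniform genes, which is the sense in which the metric rewards a normalization.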

    Pairwise comparison of expression levels.

    We compared the levels of expression of 15,861 genes with nonzero expression levels in both liver and testes, expressed in terms of average coverage per base. Each point represents one gene. A) Data prior to normalization. Housekeeping genes are highlighted as green points and labeled. The blue and red diagonals represent the relative correction factors computed based on total counts or the NCS method, relative to no normalization (black). The magenta and orange curves depict the percentiles when considering all genes or genes with nonzero values, respectively. B) Values after correction by NCS. Points in black or red denote genes with positive weights, and that therefore guided the scaling. Points in red denote the 39 genes with weight >0.5.

    Comparison of performance of various normalization methods.

    Each method is evaluated by the number of genes observed to be consistently expressed across samples (abscissa); different methods also yield different numbers of genes identified as specific to one sample. The numbers in the orange circles denote the number of housekeeping genes combined using the geNorm algorithm. The dashed arrows show one stochastic path of the ES from the data prior to normalization (white square, “None”) to the best approximation to the optimal solution (gray square, ES). Brown squares represent the results obtained via the TMM method, using each of the 16 samples as reference.

    Additional file 4: of Novel metrics for quantifying bacterial genome composition skews

    Table S3. Presence and absence of genes associated with replication, recombination, and repair. Data were retrieved from the KEGG Orthology database [45] for homologous recombination (ko03440), mismatch repair (ko03430), DNA repair and recombination (ko03400), nucleotide excision repair (ko03420), and base excision repair pathways (ko03410). Asterisks denote skew outliers. (XLSX 9 kb)

    IBS percentage in different relationships of simulated families.

    For this visualization, the sequencing error (SE) parameter was set to zero. (A) Distribution of P2 in an example sibling pair. Siblings have much of the genome that is easily detectable as IBD2, which GRAB detects through a large number of windows with a very high P2 statistic. (B) Number of identity windows (IWs) between pairs of individuals, decreasing with increased relationship degree. (C) Percentage of contiguous IWs. A contiguous IW is any IW adjacent to another IW. Unrelated individuals have fewer contiguous IWs than relatives. (D) Maximum length of a set of contiguous IWs. This length tends to be shorter in distant genetic relationships than close relationships. IT: identical twin. FS: full sibling. PO: parent offspring. UN: unrelated individuals. UD: unknown distance.
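The three window statistics in panels (B) to (D) can be sketched as follows, assuming the windows along one chromosome have already been classified as identity windows or not. This is an illustrative reading of the caption, not GRAB's actual implementation. The function name and input layout are hypothetical, and the maximum length is measured in windows rather than base pairs.

```python
def identity_window_stats(is_iw):
    """Summarize identity windows (IWs) along one chromosome.

    `is_iw` is a list of booleans, one per consecutive window.
    Returns (number of IWs, percentage of contiguous IWs, longest run of IWs).
    An IW counts as contiguous when a neighbouring window is also an IW.
    """
    n = len(is_iw)
    total = sum(is_iw)
    contiguous = 0
    longest = run = 0
    for i, flag in enumerate(is_iw):
        if not flag:
            run = 0
            continue
        run += 1
        longest = max(longest, run)
        left = i > 0 and is_iw[i - 1]
        right = i + 1 < n and is_iw[i + 1]
        if left or right:
            contiguous += 1
    pct = 100.0 * contiguous / total if total else 0.0
    return total, pct, longest


# Toy example: six IWs, of which five have an IW neighbour.
windows = [True, True, False, True, False, True, True, True]
print(identity_window_stats(windows))
```

A real genome would need the same computation per chromosome, so that runs do not span chromosome boundaries.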

    Comparison of multiple relationship estimation methods on real WGS families.

    Values in the table are the percentage of correct predictions. Values in parentheses are the percentage of predictions within one degree of the true relationship.

    Density distribution of Spearman correlations of sample rankings for some normalization methods.


    Comparison between the scaling factors suggested by the different methods.

    Lower left: the resulting scaling factors for the heart sample. Upper right: Pairwise correlations between the methods, for all samples. Red shades denote high correlation values (above 0.75), blue denotes low correlation (or anticorrelation). The column to the right indicates the number of uniform genes identified by the method. The Quantile Normalization method is not included in this analysis since it does not produce scaling factors.

    Conceptual taxonomy of scaling methods.

    Blue: published methods. Pink: variations on published methods. Red: novel methods. Dashed lines connect related methods.

    A simulated 26-member, 7-generation pedigree.

    Green symbols indicate founders that were sequenced by CGI, and purple ones indicate children whose genotypes were simulated. The topology of the pedigree was chosen to enable testing of diverse relationship estimations.