Optimal Scaling of Digital Transcriptomes
<div><p>Deep sequencing of transcriptomes has become an indispensable tool for biology, enabling expression levels for thousands of genes to be compared across multiple samples. Since transcript counts scale with sequencing depth, counts from different samples must be normalized to a common scale prior to comparison. We analyzed fifteen existing and novel algorithms for normalizing transcript counts, and evaluated the effectiveness of the resulting normalizations. For this purpose we defined two novel and mutually independent metrics: (1) the number of “uniform” genes (genes whose normalized expression levels have a sufficiently low coefficient of variation), and (2) low Spearman correlation between normalized expression profiles of gene pairs. We also defined four novel algorithms, one of which explicitly maximizes the number of uniform genes, and compared the performance of all fifteen algorithms. The two most commonly used methods (scaling to a fixed total value, or equalizing the expression of certain ‘housekeeping’ genes) yielded particularly poor results, surpassed even by normalization based on randomly selected gene sets. Conversely, seven of the algorithms approached what appears to be optimal normalization. Three of these algorithms rely on the identification of “ubiquitous” genes: genes expressed in all the samples studied, but never at very high or very low levels. We demonstrate that these include a “core” of genes expressed in many tissues in a mutually consistent pattern, which is suitable for use as an internal normalization guide. The new methods yield robustly normalized expression values, which is a prerequisite for the identification of differentially expressed and tissue-specific genes as potential biomarkers.</p></div>
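<p>The “uniform genes” metric from the abstract can be illustrated with a minimal sketch. It scales each sample to a fixed total and counts genes whose coefficient of variation across samples falls below a cutoff. The function names, the toy counts, and the cutoff of 0.25 are assumptions made for illustration; the abstract only says the coefficient of variation must be “sufficiently low”, and this is not the authors' implementation.</p>

```python
import statistics


def scale_to_total(counts, target=1_000_000):
    """Scale one sample's counts so that they sum to `target`."""
    total = sum(counts)
    return [c * target / total for c in counts]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")  # a gene with no expression is never "uniform"
    return statistics.pstdev(values) / mean


def count_uniform_genes(samples, cv_cutoff=0.25):
    """Count genes whose CV across samples is below `cv_cutoff`.

    `samples` is a list of per-sample count lists in the same gene order.
    The default cutoff is an illustrative assumption, not a published value.
    """
    n_genes = len(samples[0])
    return sum(
        1
        for g in range(n_genes)
        if coefficient_of_variation([s[g] for s in samples]) < cv_cutoff
    )


# Toy data (hypothetical): the same expression profile sequenced at three depths.
raw = [
    [100, 200, 300, 400],
    [200, 400, 600, 800],
    [50, 100, 150, 200],
]
normalized = [scale_to_total(sample) for sample in raw]

uniform_before = count_uniform_genes(raw)
uniform_after = count_uniform_genes(normalized)
```

<p>In this toy case the samples differ only in sequencing depth, so total-count scaling recovers every gene as uniform. The abstract reports that on real data this scaling performed poorly, which such idealized data cannot show.</p>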
Pairwise comparison of expression levels.
<p>We compared the levels of expression of 15,861 genes with nonzero expression levels in both liver and testes, expressed in terms of average coverage per base. Each point represents one gene. A) Data prior to normalization. Housekeeping genes are highlighted as green points and labeled. The blue and red diagonals represent the relative correction factors computed based on total counts or the NCS method, relative to no normalization (black). The magenta and orange curves depict the percentiles when considering all genes or genes with nonzero values, respectively. B) Values after correction by NCS. Points in black or red denote genes with positive weights, which therefore guided the scaling. Points in red denote the 39 genes with weight >0.5.</p>
Comparison of performance of various normalization methods.
<p>Each method is evaluated by the number of genes observed to be consistently expressed across samples (abscissa); different methods also yield different numbers of genes identified as specific to one sample. The numbers in the orange circles denote the number of housekeeping genes combined using the geNorm algorithm. The dashed arrows show one stochastic path of the ES from the data prior to normalization (white square, “None”) to the best approximation to the optimal solution (gray square, ES). Brown squares represent the results obtained via the TMM method, using each of the 16 samples as reference.</p>
Density distribution of Spearman correlations of sample rankings for some normalization methods.
Comparison between the scaling factors suggested by the different methods.
<p>Lower left: the resulting scaling factors for the heart sample. Upper right: Pairwise correlations between the methods, for all samples. Red shades denote high correlation values (above 0.75), blue denotes low correlation (or anticorrelation). The column to the right indicates the number of uniform genes identified by the method. The Quantile Normalization method is not included in this analysis since it does not produce scaling factors.</p>
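<p>A pairwise comparison of this kind could be sketched as follows. The caption does not say which correlation coefficient was used, so Pearson correlation is an assumption here; the method names and scaling factors are hypothetical placeholders, not values from the study.</p>

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pairwise_correlations(factors):
    """Correlate the per-sample scaling factors of every pair of methods."""
    return {
        (m1, m2): pearson(factors[m1], factors[m2])
        for m1, m2 in combinations(sorted(factors), 2)
    }


# Hypothetical scaling factors: one value per sample for each method.
factors = {
    "method_a": [1.0, 2.0, 3.0, 4.0],
    "method_b": [2.0, 4.0, 6.0, 8.0],
    "method_c": [4.0, 3.0, 2.0, 1.0],
}
corr = pairwise_correlations(factors)

# Pairs above the 0.75 threshold that the caption shades red.
high = [pair for pair, r in corr.items() if r > 0.75]
```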
Conceptual taxonomy of scaling methods.
<p>Blue: published methods. Pink: variations on published methods. Red: novel methods. Dashed lines connect related methods.</p>
A Systems Approach to Rheumatoid Arthritis
<div><p>Rheumatoid arthritis (RA) is a chronic autoimmune disease that primarily attacks synovial joints. Despite advances in the diagnosis and treatment of RA, novel molecular targets are still needed to improve the accuracy of diagnosis and the therapeutic outcomes. Here, we present a systems approach that can effectively 1) identify core RA-associated genes (RAGs), 2) reconstruct RA-perturbed networks, and 3) select potential targets for diagnosis and treatment of RA. By integrating multiple previously reported gene expression datasets, we first identified 983 core RAGs that show RA-dominant differential expression, compared to osteoarthritis (OA), in the multiple datasets. Using the core RAGs, we then reconstructed RA-perturbed networks that delineate key RA-associated cellular processes and transcriptional regulation. The networks revealed that synovial fibroblasts play major roles in defining RA-perturbed processes, that anti-TNF-α therapy restored many RA-perturbed processes, and that 19 transcription factors (TFs) contribute substantially to the deregulation of the core RAGs in the RA-perturbed networks. Finally, we selected a list of potential molecular targets that can act as metrics or modulators of the RA-perturbed networks. Therefore, these network models identify a panel of potential targets that will serve as an important resource for the discovery of therapeutic targets and diagnostic markers, as well as providing novel insights into RA pathogenesis.</p></div>
RA associated genes, cellular processes and disease phenotypes.
<p>A) and B) Seven major clusters (1, 2, 3, 4, 6, 8, 12) showing the DEPs of the RAGs in RA and OA samples: Shared (RAGs commonly up- or down-regulated in RA and OA samples) and RA-dominant (RAGs dominantly up- or down-regulated in RA samples). The number of RAGs in each cluster is denoted in the table. When a gene shows a mixture of the DEPs in multiple clusters, NMF, as a soft clustering method, assigns the gene to multiple clusters. Thus, the sum of the RA-dominant up-regulated RAGs across clusters (1104 RAGs) can exceed 983, the number reported for the RA-dominant RAGs. C) GO Biological Processes (GOBPs) enriched by the up-regulated RAGs (<i>P</i><0.05). For each GOBP, a Z score was computed as <i>N</i><sup>−1</sup>(1−<i>P</i>), where <i>N</i><sup>−1</sup>(·) is the inverse of the standard normal cumulative distribution function and <i>P</i> is the enrichment p-value for the GOBP. Empty and gray bars represent the GOBPs enriched by shared and RA-dominant up-regulated RAGs, respectively. D) Five classes of RA-related diseases and their association with the RAGs. P-values were computed using the empirical statistical testing described in the supplementary methods.</p>
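<p>The Z-score conversion in panel C can be reproduced with the Python standard library. This is a minimal sketch; the function name is hypothetical, and the input check is an added safeguard rather than part of the described method.</p>

```python
from statistics import NormalDist


def enrichment_z_score(p_value):
    """Convert an enrichment p-value P into Z = N^{-1}(1 - P).

    N^{-1} is the inverse of the standard normal cumulative
    distribution function (mean 0, standard deviation 1).
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p_value)


# The caption's significance threshold P = 0.05 maps to Z of about 1.645.
z_at_threshold = enrichment_z_score(0.05)
```

<p>Smaller p-values give larger Z scores, so GOBPs with stronger enrichment appear as longer bars.</p>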
Gene regulatory networks activated in RA.
<p>A) Target enrichment scores representing the significance of the overlap between the targets of each TF and the RAGs belonging to the network modules. B–D) Gene regulatory networks describing the TF-target relationships for the three processes: T-cell activation including RUNX1 and FOXP3 (B), Matrix remodeling including AP-1 (JUN and FOS) and NFKB1 (C), and Cell proliferation and survival including NFAT5, E2F3, and TP53 (D).</p>
Known molecular target candidates for diagnosis and therapy of RA.
- …