68 research outputs found

    Algorithm-driven Artifacts in median polish summarization of Microarray data

    Background: High-throughput measurement of transcript intensities using Affymetrix-type oligonucleotide microarrays has produced a massive quantity of data during the last decade. Different preprocessing techniques exist to convert the raw signal intensities measured by these chips into gene expression estimates. Although these techniques have been widely benchmarked in the context of differential gene expression analysis, there are only a few examples where their performance has been assessed with respect to coexpression-based studies such as sample classification. Results: In the present paper we benchmark the three most used normalization procedures (MAS5, RMA and GCRMA) in the context of inter-array correlation analysis, confirming and extending the finding that RMA and GCRMA consistently overestimate sample similarity upon normalization. We determine that median polish summarization is responsible for generating a large proportion of these over-similarity artifacts. Furthermore, we show that most affected probesets also show internal signal disagreement, and tend to be composed of individual probes hitting different gene transcripts. We finally provide a correction to the RMA/GCRMA summarization procedure that massively reduces inter-array correlation artifacts, without affecting the detection of differentially expressed genes. Conclusions: We propose tRMA as a modification of RMA to normalize microarray experiments for correlation-based analysis.
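The median polish summarization named in the abstract above can be illustrated with a minimal sketch of Tukey's median polish applied to one probeset. This is a generic textbook version, not the authors' RMA or tRMA code; the function name `median_polish`, the probes-by-arrays layout, and the stopping rule are assumptions made for illustration.

```python
import numpy as np


def median_polish(x, max_iter=10, tol=1e-6):
    """Tukey median polish of a probes x arrays matrix of log2 intensities.

    Returns (overall, probe_effects, array_effects, residuals) such that
    x[i, j] == overall + probe_effects[i] + array_effects[j] + residuals[i, j].
    Illustrative sketch only; layout and stopping rule are assumptions.
    """
    resid = np.asarray(x, dtype=float).copy()
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    array = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep out row (probe) medians.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe += row_med
        # Move the median of the array effects into the overall term.
        shift = np.median(array)
        array -= shift
        overall += shift
        # Sweep out column (array) medians.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        array += col_med
        # Move the median of the probe effects into the overall term.
        shift = np.median(probe)
        probe -= shift
        overall += shift
        # Stop once neither sweep changes the residuals.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, probe, array, resid


def probeset_summary(x):
    """Per-array expression estimate: overall term plus array effect."""
    overall, _, array, _ = median_polish(x)
    return overall + array
```

In this sketch each array's expression estimate for the probeset is the overall term plus that array's column effect; the residuals show where individual probes disagree with the fitted additive model.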

    Normalized Affymetrix expression data are biased by G-quadruplex formation

    Probes with runs of four or more guanines (G-stacks) in their sequences can exhibit a level of hybridization that is unrelated to the expression levels of the mRNA that they are intended to measure. This is most likely caused by the formation of G-quadruplexes, where inter-probe guanines form Hoogsteen hydrogen bonds, which probes with G-stacks are capable of forming. We demonstrate that for a specific microarray data set using the Human HG-U133A Affymetrix GeneChip and RMA normalization there is significant bias in the expression levels, the fold change and the correlations between expression levels. These effects grow more pronounced as the number of G-stack probes in a probe set increases. Approximately 14% of the probe sets are directly affected. The analysis was repeated for a number of other normalization pipelines and two, FARMS and PLIER, minimized the bias to some extent. We estimate that ∼15% of the data sets deposited in the GEO database are susceptible to the effect. The inclusion of G-stack probes in the affected data sets can bias key parameters used in the selection and clustering of genes. The elimination of these probes from any analysis in such affected data sets outweighs the increase of noise in the signal. © 2011 The Author(s)
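The G-stack criterion in the abstract above (a run of four or more guanines in a probe sequence) is simple to check in code. The following is a minimal sketch; the function names and the representation of a probe set as a plain list of sequence strings are assumptions for illustration, not part of the paper.

```python
import re

# A G-stack, as described in the abstract: four or more consecutive guanines.
G_STACK = re.compile(r"G{4,}")


def has_g_stack(probe_seq: str) -> bool:
    """Return True if the probe sequence contains a run of >= 4 guanines."""
    return G_STACK.search(probe_seq.upper()) is not None


def count_g_stack_probes(probe_seqs) -> int:
    """Count the probes in a probe set (hypothetical list of strings) that carry a G-stack."""
    return sum(has_g_stack(seq) for seq in probe_seqs)


def drop_g_stack_probes(probe_seqs):
    """Return the probe sequences without G-stacks, preserving order."""
    return [seq for seq in probe_seqs if not has_g_stack(seq)]
```

A filter like `drop_g_stack_probes` would be applied before summarization, which corresponds to the probe elimination the abstract recommends for affected data sets.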

    A single-sample microarray normalization method to facilitate personalized-medicine workflows

    Gene-expression microarrays allow researchers to characterize biological phenomena in a high-throughput fashion but are subject to technological biases and inevitable variabilities that arise during sample collection and processing. Normalization techniques aim to correct such biases. Most existing methods require multiple samples to be processed in aggregate; consequently, each sample's output is influenced by other samples processed jointly. However, in personalized-medicine workflows, samples may arrive serially, so renormalizing all samples upon each new arrival would be impractical. We have developed Single Channel Array Normalization (SCAN), a single-sample technique that models the effects of probe-nucleotide composition on fluorescence intensity and corrects for such effects, dramatically increasing the signal-to-noise ratio within individual samples while decreasing variation across samples. In various benchmark comparisons, we show that SCAN performs as well as or better than competing methods yet has no dependence on external reference samples and can be applied to any single-channel microarray platform.

    Normalization of oligonucleotide arrays based on the least-variant set of genes

    Background: It is well known that the normalization step of microarray data makes a difference in the downstream analysis. All normalization methods rely on certain assumptions, so differences in results can be traced to different sensitivities to violation of the assumptions. Illustrating the lack of robustness, in a striking spike-in experiment all existing normalization methods fail because of an imbalance between up- and down-regulated genes. This means it is still important to develop a normalization method that is robust against violation of the standard assumptions. Results: We develop a new algorithm based on identification of the least-variant set (LVS) of genes across the arrays. The array-to-array variation is evaluated in the robust linear model fit of pre-normalized probe-level data. The genes are then used as a reference set for a non-linear normalization. The method is applicable to any existing expression summaries, such as MAS5 or RMA. Conclusion: We show that LVS normalization outperforms other normalization methods when the standard assumptions are not satisfied. In the complex spike-in study, LVS performs similarly to the ideal (in practice unknown) housekeeping-gene normalization. An R package called lvs is available at http://www.meb.ki.se/~yudpaw.

    Biasogram: visualization of confounding technical bias in gene expression data.

    Gene expression profiles of clinical cohorts can be used to identify genes that are correlated with a clinical variable of interest such as patient outcome or response to a particular drug. However, expression measurements are susceptible to technical bias caused by variation in extraneous factors such as RNA quality and array hybridization conditions. If such technical bias is correlated with the clinical variable of interest, the likelihood of identifying false positive genes is increased. Here we describe a method to visualize an expression matrix as a projection of all genes onto a plane defined by a clinical variable and a technical nuisance variable. The resulting plot indicates the extent to which each gene is correlated with the clinical variable or the technical variable. We demonstrate this method by applying it to three clinical trial microarray data sets, one of which identified genes that may have been driven by a confounding technical variable. This approach can be used as a quality control step to identify data sets that are likely to yield false positive results
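The idea of placing every gene on a plane spanned by a clinical variable and a technical nuisance variable can be sketched as follows. This is not the authors' implementation: using per-gene Pearson correlations as the two coordinates is an assumption made here for illustration, as are the function name and the genes-by-samples layout of the expression matrix.

```python
import numpy as np


def biasogram_coordinates(expr, clinical, technical):
    """Per-gene correlations with a clinical and a technical variable.

    expr      : genes x samples expression matrix
    clinical  : length-n_samples clinical variable (e.g. an outcome score)
    technical : length-n_samples technical variable (e.g. an RNA quality score)

    Returns an (n_genes, 2) array whose columns are the Pearson correlation
    of each gene with the clinical and the technical variable. Genes with
    constant expression yield NaN. Illustrative assumption, not the
    published method.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    gene_ss = (centered ** 2).sum(axis=1)

    def corr_with(variable):
        v = np.asarray(variable, dtype=float)
        v = v - v.mean()
        return centered @ v / np.sqrt(gene_ss * (v ** 2).sum())

    return np.column_stack([corr_with(clinical), corr_with(technical)])
```

Plotting the first column against the second gives one point per gene; under this sketch, genes far from zero on the technical axis are the ones whose apparent clinical association deserves scrutiny.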

    Genomic approaches to unveil the physiological pathways activated in Arabidopsis treated with plant-derived raw extracts

    DNA microarrays can be used to obtain a fingerprint of the transcriptional status of the plant or cell under a given condition and may be useful for characterising which genes respond, either by induction or repression, to novel stimuli or specific treatments. An in-depth bioinformatic analysis of all the data produced by microarrays can further highlight the metabolic or functional pathways most affected by the treatment. This approach has been used to investigate the effects of treating Arabidopsis seedlings with different plant-derived raw materials, provided by Valagro SpA. A clear example is treatment with a raw plant-derived protein extract (VAL-P01). In this case the treatment induced genes related to ABA and osmotic stress treatment. We therefore demonstrated that VAL-P01 was able to mimic in planta the same pattern of responses linked to ABA treatment or osmotic stress, making the plant stronger against possible further stresses. Another plant extract, VAL-P02, was shown to significantly alter the transcription of senescence genes, making it an ideal candidate adjuvant for prolonging the shelf-life of plant products.

    A flexible and versatile framework for statistical design and analysis of quantitative mass spectrometry-based proteomic experiments

    Quantitative mass spectrometry (MS)-based proteomics is an indispensable technology for biological and clinical research. As the proteomics field grows, MS-based proteomic workflows are becoming more complex and diverse. The accuracy and the throughput of the MS measurements and of the signal processing tools have dramatically increased. However, many existing statistical tools and workflows have not kept pace with the technological development. Therefore, there is a need for flexible statistical tools, which reflect diverse and complex workflows, are computationally efficient for large datasets, and maximize the reproducibility of the results. We propose a family of linear mixed effects models, and a split-plot view of the experimental design, that represent measurements from quantitative mass spectrometry-based proteomics. The whole plot part of the design reflects the structure of the biological variation of the experiment, such as case-control design, paired design, or time-course design. The subplot part of the design reflects the structure of the technological variation, such as fragmentation patterns, labeling strategy, and presence of multiple peptides per protein. We propose an estimation procedure that separately estimates the parameters of the subplot and the whole plot parts of the design, to maximize the flexibility of the model, increase the speed of the analysis, and facilitate the interpretation. The proposed modeling framework was validated using 9 controlled mixtures and 10 experimental datasets from targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS), where signals were extracted with multiple signal processing tools. We implemented the proposed method in the software package MSstats, which checks the correctness of the user input, recognizes arbitrarily complex experimental designs, visualizes the data and performs statistical modeling and inference. It is interoperable with other existing computational tools such as Skyline.

    Statistical methods for differential proteomics at peptide and protein level


    Growth gone awry: exploring the role of embryonic liver development genes in HCV induced cirrhosis and hepatocellular carcinoma

    Introduction and methods: Hepatocellular carcinoma (HCC) remains a difficult disease to study even after a decade of genomic analysis. Metabolic and cell-cycle perturbations are known, large changes in tumors that add little to our understanding of the development of tumors, but generate “noise” that obscures potentially important smaller scale expression changes in “driver genes”. Recently, some researchers have suggested that HCC shares pathways involving the master regulators of embryonic development. Here, we investigated the involvement and specificity of developmental genes in HCV-cirrhosis and HCV-HCC. We obtained microarray studies from 30 patients with HCV-cirrhosis and 49 patients with HCV-HCC and compared them to 12 normal livers. Differential gene expression is specific to liver development genes: 86 of 202 (43%) genes specific to liver development had differential expression between normal and cirrhotic or HCC samples. Of 60 genes with paralogous function, which are specific to development of other organs and have known associations with other cancer types, none were expressed in either adult normal liver or tumor tissue. Developmental genes are widely differentially expressed in both cirrhosis and early HCC, but not late HCC: 69 liver development genes were differentially expressed in cirrhosis, and 58 of these (84%) were also dysregulated in early HCC. 19/58 (33%) had larger-magnitude changes in cirrhosis and 5 (9%) had larger-magnitude changes in early HCC. 16 (9%) genes were uniquely altered in early tumors, while only 2 genes were uniquely changed in late-stage (T3 and T4) HCC. Together, these results suggest that the master regulators of liver development are active in the pre-cancerous cirrhotic liver and in cirrhotic livers with emerging tumors but play a limited role in the transition from early to late stage HCC.
Common patterns of coordinated developmental gene expression include: (1) Dysregulation of BMP2 signaling in cirrhosis followed by overexpression of BMP inhibitors in HCC. BMP inhibitor GPC3 was overexpressed in nearly all tumors, while GREM1 was associated specifically with recurrence-free survival after ablation and transplant. (2) Cirrhosis tissues acquire a progenitor-like signature including high expression of Vimentin, EPCAM, and KRT19, and these markers remain over-expressed to a lesser extent in HCC. (3) Hepatocyte proliferation inhibitors (HPI) E-cadherin (CDH1), BMP2, and MST1 were highly expressed in cirrhosis and remained over-expressed in 16 HCC patients who were transplanted with excellent recurrence-free survival (94% survival after 2 years; mean recurrence-free survival = 5.6 yrs), while loss in early HCC was associated with early recurrence and (2 year). Loss of HPI overexpression was also correlated with overexpression of c-MET and loss of STAT3, LAMA2, FGFR2, CITED2, KIT, SMAD7, GATA6, ERBB2, and NOTCH2