Search CORE

University of Essex Research Repository

VBN

Normalized Affymetrix expression data are biased by G-quadruplex formation

Author: Altman
Andrew P. Harrison
Barrett
Bolstad
Burge
Cambon
Do
Dudoit
Eisen
Farhat N. Memon
Geller
Gellert
Giorgi
Graham J. G. Upton
Hammond
Harris
Hochreiter
Hubbell
Hugh P. Shanahan
Irizarry
Irizarry
Iwamoto
Kittleson
Langdon
Li
Memon
Memon
Naef
Patterson
Ringnér
Ryan
Sen
Stalteri
Upton
Upton
Walton
Wu
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2011
Field of study

Probes with runs of four or more guanines (G-stacks) in their sequences can exhibit a level of hybridization that is unrelated to the expression levels of the mRNA that they are intended to measure. This is most likely caused by the formation of G-quadruplexes, where inter-probe guanines form Hoogsteen hydrogen bonds, which probes with G-stacks are capable of forming. We demonstrate that for a specific microarray data set using the Human HG-U133A Affymetrix GeneChip and RMA normalization there is significant bias in the expression levels, the fold change and the correlations between expression levels. These effects grow more pronounced as the number of G-stack probes in a probe set increases. Approximately 14 of the probe sets are directly affected. The analysis was repeated for a number of other normalization pipelines and two, FARMS and PLIER, minimized the bias to some extent. We estimate that ∼15 of the data sets deposited in the GEO database are susceptible to the effect. The inclusion of G-stack probes in the affected data sets can bias key parameters used in the selection and clustering of genes. The elimination of these probes from any analysis in such affected data sets outweighs the increase of noise in the signal. © 2011 The Author(s)

CiteSeerX

Royal Holloway Research Online

Royal Holloway - Pure

Knowledge-based gene expression classification via matrix factorization

Author: A. M. Tomé
Affymetrix
Allison
Baldi
Barnhill
Bolstad
Breiman
Cardoso
Cardoso
Chen
D. Lutter
Diaz-Uriarte
Diaz-Uriarte
Dougherty
Dougherty
Dudoit
E. W. Lang
F. J. Theis
G. Schmitz
Galton
Galton
Golub
Guyon
Hochreiter
Irrizarry
Lee
Li
Liebermeister
Liu
Lutter
M. Stetter
Mangasarian
P. Gómez Vilda
P. Knollmüller
Pearson
Quackenbush
R. Schachtner
Saidi
Schachtner
Schachtner
Schölkopf
Simon
Spang
Talloen
Troyanskaya
Tusher
Wu
Wu
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2008
Field of study

Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can as well be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or group genes into functional categories for further investigation of related metabolic pathways and regulatory networks. Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients.Siemens AG, MunichDFG (Graduate College 638)DAAD (PPP Luso - Alem˜a and PPP Hispano - Alemanas

University of Regensburg Publication Server

Repositório Institucional da Universidade de Aveiro

Clemson University: TigerPrints

PuSH

DEVELOPMENT OF MAP/REDUCE BASED MICROARRAY ANALYSIS TOOLS

Author: Yang Guangyu
Publication venue: Clemson University Libraries
Publication date: 01/08/2013
Field of study

High density oligonucleotide array (microarray) from the Affymetrix GeneChip¨ system has been widely used for the measurements of gene expressions. Currently, public data repositories, such as Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI), have accumulated very large amount of microarray data. For example, there are 84389 human and 9654 Arabidopsis microarray experiments in GEO database. Efficiently integrative analysis large amount of microarray data will provide more knowledge about the biological systems. Traditional microarray analysis tools all implemented sequential algorithms and can only be run on single processor. They are not able to handle very large microarray data sets with thousands of experiments. It is necessary to develop new microarray analysis tools using parallel framework. In this thesis, I implemented microarray quality assessment, background correction, normalization and summarization algorithms using the Map/Reduce framework. The Map/Reduce framework, first introduced by Google in 2004, offers a promising paradigm to develop scalable parallel applications for large-scale data. Evaluation of our new implementation on large microarray data of rice and Arabidopsis showed that they have good speedups. For example, running rice microarray data using our implementations of MAS5.0 algorithms on 20 computer nodes totally 320 processors has a 28 times speedup over using previous C++ implementation on single processor. Our new microarray tools will make it possible to utilize the valuable experiments in the public repositories

μ-CS: An extension of the TM4 platform to manage Affymetrix binary data

Author: AI Saeed
G Alonso
J Quackenbush
K Owzar
L Corradi
M Barnes
MA Hibbs
Mario Cannataro
P Li
Pietro H Guzzi
RA Irizarry
S Durinck
T Park
TY Chang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background A main goal in understanding cell mechanisms is to explain the relationship among genes and related molecular processes through the combined use of technological platforms and bioinformatics analysis. High throughput platforms, such as microarrays, enable the investigation of the whole genome in a single experiment. There exist different kind of microarray platforms, that produce different types of binary data (images and raw data). Moreover, also considering a single vendor, different chips are available. The analysis of microarray data requires an initial preprocessing phase (i.e. normalization and summarization) of raw data that makes them suitable for use on existing platforms, such as the TIGR M4 Suite. Nevertheless, the annotations of data with additional information such as gene function, is needed to perform more powerful analysis. Raw data preprocessing and annotation is often performed in a manual and error prone way. Moreover, many available preprocessing tools do not support annotation. Thus novel, platform independent, and possibly open source tools enabling the semi-automatic preprocessing and annotation of microarray data are needed. Results The paper presents <it>μ</it>-CS (Microarray Cel file Summarizer), a cross-platform tool for the automatic normalization, summarization and annotation of Affymetrix binary data. <it>μ</it>-CS is based on a client-server architecture. The <it>μ</it>-CS client is provided both as a plug-in of the TIGR M4 platform and as a Java standalone tool and enables users to read, preprocess and analyse binary microarray data, avoiding the manual invocation of external tools (e.g. the Affymetrix Power Tools), the manual loading of preprocessing libraries, and the management of intermediate files. The <it>μ</it>-CS server automatically updates the references to the summarization and annotation libraries that are provided to the <it>μ</it>-CS client before the preprocessing. The <it>μ</it>-CS server is based on the web services technology and can be easily extended to support more microarray vendors (e.g. Illumina). Conclusions Thus <it>μ</it>-CS users can directly manage binary data without worrying about locating and invoking the proper preprocessing tools and chip-specific libraries. Moreover, users of the <it>μ</it>-CS plugin for TM4 can manage Affymetrix binary files without using external tools, such as APT (Affymetrix Power Tools) and related libraries. Consequently, <it>μ</it>-CS offers four main advantages: (i) it avoids to waste time for searching the correct libraries, (ii) it reduces possible errors in the preprocessing and further analysis phases, e.g. due to the incorrect choice of parameters or the use of old libraries, (iii) it implements the annotation of preprocessed data, and finally, (iv) it may enhance the quality of further analysis since it provides the most updated annotation libraries. The <it>μ</it>-CS client is freely available as a plugin of the TM4 platform as well as a standalone application at the project web site (<url>http://bioingegneria.unicz.it/M-CS</url>).</p

Different effects of the probe summarization algorithms PLIER and RMA on high-level analysis of Affymetrix exon arrays

Author: Chen Yuchen
He Fei
Qu Yi
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Alternative splicing is an important mechanism that increases protein diversity and functionality in higher eukaryotes. Affymetrix exon arrays are a commercialized platform used to detect alternative splicing on a genome-wide scale. Two probe summarization algorithms, PLIER (Probe Logarithmic Intensity Error) and RMA (Robust Multichip Average), are commonly used to compute gene-level and exon-level expression values. However, a systematic comparison of these two algorithms on their effects on high-level analysis of the arrays has not yet been reported. Results In this study, we showed that PLIER summarization led to over-estimation of gene-level expression changes, relative to exon-level expression changes, in two-group comparisons. Consequently, it led to detection of substantially more skipped exons on up-regulated genes, as well as substantially more included (i.e., non-skipped) exons on down-regulated genes. In contrast, this bias was not observed for RMA-summarized data. By using a published human tissue dataset, we compared the tissue-specific expression and splicing detected by Affymetrix exon arrays with those detected based on expressed sequence databases. We found the tendency of PLIER was not supported by the expressed sequence data. Conclusion We showed that the tendency of PLIER in detection of alternative splicing is likely caused by a technical bias in the approach, rather than a biological bias. Moreover, we observed abnormal summarization results when using the PLIER algorithm, indicating that mathematical problems, such as numerical instability, may affect PLIER performance.</p

Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: High-resolution annotation for microarrays

Author: Cam Margaret C
Lee Joseph C
Lu Jun
Salit Marc L
Publication venue: BioMed Central
Publication date: 01/01/2007
Field of study

BACKGROUND: Extracting biological information from high-density Affymetrix arrays is a multi-step process that begins with the accurate annotation of microarray probes. Shortfalls in the original Affymetrix probe annotation have been described; however, few studies have provided rigorous solutions for routine data analysis. RESULTS: Using AceView, a comprehensive human transcript database, we have reannotated the probes by matching them to RNA transcripts instead of genes. Based on this transcript-level annotation, a new probe set definition was created in which every probe in a probe set maps to a common set of AceView gene transcripts. In addition, using artificial data sets we identified that a minimal probe set size of 4 is necessary for reliable statistical summarization. We further demonstrate that applying the new probe set definition can detect specific transcript variants contributing to differential expression and it also improves cross-platform concordance. CONCLUSION: We conclude that our transcript-level reannotation and redefinition of probe sets complement the original Affymetrix design. Redefinitions introduce probe sets whose sizes may not support reliable statistical summarization; therefore, we advocate using our transcript-level mapping redefinition in a secondary analysis step rather than as a replacement. Knowing which specific transcripts are differentially expressed is important to properly design probe/primer pairs for validation purposes. For convenience, we have created custom chip-description-files (CDFs) and annotation files for our new probe set definitions that are compatible with Bioconductor, Affymetrix Expression Console or third party software

A probe-treatment-reference (PTR) model for the analysis of oligonucleotide expression microarrays

Author: Cheng Chao
Ge Huanying
Li Lei M
Publication venue: BioMed Central
Publication date: 01/04/2008
Field of study

Abstract Background Microarray pre-processing usually consists of normalization and summarization. Normalization aims to remove non-biological variations across different arrays. The normalization algorithms generally require the specification of reference and target arrays. The issue of reference selection has not been fully addressed. Summarization aims to estimate the transcript abundance from normalized intensities. In this paper, we consider normalization and summarization jointly by a new strategy of reference selection. Results We propose a Probe-Treatment-Reference (PTR) model to streamline normalization and summarization by allowing multiple references. We estimate parameters in the model by the Least Absolute Deviations (LAD) approach and implement the computation by median polishing. We show that the LAD estimator is robust in the sense that it has bounded influence in the three-factor PTR model. This model fitting, implicitly, defines an "optimal reference" for each probe-set. We evaluate the effectiveness of the PTR method by two Affymetrix spike-in data sets. Our method reduces the variations of non-differentially expressed genes and thereby increases the detection power of differentially expressed genes. Conclusion Our results indicate that the reference effect is important and should be considered in microarray pre-processing. The proposed PTR method is a general framework to deal with the issue of reference selection and can readily be applied to existing normalization algorithms such as the invariant-set, sub-array and quantile method.</p

Preferred analysis methods for Affymetrix GeneChips. II. An expanded, balanced, wholly-defined spike-in dataset

Author: Halfon Marc S
Miecznikowski Jeffrey C
Zhu Qianqian
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study