Normalized Affymetrix expression data are biased by G-quadruplex formation
Probes with runs of four or more guanines (G-stacks) in their sequences can exhibit a level of hybridization that is unrelated to the expression levels of the mRNA that they are intended to measure. This is most likely caused by the formation of G-quadruplexes, in which inter-probe guanines form Hoogsteen hydrogen bonds. We demonstrate that for a specific microarray data set using the Human HG-U133A Affymetrix GeneChip and RMA normalization there is significant bias in the expression levels, the fold change and the correlations between expression levels. These effects grow more pronounced as the number of G-stack probes in a probe set increases. Approximately 14% of the probe sets are directly affected. The analysis was repeated for a number of other normalization pipelines and two, FARMS and PLIER, reduced the bias to some extent. We estimate that ∼15% of the data sets deposited in the GEO database are susceptible to the effect. The inclusion of G-stack probes in the affected data sets can bias key parameters used in the selection and clustering of genes. The benefit of eliminating these probes from any analysis of such affected data sets outweighs the resulting increase in noise in the signal. © 2011 The Author(s)
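As a rough illustration of the filtering step this abstract argues for, the sketch below flags probes that contain a G-stack (a run of four or more guanines) and drops them from a probe set. The function names and the two probe sequences are hypothetical and invented for the example; they are not taken from the paper or from any Affymetrix annotation file.

```python
import re

# A G-stack is taken here to be a run of four or more consecutive guanines.
G_STACK = re.compile(r"G{4,}")


def has_g_stack(probe_seq: str) -> bool:
    """Return True if the probe sequence contains a run of >= 4 guanines."""
    return G_STACK.search(probe_seq.upper()) is not None


def filter_probe_set(probes: dict[str, str]) -> dict[str, str]:
    """Drop G-stack probes from a probe set given as probe id -> sequence."""
    return {pid: seq for pid, seq in probes.items() if not has_g_stack(seq)}


# Hypothetical 25-mer probes, made up for illustration only.
probe_set = {
    "probe_1": "ACGTTGCAAGGGGTCATCGATCGTA",  # contains GGGG
    "probe_2": "ACGTTGCAAGGGTCATCGATCGTAC",  # longest G run is GGG
}
kept = filter_probe_set(probe_set)  # only probe_2 survives
```

In practice the remaining probes would then be re-summarized with the chosen normalization pipeline; that step is not shown here.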
Listen to genes: dealing with microarray data in the frequency domain
Background: We present a novel and systematic approach to analyze temporal microarray data. The approach includes
normalization, clustering and network analysis of genes.
Methodology: Genes are normalized using an error model based uniform normalization method aimed at identifying and
estimating the sources of variations. The model minimizes the correlation among error terms across replicates. The
normalized gene expressions are then clustered in terms of their power spectrum density. The method of complex Granger
causality is introduced to reveal interactions between sets of genes. Complex Granger causality along with partial Granger
causality is applied in both time and frequency domains to selected as well as all the genes to reveal the interesting
networks of interactions. The approach is successfully applied to Arabidopsis leaf microarray data generated from 31,000
genes observed over 22 time points over 22 days. Three circuits: a circadian gene circuit, an ethylene circuit and a new
global circuit showing a hierarchical structure to determine the initiators of leaf senescence are analyzed in detail.
Conclusions: We use a totally data-driven approach to form biological hypotheses. Clustering using the power-spectrum
analysis helps us identify genes of potential interest. Their dynamics can be captured accurately in the time and frequency
domain using the methods of complex and partial Granger causality. With the rise in availability of temporal microarray
data, such methods can be useful tools in uncovering the hidden biological interactions. We show our method in a
step-by-step manner with the help of toy models as well as a real biological dataset. We also analyse three distinct gene circuits of
potential interest to Arabidopsis researchers.
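The abstract does not say how the power spectrum density is estimated or how the clustering is performed. As a minimal sketch, assuming a plain periodogram and using "group genes by their dominant frequency" as a crude stand-in for the clustering step, the idea looks roughly like this. The gene names and the 22-point profiles are synthetic.

```python
import cmath
import math


def periodogram(series):
    """Power at frequencies k = 1 .. n//2 (cycles per series length)."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the constant offset
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power.append(abs(coeff) ** 2 / n)
    return power


def dominant_frequency(series):
    """Frequency index k with the largest periodogram power."""
    power = periodogram(series)
    return 1 + max(range(len(power)), key=lambda i: power[i])


# Synthetic profiles over 22 time points (illustrative, not real data).
n_points = 22
profiles = {
    "gene_slow": [math.sin(2 * math.pi * 1 * t / n_points) for t in range(n_points)],
    "gene_fast": [math.cos(2 * math.pi * 5 * t / n_points) for t in range(n_points)],
    "gene_slow_2": [3 * math.sin(2 * math.pi * 1 * t / n_points + 0.4) for t in range(n_points)],
}

# Crude grouping: genes sharing a dominant frequency land in the same group.
groups = {}
for gene, series in profiles.items():
    groups.setdefault(dominant_frequency(series), []).append(gene)
```

Grouping in the frequency domain puts the two slow genes together even though they differ in amplitude and phase, which is the kind of similarity a time-domain distance can miss.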
Discussion of: Treelets--An adaptive multi-scale basis for sparse unordered data
This is a discussion of paper "Treelets--An adaptive multi-scale basis for
sparse unordered data" [arXiv:0707.0481] by Ann B. Lee, Boaz Nadler and Larry
Wasserman. In this paper the authors defined a new type of dimension reduction
algorithm, namely, the treelet algorithm. The treelet method has the merit of
being completely data driven, and its decomposition is easier to interpret as
compared to PCR. It is suitable in certain situations, but it also has its
own limitations. I will discuss both the strengths and the weaknesses of this
method when applied to microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/08-AOAS137E in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
maigesPack: A Computational Environment for Microarray Data Analysis
Microarray technology is still an important way to assess gene expression in
molecular biology, mainly because it measures expression profiles for thousands
of genes simultaneously, which makes this technology a good option for some
studies focused on systems biology. One of its main problems is the complexity of
the experimental procedure, which presents several sources of variability that hinder
statistical modeling. So far, there is no standard protocol for generation and
evaluation of microarray data. To ease the analysis process, this paper
presents an R package, named maigesPack, that helps with data organization.
Besides that, it makes data analysis process more robust, reliable and
reproducible. Also, maigesPack aggregates several data analysis procedures
reported in literature, for instance: cluster analysis, differential
expression, supervised classifiers, relevance networks and functional
classification of gene groups or gene networks.
Study of meta-analysis strategies for network inference using information-theoretic approaches
© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Reverse engineering of gene regulatory networks (GRNs) from gene expression data is a classical challenge in systems biology. Thanks to high-throughput technologies, a massive amount of gene-expression data has been accumulated in the public repositories. Modelling GRNs from multiple experiments (also called integrative analysis) has therefore naturally become a standard procedure in modern computational biology. Indeed, such analysis is usually more robust than the traditional approaches focused on individual datasets, which typically suffer from some experimental bias and a small number of samples.
To date, there are mainly two strategies for the problem of interest: the first one ("data merging") merges all datasets together and then infers a GRN, whereas the other ("networks ensemble") infers GRNs from every dataset separately and then aggregates them using some ensemble rules (such as ranksum or weightsum). Unfortunately, a thorough comparison of these two approaches is lacking.
In this paper, we evaluate the performances of the various meta-analysis approaches mentioned above with a systematic set of experiments based on in silico benchmarks. Furthermore, we present a new meta-analysis approach for inferring GRNs from multiple studies. Our proposed approach, adapted to methods based on pairwise measures such as correlation or mutual information, consists of two steps: aggregating matrices of the pairwise measures from every dataset, followed by extracting the network from the meta-matrix. Peer Reviewed. Postprint (author's final draft)
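The two-step idea described in this abstract (aggregate the pairwise-measure matrices, then extract a network from the meta-matrix) can be sketched as follows. The abstract does not state the aggregation rule or the extraction method, so the choices here are assumptions: absolute Pearson correlation as the pairwise measure, a plain average across datasets, and a fixed threshold to extract edges. The gene names and values are toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_scores(dataset):
    """dataset: gene -> expression values. Returns {(g1, g2): |r|}."""
    return {
        (g1, g2): abs(pearson(dataset[g1], dataset[g2]))
        for g1, g2 in combinations(sorted(dataset), 2)
    }


def meta_network(datasets, threshold=0.8):
    """Step 1: average the pairwise scores over datasets (the meta-matrix).
    Step 2: keep the gene pairs whose averaged score reaches the threshold."""
    per_dataset = [pairwise_scores(d) for d in datasets]
    meta = {
        pair: sum(scores[pair] for scores in per_dataset) / len(per_dataset)
        for pair in per_dataset[0]
    }
    return {pair for pair, score in meta.items() if score >= threshold}


# Toy studies measuring the same three genes (values are made up).
study_1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, -1, 1, -1]}
study_2 = {"A": [4, 1, 3, 2], "B": [8, 2, 6, 4], "C": [1, 1, -1, -1]}
edges = meta_network([study_1, study_2])  # only the A-B pair passes the threshold
```

The sketch assumes every dataset measures the same genes. In contrast to "data merging", no expression values are pooled across studies, and in contrast to "networks ensemble", only one network is ever inferred.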
Profound effect of profiling platform and normalization strategy on detection of differentially expressed microRNAs
Adequate normalization minimizes the effects of systematic technical variations and is a prerequisite for detecting meaningful biological changes. However, reports on the performance of miRNA normalization methods, and the resulting recommendations, are inconsistent. Thus, we investigated the impact of seven different normalization methods (reference gene index (RGI), global geometric mean, quantile, invariant selection, loess, loessM, and generalized procrustes analysis (GPA)) on intra- and inter-platform performance of two distinct and commonly used miRNA profiling platforms. We included data from miRNA profiling analyses derived from a hybridization-based platform (Agilent Technologies, AGL) and an RT-qPCR platform (Applied Biosystems TaqMan Low Density Array, TLDA). Furthermore, we validated a subset of miRNAs by individual RT-qPCR assays. Our analyses incorporated data from the effect of differentiation and tumor necrosis factor alpha treatment on primary human skeletal muscle cells and a murine skeletal muscle cell line. Distinct normalization methods differed in their impact on (i) standard deviations, (ii) the area under the receiver operating characteristic (ROC) curve, and (iii) the similarity of differential expression. Loess, loessM, and quantile analysis were most effective in minimizing standard deviations on the Agilent and TLDA platforms. Moreover, loess, loessM, invariant selection and generalized procrustes analysis increased the area under the ROC curve, a measure for the statistical performance of a test. The Jaccard index revealed that inter-platform concordance of differential expression tended to be increased by loess, loessM, quantile, and GPA normalization of AGL and TLDA data as well as RGI normalization of TLDA data.
We recommend the application of loess, or loessM, and GPA normalization for miRNA Agilent arrays and qPCR cards, as these normalization approaches were shown to (i) effectively reduce standard deviations, (ii) increase sensitivity and accuracy of differential miRNA expression detection, and (iii) increase inter-platform concordance. Results showed the successful adoption of loessM and generalized procrustes analysis to one-color miRNA profiling experiments.
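Of the seven methods compared in this abstract, quantile normalization is the simplest to show in a few lines. The sketch below is a generic textbook version, not the implementation used in the study: each value is replaced by the mean, across samples, of the values that share its rank. Ties are broken by position, which is a simplification, and the input numbers are invented.

```python
def quantile_normalize(samples):
    """samples: list of equally long lists, one list per array or sample.

    Each value is replaced by the mean, across samples, of the values
    that share its rank, so all samples end up with the same distribution.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)
    ]
    normalized = []
    for s in samples:
        # Indices of s ordered from smallest to largest value (ties by position).
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two made-up samples with three features each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 9]])
# normalized == [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]
```

After normalization both samples contain the same set of values (1.5, 3.5 and 7.0) and only the assignment to features differs. This is why the method shrinks between-sample standard deviations, and also why it can hide genuine global shifts.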
Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction
In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
We conclude that the strategy of presenting only the optimal result is not acceptable, and suggest alternative approaches for properly reporting classification accuracy.
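The source of the optimistic bias can be reproduced with a toy simulation. The sketch below is a deliberate simplification of the study design and not a reimplementation of it. There is no cross-validation and no real microarray data, and every "classifier variant" is an independent coin flip, whereas real variants are correlated. The sample size, repeat count and seed are arbitrary assumptions; only the figure of 124 variants comes from the abstract.

```python
import random


def min_error(n_variants, n_samples, rng):
    """Smallest error rate among variants that guess labels at random."""
    labels = [rng.getrandbits(1) for _ in range(n_samples)]
    best = 1.0
    for _ in range(n_variants):
        wrong = sum(rng.getrandbits(1) != y for y in labels)
        best = min(best, wrong / n_samples)
    return best


def mean_min_error(n_variants, n_samples=50, n_repeats=100, seed=0):
    """Average, over repeated experiments, of the best variant's error rate."""
    rng = random.Random(seed)
    total = sum(min_error(n_variants, n_samples, rng) for _ in range(n_repeats))
    return total / n_repeats


single = mean_min_error(1)         # one pre-specified variant: near 0.5
best_of_124 = mean_min_error(124)  # best of 124 variants: clearly below 0.5
```

None of the variants carries any information about the labels, so the true error rate of each is 0.5. The gap between `single` and `best_of_124` is produced entirely by picking the winner after seeing the results.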
Starr: Simple Tiling Array Analysis of Affymetrix ChIP-chip data
Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an
assay for DNA-protein-binding or post-translational chromatin/histone
modifications. As with all high-throughput technologies, it requires a thorough
bioinformatic processing of the data for which there is no standard yet. The
primary goal is the reliable identification and localization of genomic regions
that bind a specific protein. The second step comprises comparison of binding
profiles of functionally related proteins, or of binding profiles of the same
protein in different genetic backgrounds or environmental conditions.
Ultimately, one would like to gain a mechanistic understanding of the effects
of DNA binding events on gene expression. We present a free, open-source R
package Starr that, in combination with the package Ringo, facilitates the
comparative analysis of ChIP-chip data across experiments and across different
microarray platforms. Core features are data import, quality assessment,
normalization and visualization of the data, and the detection of ChIP-enriched
genomic regions. The use of common Bioconductor classes ensures the
compatibility with other R packages. Most importantly, Starr provides methods
for integration of complementary genomics data, e.g., it enables systematic
investigation of the relation between gene expression and DNA binding.