Linear model for fast background subtraction in oligonucleotide microarrays
One important preprocessing step in the analysis of microarray data is
background subtraction. In high-density oligonucleotide arrays this is
recognized as a crucial step for the global performance of the data analysis
from raw intensities to expression values.
We propose here an algorithm for background estimation based on a model in
which the cost function is quadratic in a set of fitting parameters such that
minimization can be performed through linear algebra. The model incorporates
two effects: 1) Correlated intensities between neighboring features in the chip
and 2) sequence-dependent affinities for non-specific hybridization fitted by
an extended nearest-neighbor model.
The algorithm has been tested on 360 GeneChips from publicly available data
of recent expression experiments. The algorithm is fast and accurate. Strong
correlations between the fitted values for different experiments as well as
between the free-energy parameters and their counterparts in aqueous solution
indicate that the model captures a significant part of the underlying physical
chemistry.
Comment: 21 pages, 5 figures
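As a rough illustration of why a cost function that is quadratic in the fitting parameters can be minimized by linear algebra alone, the sketch below fits a toy linear model through the normal equations. The feature columns (an intercept, a neighboring-feature intensity, and a single sequence-derived feature), the data, and the function names `solve_linear` and `fit_least_squares` are all hypothetical; this is not the model or the code of the paper.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is square and non-singular (no zero-pivot handling).
    """
    n = len(b)
    # Augmented matrix [A | b]
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_least_squares(X, y):
    """Minimise sum_i (y_i - sum_j X[i][j] * theta_j)^2.

    Setting the gradient of this quadratic cost to zero gives the
    normal equations (X^T X) theta = X^T y, a plain linear system.
    """
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear(XtX, Xty)


# Hypothetical toy data: columns are intercept, neighbor intensity, sequence feature.
X = [
    [1.0, 0.2, 3.0],
    [1.0, 0.5, 1.0],
    [1.0, 0.9, 2.0],
    [1.0, 0.4, 0.0],
    [1.0, 0.7, 4.0],
]
true_theta = [2.0, 1.5, -0.3]
# Noise-free responses, so the fit should recover true_theta.
y = [sum(x * t for x, t in zip(row, true_theta)) for row in X]

theta_hat = fit_least_squares(X, y)
```

Because no iterative optimizer is involved, the cost of the fit is that of building and solving one small linear system, which is in line with the speed the abstract reports.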
Specific and non specific hybridization of oligonucleotide probes on microarrays
Gene expression analysis by means of microarrays is based on the
sequence-specific binding of mRNA to DNA oligonucleotide probes and its
measurement using fluorescent labels. The binding of RNA fragments with
sequences other than the intended target is problematic because it adds a
"chemical background" to the signal that is unrelated to the expression level
of the target gene. The paper presents a molecular signature of specific and
non-specific hybridization with potential consequences for gene expression
analysis. We analyzed the signal intensities of perfect match (PM) and mismatch
(MM) probes of GeneChip microarrays to characterize the effects of specific and
non-specific hybridization. We found that these events give rise to different
relations between the PM and MM intensities as a function of the middle base of
the PMs, namely a triplet- (C>G=T>A>0) and a duplet-like (C=T>0>G=A) pattern of
the PM-MM log-intensity difference upon binding of specific and non-specific
RNA fragments, respectively. The systematic behaviour of the intensity
difference can be rationalized at the level of base pairings of DNA/RNA
oligonucleotide duplexes in the middle of the probe sequence. Non-specific
binding is characterized by the reversal of the central Watson-Crick (WC)
pairing for each PM/MM probe pair, whereas specific binding refers to the
combination of a WC and a self-complementary (SC) pairing in PM and MM probes,
respectively. The intensity of complementary MM introduces a systematic source
of variation which decreases the precision of expression measures based on the
MM intensities.
Unsupervised Classification for Tiling Arrays: ChIP-chip and Transcriptome
Tiling arrays make possible a large-scale exploration of the genome thanks to
probes which cover the whole genome at very high density, up to 2,000,000
probes. Biological questions usually addressed are either the expression
difference between two conditions or the detection of transcribed regions. In
this work we propose to consider simultaneously both questions as an
unsupervised classification problem by modeling the joint distribution of the
two conditions. In contrast to previous methods, we account for all available
information on the probes as well as biological knowledge like annotation and
spatial dependence between probes. Since probes are not biologically relevant
units we propose a classification rule for non-connected regions covered by
several probes. Applications to transcriptomic and ChIP-chip data of
Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the
importance of precise modeling and of region classification.
Probing Hybridization parameters from microarray experiments: nearest neighbor model and beyond
In this article it is shown how optimized and dedicated microarray
experiments can be used to study the thermodynamics of DNA hybridization for a
large number of different conformations in a highly parallel fashion. In
particular, free energy penalties for mismatches are obtained in two
independent ways and are shown to be correlated with values from melting
experiments in solution reported in the literature. The additivity principle,
which is at the basis of the nearest-neighbor model, and according to which the
penalty for two isolated mismatches is equal to the sum of the independent
penalties, is thoroughly tested. Additivity is shown to break down for a
mismatch distance below 5 nt. The behavior of mismatches in the vicinity of the
helix edges, and the behavior of tandem mismatches are also investigated.
Finally, some thermodynamic outlying sequences are observed and highlighted.
These sequences contain combinations of GA mismatches. The analysis of the
microarray data reported in this article provides new insights on the DNA
hybridization parameters and can help to increase the accuracy of
hybridization-based technologies.
Comment: 13 pages, 11 figures, 1 table, Supplementary Data available in
Appendix
Modeling and Estimation for Real-Time Microarrays
Microarrays are used for collecting information about a large number of different genomic particles simultaneously. Conventional fluorescent-based microarrays acquire data after the hybridization phase. During this phase, the target analytes (e.g., DNA fragments) bind to the capturing probes on the array and, by the end of it, supposedly reach a steady state. Therefore, conventional microarrays attempt to detect and quantify the targets with a single data point taken in the steady state. On the other hand, a novel technique, the so-called real-time microarray, capable of recording the kinetics of hybridization in fluorescent-based microarrays has recently been proposed. The richness of the information obtained therein promises higher signal-to-noise ratio, smaller estimation error, and broader assay detection dynamic range compared to conventional microarrays. In this paper, we study the signal processing aspects of the real-time microarray system design. In particular, we develop a probabilistic model for real-time microarrays and describe a procedure for the estimation of target amounts therein. Moreover, drawing on system identification ideas, we propose a novel technique for the elimination of cross hybridization. These are important steps toward developing optimal detection algorithms for real-time microarrays, and toward understanding their fundamental limitations.
ChIP-on-chip significance analysis reveals large-scale binding and regulation by human transcription factor oncogenes
ChIP-on-chip has emerged as a powerful tool to dissect the complex network of regulatory interactions between transcription factors and their targets. However, most ChIP-on-chip analysis methods use conservative approaches aimed at minimizing false-positive transcription factor targets. We present a model with improved sensitivity in detecting binding events from ChIP-on-chip data. Biochemically validated analysis in human T-cells reveals that three transcription factor oncogenes, NOTCH1, MYC, and HES1, bind one order of magnitude more promoters than previously thought. Gene expression profiling upon NOTCH1 inhibition shows broad-scale functional regulation across the entire range of predicted target genes, establishing a closer link between occupancy and regulation. Finally, the resolution of a more complete map of transcriptional targets reveals that MYC binds nearly all promoters bound by NOTCH1. Overall, these results suggest an unappreciated complexity of transcriptional regulatory networks and highlight the fundamental importance of genome-scale analysis to represent transcriptional programs.
Normalized Affymetrix expression data are biased by G-quadruplex formation
Probes with runs of four or more guanines (G-stacks) in their sequences can exhibit a level of hybridization that is unrelated to the expression levels of the mRNA that they are intended to measure. This is most likely caused by the formation of G-quadruplexes, where inter-probe guanines form Hoogsteen hydrogen bonds, which probes with G-stacks are capable of forming. We demonstrate that for a specific microarray data set using the Human HG-U133A Affymetrix GeneChip and RMA normalization there is significant bias in the expression levels, the fold change and the correlations between expression levels. These effects grow more pronounced as the number of G-stack probes in a probe set increases. Approximately 14% of the probe sets are directly affected. The analysis was repeated for a number of other normalization pipelines and two, FARMS and PLIER, minimized the bias to some extent. We estimate that ~15% of the data sets deposited in the GEO database are susceptible to the effect. The inclusion of G-stack probes in the affected data sets can bias key parameters used in the selection and clustering of genes. The elimination of these probes from any analysis in such affected data sets outweighs the increase of noise in the signal. © 2011 The Author(s)
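The G-stack criterion stated in this abstract (a run of four or more guanines in the probe sequence) is simple to check. The following is a minimal sketch of such a filter; the function names and the example probe sequences are made up for illustration and are not taken from the paper or from any Affymetrix annotation file.

```python
import re

# A G-stack, as defined in the abstract: four or more consecutive guanines.
G_STACK = re.compile(r"G{4,}")


def has_g_stack(sequence):
    """Return True if the probe sequence contains a run of >= 4 guanines."""
    return G_STACK.search(sequence.upper()) is not None


def count_g_stack_probes(probe_set):
    """Count how many probes in a probe set contain a G-stack."""
    return sum(has_g_stack(seq) for seq in probe_set)


def drop_g_stack_probes(probe_set):
    """Return the probe set with G-stack probes removed."""
    return [seq for seq in probe_set if not has_g_stack(seq)]


# Hypothetical 25-mer probes, for illustration only.
example_probe_set = [
    "ACGTACGTACGTGGGGACGTACGTA",  # contains GGGG
    "ACGTACGTACGTGGGAACGTACGTA",  # only GGG, not a G-stack
    "TTTTACGTACGTACGTACGTACGTA",  # no guanine run
]

n_affected = count_g_stack_probes(example_probe_set)
filtered = drop_g_stack_probes(example_probe_set)
```

Counting affected probes per probe set matters because the abstract reports that the bias grows with the number of G-stack probes in a probe set.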
A multi-view approach to cDNA micro-array analysis
Microarray has emerged as a powerful technology that enables biologists to study thousands of genes simultaneously and thereby to obtain a better understanding of gene interaction and regulation mechanisms. This paper is concerned with improving the processes involved in the analysis of microarray image data. The main focus is to clarify an image's feature space in an unsupervised manner. In this paper, the Image Transformation Engine (ITE), combined with different filters, is investigated. The proposed methods are applied to a set of real-world cDNA images. The MatCNN toolbox is used during the segmentation process. Quantitative comparisons between different filters are carried out. It is shown that the CLD filter is the best one to be applied with the ITE.
This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) of the UK under Grant GR/S27658/01, the National Science Foundation of China under Innovative Grant 70621001, Chinese Academy of Sciences under Innovative Group Overseas Partnership Grant, the BHP Billiton Cooperation of Australia Grant, the International Science and Technology Cooperation Project of China under Grant 2009DFA32050 and the Alexander von Humboldt Foundation of Germany.
Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments
Background: High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in-vivo interactions between DNA-associated proteins including transcription factors and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody assayable regulatory elements. The enrichment response is not ubiquitous across the genome. Typically 5 to 10% of tiled probes manifest some significant enrichment. Depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging. This is the motivation behind the proposed algorithm. Results: We have proposed a novel rank and replicate statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP-enrichment. The algorithm is optimized for identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods. The sensitivity and specificity of the algorithm have been characterized via quantitative PCR and independent computational approaches. 
Conclusion: The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array signal enrichment criterion, or a composite of these two measures. The sensitivities obtained subsequent to the segmentation of data using a meta p-value of 10^-5, an array signal enrichment of 0.2 and a composite of these two values are 88%, 87% and 95%, respectively.