2,580 research outputs found

    Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping

    Get PDF
    Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to frequent gross chromosomal alterations occurring in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows for direct phasing of the haplotype derived from each individual parent. In addition, genome-wide estimates of homologue specific copy- ratios (HSCRs) are important for interpretation of the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method to interpret bi- allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data which are suit- able for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error-model for interpretation of systematic variations arising in microarray platforms. The ability to directly determine haplotypes from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may have general interest in various population genetic applications. In addition, this property may be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotype. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples

    Physico-chemical foundations underpinning microarray and next-generation sequencing experiments

    Get PDF
    Hybridization of nucleic acids on solid surfaces is a key process involved in high-throughput technologies such as microarrays and, in some cases, next-generation sequencing (NGS). A physical understanding of the hybridization process helps to determine the accuracy of these technologies. The goal of a widespread research program is to develop reliable transformations between the raw signals reported by the technologies and individual molecular concentrations from an ensemble of nucleic acids. This research has inputs from many areas, from bioinformatics and biostatistics, to theoretical and experimental biochemistry and biophysics, to computer simulations. A group of leading researchers met in Ploen Germany in 2011 to discuss present knowledge and limitations of our physico-chemical understanding of high-throughput nucleic acid technologies. This meeting inspired us to write this summary, which provides an overview of the state-of-the-art approaches based on physico-chemical foundation to modeling of the nucleic acids hybridization process on solid surfaces. In addition, practical application of current knowledge is emphasized

    Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method

    Get PDF
    BACKGROUND: Low-level processing and normalization of microarray data are most important steps in microarray analysis, which have profound impact on downstream analysis. Multiple methods have been suggested to date, but it is not clear which is the best. It is therefore important to further study the different normalization methods in detail and the nature of microarray data in general. RESULTS: A methodological study of affine models for gene expression data is carried out. Focus is on two-channel comparative studies, but the findings generalize also to single- and multi-channel data. The discussion applies to spotted as well as in-situ synthesized microarray data. Existing normalization methods such as curve-fit ("lowess") normalization, parallel and perpendicular translation normalization, and quantile normalization, but also dye-swap normalization are revisited in the light of the affine model and their strengths and weaknesses are investigated in this context. As a direct result from this study, we propose a robust non-parametric multi-dimensional affine normalization method, which can be applied to any number of microarrays with any number of channels either individually or all at once. A high-quality cDNA microarray data set with spike-in controls is used to demonstrate the power of the affine model and the proposed normalization method. CONCLUSION: We find that an affine model can explain non-linear intensity-dependent systematic effects in observed log-ratios. Affine normalization removes such artifacts for non-differentially expressed genes and assures that symmetry between negative and positive log-ratios is obtained, which is fundamental when identifying differentially expressed genes. In addition, affine normalization makes the empirical distributions in different channels more equal, which is the purpose of quantile normalization, and may also explain why dye-swap normalization works or fails. All methods are made available in the aroma package, which is a platform-independent package for R

    Spectral analysis of gene expression profiles using gene networks

    Full text link
    Microarrays have become extremely useful for analysing genetic phenomena, but establishing a relation between microarray analysis results (typically a list of genes) and their biological significance is often difficult. Currently, the standard approach is to map a posteriori the results onto gene networks to elucidate the functions perturbed at the level of pathways. However, integrating a priori knowledge of the gene networks could help in the statistical analysis of gene expression data and in their biological interpretation. Here we propose a method to integrate a priori the knowledge of a gene network in the analysis of gene expression data. The approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation of the high-frequency components of the expression profiles with respect to the topology of the graph. We show how to derive unsupervised and supervised classification algorithms of expression profiles, resulting in classifiers with biological relevance. We applied the method to the analysis of a set of expression profiles from irradiated and non-irradiated yeast strains. It performed at least as well as the usual classification but provides much more biologically relevant results and allows a direct biological interpretation

    Microarray Data Preprocessing: From Experimental Design to Differential Analysis

    Get PDF
    DNA microarray data preprocessing is of utmost importance in the analytical path starting from the experimental design and leading to a reliable biological interpretation. In fact, when all relevant aspects regarding the experimental plan have been considered, the following steps from data quality check to differential analysis will lead to robust, trustworthy results. In this chapter, all the relevant aspects and considerations about microarray preprocessing will be discussed. Preprocessing steps are organized in an orderly manner, from experimental design to quality check and batch effect removal, including the most common visualization methods. Furthermore, we will discuss data representation and differential testing methods with a focus on the most common microarray technologies, such as gene expression and DNA methylation.Peer reviewe

    GaGa: A parsimonious and flexible model for differential expression analysis

    Full text link
    Hierarchical models are a powerful tool for high-throughput data with a small to moderate number of replicates, as they allow sharing information across units of information, for example, genes. We propose two such models and show its increased sensitivity in microarray differential expression applications. We build on the gamma--gamma hierarchical model introduced by Kendziorski et al. [Statist. Med. 22 (2003) 3899--3914] and Newton et al. [Biostatistics 5 (2004) 155--176], by addressing important limitations that may have hampered its performance and its more widespread use. The models parsimoniously describe the expression of thousands of genes with a small number of hyper-parameters. This makes them easy to interpret and analytically tractable. The first model is a simple extension that improves the fit substantially with almost no increase in complexity. We propose a second extension that uses a mixture of gamma distributions to further improve the fit, at the expense of increased computational burden. We derive several approximations that significantly reduce the computational cost. We find that our models outperform the original formulation of the model, as well as some other popular methods for differential expression analysis. The improved performance is specially noticeable for the small sample sizes commonly encountered in high-throughput experiments. Our methods are implemented in the freely available Bioconductor gaga package.Comment: Published in at http://dx.doi.org/10.1214/09-AOAS244 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
    corecore