
    Accurate estimation of homologue-specific DNA concentration-ratios in cancer samples allows long-range haplotyping

    Interpretation of allelic copy measurements at polymorphic markers in cancer samples presents distinctive challenges and opportunities. Due to the gross chromosomal alterations that frequently occur in cancer (aneuploidy), many genomic regions are present at homologous-allele imbalance. Within such regions, the unequal contribution of alleles at heterozygous markers allows direct phasing of the haplotype derived from each individual parent. In addition, genome-wide estimates of homologue-specific copy-ratios (HSCRs) are important for interpreting the cancer genome in terms of fixed integral copy-numbers. We describe HAPSEG, a probabilistic method for interpreting bi-allelic marker data in cancer samples. HAPSEG operates by partitioning the genome into segments of distinct copy number and modeling the four distinct genotypes in each segment. We describe general methods for fitting these models to data that are suitable for both SNP microarrays and massively parallel sequencing data. In addition, we demonstrate a specially tailored error model for interpreting the systematic variations that arise in microarray platforms. The ability to determine haplotypes directly from cancer samples represents an opportunity to expand reference panels of phased chromosomes, which may be of general interest in various population-genetics applications. This property may also be exploited to interrogate the relationship between germline risk and cancer phenotype with greater sensitivity than is possible using unphased genotypes. Finally, we exploit the statistical dependency of phased genotypes to enable the fitting of more elaborate sample-level error-model parameters, allowing more accurate estimation of HSCRs in cancer samples.
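    The phasing signal that this approach exploits can be illustrated in a few lines. The following is a minimal sketch, not the HAPSEG implementation: it assumes a single segment with homologue-specific copy ratios of 2 and 1 and simple Gaussian noise on observed B-allele fractions, whereas the actual method fits a full probabilistic model per segment.

```python
import numpy as np

# Illustrative sketch (not HAPSEG): within a segment where the two homologues
# are present at copy ratios r1 > r2, a heterozygous SNP whose B allele lies
# on the major homologue has expected B-allele fraction r1 / (r1 + r2);
# one on the minor homologue has r2 / (r1 + r2).
rng = np.random.default_rng(0)
r1, r2 = 2.0, 1.0                        # assumed homologue-specific copy ratios
n_snps = 200
true_phase = rng.integers(0, 2, n_snps)  # 1 = B allele on major homologue
expected_baf = np.where(true_phase == 1, r1 / (r1 + r2), r2 / (r1 + r2))
observed_baf = np.clip(expected_baf + rng.normal(0, 0.03, n_snps), 0, 1)

# Phase each SNP by assigning it to the nearer of the two expected fractions.
phased = (observed_baf > 0.5).astype(int)
print(f"correctly phased: {(phased == true_phase).mean():.1%}")
```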

    Impact of the spotted microarray preprocessing method on fold-change compression and variance stability

    Background: The standard approach for preprocessing spotted microarray data is to subtract the local background intensity from the spot foreground intensity, apply a log2 transformation, and normalize the data with a global median or lowess normalization. Although well motivated, the standard approaches to background correction and transformation have been widely criticized because they produce high variance at low intensities. While various alternatives to the standard background correction methods and to the log2 transformation have been proposed, the impacts of these two successive preprocessing steps had not been compared in an objective way.

    Results: In this study, we assessed the impact of eight preprocessing methods combining four background correction methods and two transformations (the log2 and the glog), using data from the MAQC study. The results indicate that most preprocessing methods produce fold-change compression at low intensities. Fold-change compression was minimized using the Standard and the Edwards background correction methods coupled with a log2 transformation. The drawback of both methods is high variance at low intensities, which consequently produces poor estimates of the p-values. On the other hand, effective stabilization of the variance, as well as better estimates of the p-values, was observed after the glog transformation.

    Conclusion: As both fold-change magnitudes and p-values are important in the context of microarray class-comparison studies, we recommend combining the Edwards correction with a hybrid transformation method that uses the log2 transformation to estimate fold-change magnitudes and the glog transformation to estimate p-values.
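    The contrast between the two transformations is easy to make concrete. Below is a minimal sketch assuming one common generalized-log form, glog_c(y) = log2((y + sqrt(y^2 + c^2)) / 2), with an arbitrary tuning constant c; published glog variants differ by constants, and the article's exact variant and parameter choices are not reproduced here.

```python
import numpy as np

def glog2(y, c):
    # Generalized log, base 2: approaches log2(y) for large y but stays
    # finite and variance-stabilizing for small or negative
    # background-corrected intensities.
    return np.log2((y + np.sqrt(y**2 + c**2)) / 2)

foreground = np.array([80.0, 150.0, 1200.0, 9000.0])
background = np.array([70.0, 90.0, 200.0, 250.0])
corrected = foreground - background      # standard background subtraction

print(np.log2(np.maximum(corrected, 0.5)))  # log2: unstable near zero
print(glog2(corrected, c=50.0))             # glog: stable at low intensity
```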

    Empirical Bayes models for multiple probe type microarrays at the probe level

    Background: When analyzing microarray data, a primary objective is often to find differentially expressed genes. With empirical Bayes and penalized t-tests, the sample variances are adjusted towards a global estimate, producing more stable results than ordinary t-tests. However, for Affymetrix-type data a clear dependency between variability and intensity level generally exists, even for logged intensities; this is most pronounced for data at the probe level but also holds for probe-set summaries such as the MAS5 expression index. As a consequence, adjustment towards a global estimate results in an intensity-level-dependent false positive rate.

    Results: We propose two new methods for finding differentially expressed genes, Probe-level Locally moderated Weighted median-t (PLW) and Locally Moderated Weighted-t (LMW). Both methods use an empirical Bayes model that takes the dependency between variability and intensity level into account. A global covariance matrix is also used, allowing for differing variances between arrays as well as array-to-array correlations. PLW is specially designed for Affymetrix-type arrays (or other multiple-probe arrays): instead of making inference on probe-set summaries, comparisons are made separately for each perfect-match probe and are then summarized into one score for the probe-set.

    Conclusion: The proposed methods are compared to 14 existing methods using five spike-in data sets. For RMA- and GCRMA-processed data, PLW has the most accurate ranking of regulated genes in four of the five data sets, and LMW consistently performs better than all examined moderated t-tests when used on RMA, GCRMA, and MAS5 expression indexes.
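    The general idea of intensity-local variance moderation, of which PLW and LMW are elaborated empirical Bayes versions, can be sketched simply. This illustration is not the PLW/LMW model: it shrinks each gene's sample variance toward the median variance of genes with similar average intensity (binned, with an arbitrary prior weight d0), where a purely global moderated t would shrink toward one common value.

```python
import numpy as np

def locally_moderated_t(mean_diff, s2, avg_intensity, n, d0=4.0, n_bins=20):
    # Shrink each gene's variance toward a local prior estimated from genes
    # of similar average intensity, then form a moderated t statistic.
    order = np.argsort(avg_intensity)
    s2_local = np.empty_like(s2)
    for idx in np.array_split(order, n_bins):
        s2_local[idx] = np.median(s2[idx])         # local prior variance
    d = n - 2                                      # residual df, two groups
    s2_mod = (d0 * s2_local + d * s2) / (d0 + d)   # limma-style shrinkage
    return mean_diff / np.sqrt(s2_mod * 4.0 / n)   # SE for n/2 vs n/2 design

# Toy data: 1000 genes, 8 arrays, variance decreasing with intensity.
rng = np.random.default_rng(1)
n_genes, n = 1000, 8
avg = rng.uniform(4, 14, n_genes)
s2 = rng.chisquare(6, n_genes) / 6 * np.exp(-(avg - 4) / 5)
diff = rng.normal(0, np.sqrt(s2 * 0.5))
print(locally_moderated_t(diff, s2, avg, n)[:5].round(2))
```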

    Microarray background correction: maximum likelihood estimation for the normal–exponential convolution

    Background correction is an important preprocessing step for microarray data that attempts to adjust the data for the ambient intensity surrounding each feature. The “normexp” method models the observed pixel intensities as the sum of two random variables, one normally distributed and the other exponentially distributed, representing background noise and signal, respectively. Using a saddle-point approximation, Ritchie and others (2007) found normexp to be the best background correction method for two-color microarray data. This article develops the normexp method further by improving the estimation of its parameters. A complete mathematical development is given of the normexp model and the associated saddle-point approximation. Some subtle numerical programming issues that caused the original normexp method to fail occasionally on unusual data sets are solved. A practical and reliable algorithm is developed for exact maximum likelihood estimation (MLE), using high-quality optimization software and the saddle-point estimates as starting values. MLE is shown to outperform heuristic estimators proposed by other authors, both in estimation accuracy and in performance on real data, while the saddle-point approximation remains an adequate replacement in most practical situations. The performance of normexp for assessing differential expression is improved by adding a small offset to the corrected intensities.
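    The corrected intensity under this model is the conditional expectation E[S | X = x] of the signal given the observation, which has a closed form. Below is a minimal sketch of that formula, assuming the mean parameterization of the exponential (alpha = mean signal) as used in limma-style normexp correction; the article's actual contribution, the exact maximum likelihood estimation of (mu, sigma, alpha), is not reproduced here.

```python
import numpy as np
from scipy.stats import norm

def normexp_signal(x, mu, sigma, alpha):
    # E[S | X = x] under X = B + S, with background B ~ N(mu, sigma^2)
    # and signal S ~ Exponential(mean = alpha). The result is always
    # positive, so the corrected intensities can be safely logged.
    mu_s = x - mu - sigma**2 / alpha
    z = mu_s / sigma
    # Compute the ratio phi(z) / Phi(z) in log space for numerical
    # stability when x is far below the background level.
    return mu_s + sigma * np.exp(norm.logpdf(z) - norm.logcdf(z))

x = np.array([60.0, 100.0, 500.0, 5000.0])
print(normexp_signal(x, mu=90.0, sigma=15.0, alpha=400.0))
```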

    Modeling gene expression measurement error: a quasi-likelihood approach

    BACKGROUND: Using suitable error models for gene expression measurements is essential in the statistical analysis of microarray data. However, the true probabilistic model underlying gene expression intensity readings is generally not known. Instead, currently used approaches either assume a simple parametric model (usually a transformed normal distribution) or estimate the empirical distribution. Neither strategy may be optimal for gene expression data: the non-parametric approach ignores known structural information, whereas fully parametric models run the risk of misspecification. A further related problem is the choice of a suitable scale for the model (e.g. observed vs. log-scale). RESULTS: Here a simple semi-parametric model for gene expression measurement error is presented. In this approach, inference is based on an approximate likelihood function (the extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In the case of gene expression, this information is available in the form of the postulated (e.g. quadratic) variance structure of the data. As the quasi-likelihood behaves (almost) like a proper likelihood, it allows the estimation of calibration and variance parameters, and it is also straightforward to obtain corresponding approximate confidence intervals. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. both on the original linear scale and on a transformed scale. It can also be employed in regression approaches to model systematic (e.g. array or dye) effects. CONCLUSIONS: The quasi-likelihood framework provides a simple and versatile approach to analyzing gene expression data that does not make any strong distributional assumptions about the underlying error model. For several simulated as well as real data sets it provides a better fit to the data than competing models, and in an example it also improved the power of tests to identify differential expression.
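    For reference, the extended quasi-likelihood has the standard Nelder–Pregibon form shown below. The quadratic variance function in the final comment is an assumed example: the abstract only says the variance structure is postulated to be quadratic, so the specific two-component form is an illustration, not the article's exact model.

```latex
% Extended quasi-likelihood for one observation y with mean \mu,
% dispersion \phi, and postulated variance function V(\cdot):
Q^{+}(y;\mu) \;=\; -\tfrac{1}{2}\log\bigl\{2\pi\,\phi\,V(y)\bigr\}
                 \;-\; \frac{D(y;\mu)}{2\phi},
\qquad
D(y;\mu) \;=\; 2\int_{\mu}^{y}\frac{y-u}{V(u)}\,du .
% Example quadratic structure (assumed form): V(\mu) = \sigma^{2} + c^{2}\mu^{2}.
```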

    Development of statistical methods for the analysis of single-cell RNA-seq data

    Single-cell RNA sequencing profiles the transcriptome of cells from diverse populations. A popular intermediate data format is a large count matrix of genes × cells. This type of data brings several analytical challenges. Here, I present three projects from my PhD that address particular aspects of working with such datasets:

    - The large number of cells in the count matrix is a challenge for fitting gamma-Poisson generalized linear models with existing tools. I developed a new R package called glmGamPoi to address this gap. I optimized the overdispersion estimation procedure to be fast and robust for datasets with many cells and small counts. Compared against two popular tools (edgeR and DESeq2), my inference is 6x to 13x faster and achieves a higher likelihood for a majority of the genes in four single-cell datasets.

    - The variance of single-cell RNA-seq counts depends on their mean, but many existing statistical tools perform best when the variance is uniform. Accordingly, variance-stabilizing transformations are applied to unlock the large number of methods with such a requirement. I compared four approaches to variance-stabilizing the data, based on the delta method, model residuals, an inferred latent expression state, or count factor analysis. I describe their theoretical strengths and weaknesses and compare their empirical performance in a benchmark on simulated and real single-cell data. I find that none of the mathematically more sophisticated transformations consistently outperforms the simple log(y/s+1) transformation (made concrete in the sketch after this list).

    - Multi-condition single-cell data offers the opportunity to find differentially expressed genes for individual cell subpopulations. However, the prevalent approach to analyzing such data is to first divide the cells into discrete populations and then test for differential expression within each group. The results are interpretable but may miss interesting cases by (1) choosing the cluster size too small and lacking the power to detect effects, or (2) choosing the cluster size too large and obscuring effects apparent at a smaller scale. I developed a new statistical framework for the analysis of multi-condition single-cell data that avoids this premature discretization. The approach performs regression on the latent subspaces occupied by the cells in each condition. The method is implemented as an R package called lemur.
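    The log(y/s+1) baseline from the second project is straightforward to state. A minimal sketch, assuming size factors proportional to each cell's total count (other size-factor definitions are possible, and the thesis's benchmark setup is not reproduced here):

```python
import numpy as np

def shifted_log_transform(counts):
    # The simple log(y/s + 1) transformation: divide each cell's counts by
    # its size factor s (total count relative to the mean total across
    # cells), then apply log1p. counts: genes x cells matrix.
    totals = counts.sum(axis=0)
    size_factors = totals / totals.mean()
    return np.log1p(counts / size_factors)

# Toy gamma-Poisson-like counts: 100 genes x 50 cells with varying depth.
rng = np.random.default_rng(2)
mu = rng.gamma(shape=0.5, scale=4.0, size=(100, 1)) * rng.uniform(0.5, 2.0, 50)
counts = rng.poisson(mu)
y = shifted_log_transform(counts)
print(y.shape, y.var(axis=1)[:5].round(2))
```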

    A statistical framework for the analysis of microarray probe-level data

    In microarray technology, a number of critical steps are required to convert the raw measurements into the data relied upon by biologists and clinicians. These data manipulations, referred to as preprocessing, influence the quality of the ultimate measurements and of the studies that rely upon them. Standard operating procedure for microarray researchers is to use preprocessed data as the starting point for the statistical analyses that produce reported results. This has prevented many researchers from carefully considering their choice of preprocessing methodology. Furthermore, the fact that the preprocessing step affects the stochastic properties of the final statistical summaries is often ignored. In this paper we propose a statistical framework that permits the integration of preprocessing into the standard statistical analysis flow of microarray data. This general framework is relevant to many microarray platforms and motivates targeted analysis methods for specific applications. We demonstrate its usefulness by applying the idea in three different applications of the technology.

    Comment: Published at http://dx.doi.org/10.1214/07-AOAS116 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
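    The point that preprocessing changes the stochastic properties of the final summaries can be demonstrated on toy data. This sketch is purely illustrative and is not the article's framework; the two "preprocessing" variants and all numeric settings are invented for the demonstration.

```python
import numpy as np

# Simulate repeated probe-level measurements of one feature, run two
# different background-correction variants, and compare the sampling
# variability of the final log2 summary statistic.
rng = np.random.default_rng(3)
n_rep, n_probes = 2000, 11
true_signal, bg = 200.0, 50.0

sd_full, sd_half = [], []
for _ in range(n_rep):
    probes = rng.normal(true_signal + bg, 60.0, n_probes)
    est_bg = rng.normal(bg, 10.0)          # noisy background estimate
    # Variant A: full background subtraction, floored at 1.
    sd_full.append(np.log2(np.median(np.maximum(probes - est_bg, 1.0))))
    # Variant B: subtract only half the background estimate.
    sd_half.append(np.log2(np.median(np.maximum(probes - 0.5 * est_bg, 1.0))))

# The same downstream summary has a different variance under each variant.
print(f"sd(A) = {np.std(sd_full):.3f}, sd(B) = {np.std(sd_half):.3f}")
```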