GaGa: A parsimonious and flexible model for differential expression analysis
Hierarchical models are a powerful tool for high-throughput data with a small
to moderate number of replicates, as they allow sharing information across
units, for example genes. We propose two such models and show their increased
sensitivity in microarray differential expression applications.
We build on the gamma--gamma hierarchical model introduced by Kendziorski et
al. [Statist. Med. 22 (2003) 3899--3914] and Newton et al. [Biostatistics 5
(2004) 155--176], by addressing important limitations that may have hampered
its performance and its more widespread use. The models parsimoniously describe
the expression of thousands of genes with a small number of hyper-parameters.
This makes them easy to interpret and analytically tractable. The first model
is a simple extension that improves the fit substantially with almost no
increase in complexity. We propose a second extension that uses a mixture of
gamma distributions to further improve the fit, at the expense of increased
computational burden. We derive several approximations that significantly
reduce the computational cost. We find that our models outperform the original
formulation of the model, as well as some other popular methods for
differential expression analysis. The improved performance is especially
noticeable for the small sample sizes commonly encountered in high-throughput
experiments. Our methods are implemented in the freely available Bioconductor
gaga package. Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org); DOI: 10.1214/09-AOAS244.
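To make the gamma-gamma setup concrete, below is a minimal Python sketch of the two-group calculation it implies: a closed-form marginal likelihood under a shared versus a group-specific Gamma rate, combined into a posterior probability of differential expression. The parameterization, hyperparameter values, and function names are illustrative assumptions, not the gaga package interface (which is in R).

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(x, a, a0, nu):
    """Log marginal likelihood of one gene's observations x under
    x_i | lam ~ Gamma(shape=a, rate=lam), lam ~ Gamma(shape=a0, rate=nu)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return ((a - 1.0) * np.log(x).sum()
            + gammaln(n * a + a0) + a0 * np.log(nu)
            - n * gammaln(a) - gammaln(a0)
            - (n * a + a0) * np.log(x.sum() + nu))

def prob_de(x1, x2, a, a0, nu, prior_de=0.1):
    """Posterior probability of differential expression: compare a single
    shared rate for both groups against a separate rate per group."""
    log_ee = log_marginal(np.concatenate([x1, x2]), a, a0, nu)  # equal expression
    log_de = log_marginal(x1, a, a0, nu) + log_marginal(x2, a, a0, nu)
    log_odds = np.log(prior_de) - np.log(1.0 - prior_de) + log_de - log_ee
    return 1.0 / (1.0 + np.exp(-log_odds))

# Toy data: three replicates per group, group 2 expressed ~3x higher.
rng = np.random.default_rng(1)
x1 = rng.gamma(shape=10.0, scale=1.0, size=3)
x2 = rng.gamma(shape=10.0, scale=3.0, size=3)
print(prob_de(x1, x2, a=10.0, a0=2.0, nu=1.0))
```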
On choosing mixture components via non-local priors
Choosing the number of mixture components remains an elusive challenge. Model
selection criteria can be either overly liberal or conservative and return
poorly-separated components of limited practical use. We formalize non-local
priors (NLPs) for mixtures and show how they lead to well-separated components
with non-negligible weight, interpretable as distinct subpopulations. We also
propose an estimator for posterior model probabilities under local and
non-local priors, showing that Bayes factors are ratios of posterior to prior
empty-cluster probabilities. The estimator is widely applicable and helps set
thresholds to drop unoccupied components in overfitted mixtures. We suggest
default prior parameters based on multi-modality for Normal/T mixtures and
minimal informativeness for categorical outcomes. We theoretically characterise
the NLP-induced sparsity and derive tractable expressions and algorithms. We fully
develop Normal, Binomial and product Binomial mixtures but the theory,
computation and principles hold more generally. We observed a serious lack of
sensitivity of the Bayesian information criterion (BIC), insufficient parsimony
of the AIC and a local prior, and a mixed behavior of the singular BIC. We also
considered overfitted mixtures, whose performance was competitive but depended
on tuning parameters. Under our default prior elicitation, NLPs offered a good
compromise between sparsity and power to detect meaningfully separated
components.
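The empty-cluster identity stated above lends itself to a simple Monte Carlo sketch. The Python fragment below, a caricature with invented names that assumes a symmetric Dirichlet prior on the component weights, estimates the prior probability that some component stays empty; the analogous estimate computed from posterior MCMC allocation samples would supply the numerator of the ratio described in the abstract.

```python
import numpy as np

def empty_prob(alloc_samples, k):
    """Fraction of allocation vectors leaving at least one of k components empty."""
    return float(np.mean([len(np.unique(z)) < k for z in alloc_samples]))

def prior_allocations(n, k, alpha, draws, rng):
    """Allocations from the prior: weights ~ Dirichlet(alpha,...,alpha), z_i ~ weights."""
    weights = rng.dirichlet([alpha] * k, size=draws)
    return [rng.choice(k, size=n, p=w) for w in weights]

rng = np.random.default_rng(0)
n, k, alpha = 50, 3, 1.0
prior_draws = prior_allocations(n, k, alpha, draws=2000, rng=rng)
print(empty_prob(prior_draws, k))  # denominator of the ratio

# With allocation samples from an MCMC run on actual data, the analogous
# posterior estimate would form the numerator:
# ratio = empty_prob(posterior_draws, k) / empty_prob(prior_draws, k)
```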
Big data and statistics: A statistician's perspective
Big data represents an unprecedented resource for tackling scientific, economic and social challenges, but it also increases the risk of drawing misleading conclusions: for example, by using approaches based exclusively on data that make no effort to understand the phenomenon under study, that aim at an elusive and shifting target, that ignore critical problems in data collection, that summarize or "cook" the data inadequately, or that mistake noise for signal. We review some success stories and illustrate how statistical principles can help extract more reliable information from data. We also address current challenges that require dynamic methodological research, such as computational efficiency strategies, the integration of heterogeneous data, extending the theoretical foundations to increasingly complex questions and, perhaps most importantly, training a new generation of scientists capable of developing and deploying these strategies.
Quantifying alternative splicing from paired-end RNA-sequencing data
RNA-sequencing has revolutionized biomedical research and, in particular, our
ability to study gene alternative splicing. The problem has important
implications for human health, as alternative splicing may be involved in
malfunctions at the cellular level and multiple diseases. However, the
high-dimensional nature of the data and the existence of experimental biases
pose serious data analysis challenges. We find that the standard data summaries
used to study alternative splicing are severely limited, as they ignore a
substantial amount of valuable information. Current data analysis methods are
based on such summaries and are hence suboptimal. Further, they have limited
flexibility in accounting for technical biases. We propose novel data summaries
and a Bayesian modeling framework that overcome these limitations and determine
biases in a nonparametric, highly flexible manner. These summaries adapt
naturally to the rapid improvements in sequencing technology. We provide
efficient point estimates and uncertainty assessments. The approach allows
studying alternative splicing patterns for individual samples and can also be
the basis for downstream analyses. We found a severalfold improvement in
estimation mean square error compared to popular approaches in simulations, and substantially
higher consistency between replicates in experimental data. Our findings
indicate the need for adjusting the routine summarization and analysis of
alternative splicing RNA-seq studies. We provide a software implementation in
the R package casper. Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org); DOI: 10.1214/13-AOAS687 (with correction).
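As a rough illustration of the kind of richer summary advocated here (and not the casper interface), the Python toy below counts the full exon path visited by each paired-end fragment rather than per-exon or per-junction reads; the input encoding is invented for the example.

```python
from collections import Counter

def exon_path(left_exons, right_exons):
    """Summarize a read pair as the ordered exons hit by each mate."""
    return (tuple(left_exons), tuple(right_exons))

# Each aligned fragment: exons crossed by the left and right read.
fragments = [
    ([1, 2], [3]),   # left read spans exons 1-2, right read falls in exon 3
    ([1, 2], [3]),
    ([1], [3]),      # also consistent with an isoform skipping exon 2
    ([1, 3], [4]),   # junction 1-3 observed directly
]

path_counts = Counter(exon_path(left, right) for left, right in fragments)
for path, count in path_counts.items():
    print(path, count)
```

Unlike separate exon and junction tallies, each path count preserves which exons were visited jointly by the two mates, which is the extra information the model exploits.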
Rhapso: Automatic stitching of mass segments from Fourier transform ion cyclotron resonance mass spectra
Fourier transform ion cyclotron resonance mass spectrometry (FTICR MS) provides the resolution and mass accuracy needed to analyze complex mixtures such as crude oil. When mixtures contain many different components, a competitive effect within the ICR cell takes place that hampers the detection of a potentially large fraction of the components. Recently, a new data collection technique, which consists of acquiring several spectra of small mass ranges and assembling a complete spectrum afterward, enabled the observation of a record number of peaks with greater accuracy compared to broadband methods. There is a need for statistical methods to combine and preprocess segmented acquisition data. A particular challenge of quadrupole isolation is that intensity drops near the window edges, hampering the stitching of consecutive windows. We developed an algorithm called Rhapso to stitch peak lists corresponding to multiple different m/z regions from crude oil samples. Rhapso corrects potential edge effects to enable the use of smaller windows and reduce the required overlap between windows, corrects mass shifts between windows, and generates a single peak list for the full spectrum. Relative to manual stitching, Rhapso increased the data processing speed and avoided potential human errors, simplifying the subsequent chemical analysis of the sample. Relative to a broadband spectrum, the stitched output showed a more than 2-fold increase in assigned peaks and reduced mass error by a factor of 2. Rhapso is expected to enable routine use of this spectral stitching method for ultracomplex samples, giving a more detailed characterization of existing samples and enabling the characterization of samples that were previously too complex to analyze.
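As a rough sketch of the stitching task (not Rhapso's actual algorithm), the Python fragment below aligns two overlapping peak lists by estimating a between-window mass shift from nearest-neighbour peak matches in the overlap region and then merging them; the matching tolerance and the simple median-shift correction are simplifying assumptions.

```python
import numpy as np

def stitch(window_a, window_b, tol=0.001):
    """Merge two sorted (m/z, intensity) peak lists whose m/z ranges overlap."""
    a_mz, b_mz = window_a[:, 0], window_b[:, 0]
    lo, hi = b_mz.min(), a_mz.max()                       # overlap region
    a_over = a_mz[(a_mz >= lo) & (a_mz <= hi)]
    b_over = b_mz[(b_mz >= lo) & (b_mz <= hi)]
    # Match each overlap peak in B to its nearest neighbour in A and take
    # the median difference as the between-window mass shift.
    idx = np.searchsorted(a_over, b_over).clip(1, len(a_over) - 1)
    nearer = np.where(np.abs(a_over[idx] - b_over) < np.abs(a_over[idx - 1] - b_over),
                      idx, idx - 1)
    matched = np.abs(a_over[nearer] - b_over) < tol
    shift = np.median(a_over[nearer][matched] - b_over[matched]) if matched.any() else 0.0
    corrected = window_b.copy()
    corrected[:, 0] += shift                              # align B onto A's mass scale
    # Keep A's peaks inside the overlap and B's peaks beyond it.
    merged = np.vstack([window_a, corrected[corrected[:, 0] > hi]])
    return merged[np.argsort(merged[:, 0])]

a = np.array([[100.0, 5.0], [100.5, 3.0], [101.0, 2.0], [101.5, 1.0]])
b = np.array([[100.9996, 2.0], [101.4996, 1.0], [102.0, 4.0]])
print(stitch(a, b))  # a shift of +0.0004 is applied to window b before merging
```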