Measuring reproducibility of high-throughput experiments
Reproducibility is essential to reliable scientific discovery in
high-throughput experiments. In this work we propose a unified approach to
measure the reproducibility of findings identified from replicate experiments
and identify putative discoveries using reproducibility. Unlike the usual
scalar measures of reproducibility, our approach creates a curve, which
quantitatively assesses when the findings are no longer consistent across
replicates. Our curve is fitted by a copula mixture model, from which we derive
a quantitative reproducibility score, which we call the "irreproducible
discovery rate" (IDR) analogous to the FDR. This score can be computed at each
set of paired replicate ranks and permits the principled setting of thresholds
both for assessing reproducibility and combining replicates. Since our approach
permits an arbitrary scale for each replicate, it provides useful descriptive
measures in a wide variety of situations to be explored. We study the
performance of the algorithm using simulations and give a heuristic analysis of
its theoretical properties. We demonstrate the effectiveness of our method in a
ChIP-seq experiment. Comment: Published at http://dx.doi.org/10.1214/11-AOAS466 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
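For intuition, here is a minimal Python sketch of an IDR-style calculation on two replicate score vectors. It is not the authors' implementation: the paper's copula mixture model is replaced by an off-the-shelf two-component Gaussian mixture fitted to the normal scores of the ranks, and the function and variable names (idr_curve, scores_rep1, scores_rep2) are hypothetical.

    import numpy as np
    from scipy.stats import norm, rankdata
    from sklearn.mixture import GaussianMixture

    def idr_curve(scores_rep1, scores_rep2):
        # Rank-transform each replicate; only ranks are used, so the original
        # measurement scale of each replicate is irrelevant.
        u1 = rankdata(scores_rep1) / (len(scores_rep1) + 1.0)
        u2 = rankdata(scores_rep2) / (len(scores_rep2) + 1.0)
        z = np.column_stack([norm.ppf(u1), norm.ppf(u2)])   # normal scores

        # Two-component mixture: a correlated "reproducible" component and a
        # weakly correlated "irreproducible" (noise) component.
        gm = GaussianMixture(n_components=2, covariance_type="full",
                             random_state=0).fit(z)
        corr = [c[0, 1] / np.sqrt(c[0, 0] * c[1, 1]) for c in gm.covariances_]
        noise = int(np.argmin(corr))                # lower-correlation component
        local_idr = gm.predict_proba(z)[:, noise]   # per-pair irreproducibility

        # IDR, analogous to the FDR: expected irreproducibility among the pairs
        # ranked above each threshold, i.e. a running mean of the sorted values.
        order = np.argsort(local_idr)
        idr = np.cumsum(local_idr[order]) / np.arange(1, len(order) + 1)
        return local_idr, idr, order

    # Usage: keep only findings whose IDR falls below a chosen threshold, e.g.
    # local_idr, idr, order = idr_curve(rep1_scores, rep2_scores)
    # selected = order[idr < 0.05]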
A Unifying Review of Linear Gaussian Models
Factor analysis, principal component analysis, mixtures of gaussian clusters, vector quantization, Kalman filter models, and hidden Markov models can all be unified as variations of unsupervised learning under a single basic generative model. This is achieved by collecting together disparate observations and derivations made by many previous authors and introducing a new way of linking discrete and continuous state models using a simple nonlinearity. Through the use of other nonlinearities, we show how independent component analysis is also a variation of the same basic generative model. We show that factor analysis and mixtures of gaussians can be implemented in autoencoder neural networks and learned using squared error plus the same regularization term. We introduce a new model for static data, known as sensible principal component analysis, as well as a novel concept of spatially adaptive observation noise. We also review some of the literature involving global and local mixtures of the basic models and provide pseudocode for inference and learning for all the basic models.
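To make the shared structure concrete, the following is a hedged Python sketch of the basic linear-Gaussian generative model the review builds on (state dynamics plus noisy linear observations); the specific matrices at the end are illustrative choices, not values from the paper.

    import numpy as np

    def sample_lgm(A, C, Q, R, x0, T, seed=0):
        # Basic generative model:
        #   x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)   (state dynamics)
        #   y_t     = C x_t + v_t,  v_t ~ N(0, R)   (observation)
        # Kalman filter models use it as written; factor analysis and PCA are
        # the static special case (no dynamics); mixtures of gaussians, vector
        # quantization, and HMMs arise when the continuous state is replaced
        # by a discrete one via the simple nonlinearity discussed above.
        rng = np.random.default_rng(seed)
        k, p = A.shape[0], C.shape[0]
        x, states, obs = x0, [], []
        for _ in range(T):
            x = A @ x + rng.multivariate_normal(np.zeros(k), Q)  # state update
            y = C @ x + rng.multivariate_normal(np.zeros(p), R)  # noisy observation
            states.append(x)
            obs.append(y)
        return np.array(states), np.array(obs)

    # Illustrative setting: a 2-D latent state observed in 3 dimensions.
    A = 0.9 * np.eye(2)
    C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    Q, R = 0.1 * np.eye(2), 0.5 * np.eye(3)
    states, obs = sample_lgm(A, C, Q, R, x0=np.zeros(2), T=100)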
Noisy independent component analysis of auto-correlated components
We present a new method for the separation of superimposed, independent,
auto-correlated components from noisy multi-channel measurements. The presented
method simultaneously reconstructs and separates the components, taking all
channels into account and thereby increasing the effective signal-to-noise
ratio considerably, allowing separations even in the high-noise regime.
Characteristics of the measurement instruments can be included, allowing for
application in complex measurement situations. Independent posterior samples
can be provided, permitting error estimates on all desired quantities. Using
the concept of information field theory, the algorithm is not restricted to any
dimensionality of the underlying space or discretization scheme thereof.
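For orientation, here is a small Python sketch of the kind of measurement model described above, not the authors' algorithm: independent, auto-correlated components are superimposed in several channels through a mixing/response matrix and observed with additive noise. The matrix M, the smoothing kernel, and the naive least-squares separation at the end are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(1)
    n_channels, n_components, n_pixels = 4, 2, 512

    # Independent, auto-correlated components: white noise smoothed with a
    # short moving-average kernel (correlation length of roughly 10 pixels).
    kernel = np.ones(10) / 10.0
    s = np.array([np.convolve(rng.normal(size=n_pixels), kernel, mode="same")
                  for _ in range(n_components)])

    M = rng.normal(size=(n_channels, n_components))        # mixing / instrument response
    noise = 0.3 * rng.normal(size=(n_channels, n_pixels))  # measurement noise
    d = M @ s + noise                                       # multi-channel data

    # Naive per-pixel least-squares separation; it ignores the auto-correlation
    # and the noise statistics that the method above exploits to raise the
    # effective signal-to-noise ratio.
    s_hat = np.linalg.lstsq(M, d, rcond=None)[0]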
Maximum-Likelihood Comparisons of Tully-Fisher and Redshift Data: Constraints on Omega and Biasing
We compare Tully-Fisher (TF) data for 838 galaxies within cz=3000 km/sec from
the Mark III catalog to the peculiar velocity and density fields predicted from
the 1.2 Jy IRAS redshift survey. Our goal is to test the relation between the
galaxy density and velocity fields predicted by gravitational instability
theory and linear biasing, and thereby to estimate $\beta_I \equiv \Omega^{0.6}/b_I$, where $b_I$ is the linear bias parameter for IRAS galaxies.
Adopting the IRAS velocity and density fields as a prior model, we maximize the
likelihood of the raw TF observables, taking into account the full range of
selection effects and properly treating triple-valued zones in the
redshift-distance relation. Extensive tests with realistic simulated galaxy
catalogs demonstrate that the method produces unbiased estimates of $\beta_I$
and its error. When we apply the method to the real data, we model the presence
of a small but significant velocity quadrupole residual (~3.3% of Hubble flow),
which we argue is due to density fluctuations incompletely sampled by IRAS. The
method then yields a maximum likelihood estimate of $\beta_I$ together with its
1-sigma error. We discuss the constraints on $\Omega$ and biasing that follow
if we assume a COBE-normalized CDM power spectrum. Our model also yields the
1-D noise in the velocity field, including IRAS prediction errors, which
we find to be 125 +/- 20 km/sec. Comment: 53 pages, 20 encapsulated figures, two tables. Submitted to the
Astrophysical Journal. Also available at http://astro.stanford.edu/jeff
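For reference, the standard linear-theory relation underlying such TF-versus-IRAS comparisons can be written as below; this is textbook gravitational-instability material rather than a quotation from the abstract, and it shows why the data constrain only the degenerate combination $\beta_I$.

    % Linear-theory prediction of the peculiar velocity field from the IRAS
    % density contrast (distances in km/s, i.e. H_0 = 1); the TF comparison
    % constrains the combination beta_I.
    \mathbf{v}(\mathbf{r}) = \frac{\beta_I}{4\pi}
        \int d^3r'\, \delta_{\mathrm{IRAS}}(\mathbf{r}')\,
        \frac{\mathbf{r}' - \mathbf{r}}{\left|\mathbf{r}' - \mathbf{r}\right|^{3}},
    \qquad
    \beta_I \equiv \frac{\Omega^{0.6}}{b_I}.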