Improving Accuracy in Cell-Perturbation Experiments by Leveraging Auxiliary Information
Modern cell-perturbation experiments expose cells to panels of hundreds of
stimuli, such as cytokines or CRISPR guides that perform gene knockouts. These
experiments are designed to investigate whether a particular gene is
upregulated or downregulated by exposure to each treatment. However, due to
high levels of experimental noise, typical estimators of whether a gene is up-
or down-regulated make many errors. In this paper, we make two contributions.
Our first contribution is a new estimator of regulatory effect that makes use
of Gaussian processes and factor analysis to leverage auxiliary information
about similarities among treatments, such as the chemical similarity among the
drugs used to perturb cells. The new estimator typically has lower variance,
but higher bias, than unregularized estimators, which do not use auxiliary
information. To assess whether this new estimator improves accuracy (i.e.,
achieves a favorable trade-off between bias and variance), we cannot simply
compute its error on held-out data, because "ground truth" about the effects
of treatments is unavailable. Our second contribution is a novel data-splitting
method to evaluate error rates. This data-splitting method produces valid error
bounds using "sign-valid" estimators, which by definition have the correct sign
more often than not. Applying it in a series of case studies, we find that our
new estimator, which leverages auxiliary information, can yield a three-fold
reduction in the type S error rate.
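A minimal sketch of the shrinkage idea this abstract describes, assuming a squared-exponential similarity kernel over hypothetical treatment features; this is illustrative, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

n_treatments = 100
# Hypothetical auxiliary information: feature vectors for each treatment
# (e.g., chemical descriptors of the drugs used to perturb cells).
features = rng.normal(size=(n_treatments, 5))

# Squared-exponential similarity kernel over treatments.
sq_dists = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 1e-8 * np.eye(n_treatments)  # jitter for stability

# Simulate true effects from the GP prior; unregularized estimates are noisy.
true_effects = rng.multivariate_normal(np.zeros(n_treatments), K)
sigma2 = 1.0  # measurement-noise variance
raw = true_effects + rng.normal(scale=np.sqrt(sigma2), size=n_treatments)

# GP posterior mean: shrinks the raw estimates toward the prior, trading
# higher bias for lower variance.
shrunk = K @ np.linalg.solve(K + sigma2 * np.eye(n_treatments), raw)

def sign_error_rate(est):
    """Fraction of effects whose estimated sign is wrong (a simple
    stand-in for the type S error rate)."""
    return np.mean(np.sign(est) != np.sign(true_effects))

print(f"sign errors, raw:    {sign_error_rate(raw):.3f}")
print(f"sign errors, shrunk: {sign_error_rate(shrunk):.3f}")
```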
Normalizing Flows for Knockoff-free Controlled Feature Selection
Controlled feature selection aims to discover the features a response depends
on while limiting the false discovery rate (FDR) to a predefined level.
Recently, multiple deep-learning-based methods have been proposed to perform
controlled feature selection through the Model-X knockoff framework. We
demonstrate, however, that these methods often fail to control the FDR for two
reasons. First, these methods often learn inaccurate models of features.
Second, the "swap" property, which is required for knockoffs to be valid, is
often not well enforced. We propose a new procedure called FlowSelect that
remedies both of these problems. To more accurately model the features,
FlowSelect uses normalizing flows, the state-of-the-art method for density
estimation. To circumvent the need to enforce the swap property, FlowSelect
uses a novel MCMC-based procedure to calculate p-values for each feature
directly. Asymptotically, FlowSelect computes valid p-values. Empirically,
FlowSelect consistently controls the FDR on both synthetic and semi-synthetic
benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also
demonstrates greater power on these benchmarks. Additionally, FlowSelect
correctly infers the genetic variants associated with specific soybean traits
from GWAS data.
Comment: 20 pages, 9 figures, 3 tables
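A minimal sketch of the final step in p-value-based controlled feature selection: turning per-feature p-values into an FDR-controlled discovery set. The Benjamini-Hochberg procedure shown here is a standard choice, assumed for illustration; FlowSelect's normalizing-flow density model and MCMC p-value computation are not shown:

```python
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, fdr_level: float = 0.1) -> np.ndarray:
    """Return indices of features declared significant at the given FDR level."""
    m = len(p_values)
    order = np.argsort(p_values)
    thresholds = fdr_level * np.arange(1, m + 1) / m
    below = p_values[order] <= thresholds
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])  # largest rank i with p_(i) <= (i/m) * q
    return order[: k + 1]

# Toy example: 5 truly associated features (tiny p-values) among 100.
rng = np.random.default_rng(1)
p = rng.uniform(size=100)
p[:5] = rng.uniform(high=1e-4, size=5)
print("selected features:", sorted(benjamini_hochberg(p, fdr_level=0.1)))
```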
Approximate Inference for Constructing Astronomical Catalogs from Images
We present a new, fully generative model for constructing astronomical
catalogs from optical telescope image sets. Each pixel intensity is treated as
a random variable with parameters that depend on the latent properties of stars
and galaxies. These latent properties are themselves modeled as random. We
compare two procedures for posterior inference. One procedure is based on
Markov chain Monte Carlo (MCMC) while the other is based on variational
inference (VI). The MCMC procedure excels at quantifying uncertainty, while the
VI procedure is 1000 times faster. On a supercomputer, the VI procedure
efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50
terabytes of images in 14.6 minutes, demonstrating the scaling characteristics
necessary to construct catalogs for upcoming astronomical surveys.Comment: accepted to the Annals of Applied Statistic
Diffusion Models for Probabilistic Deconvolution of Galaxy Images
Telescopes capture images with a particular point spread function (PSF).
Inferring what an image would have looked like with a much sharper PSF, a
problem known as PSF deconvolution, is ill-posed because PSF convolution is not
an invertible transformation. Deep generative models are appealing for PSF
deconvolution because they can infer a posterior distribution over candidate
images that, if convolved with the PSF, could have generated the observation.
However, classical deep generative models such as VAEs and GANs often provide
inadequate sample diversity. As an alternative, we propose a classifier-free
conditional diffusion model for PSF deconvolution of galaxy images. We
demonstrate that this diffusion model captures a greater diversity of possible
deconvolutions compared to a conditional VAE.
Comment: Accepted to the ICML 2023 Workshop on Machine Learning for Astrophysics
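A minimal sketch of classifier-free guidance inside a reverse diffusion step, with a placeholder denoiser standing in for a trained network; the noise schedule, guidance weight, and `eps_model` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x_t, t, cond):
    """Placeholder noise predictor; a real model is a trained neural
    network. `cond=None` requests the unconditional prediction."""
    return 0.1 * x_t if cond is None else 0.1 * x_t - 0.01 * cond

def guided_reverse_step(x_t, t, observed, guidance_weight=2.0):
    eps_cond = eps_model(x_t, t, observed)
    eps_uncond = eps_model(x_t, t, None)
    # Classifier-free guidance: extrapolate toward the conditional prediction.
    eps = (1 + guidance_weight) * eps_cond - guidance_weight * eps_uncond
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.normal(size=x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise

# Sampling loop: start from noise, condition on the blurry observation.
observed = rng.normal(size=(32, 32))  # stand-in for a PSF-convolved galaxy image
x = rng.normal(size=(32, 32))
for t in reversed(range(T)):
    x = guided_reverse_step(x, t, observed)
print(x.shape)
```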