79 research outputs found

    Improving Accuracy in Cell-Perturbation Experiments by Leveraging Auxiliary Information

    Modern cell-perturbation experiments expose cells to panels of hundreds of stimuli, such as cytokines or CRISPR guides that perform gene knockouts. These experiments are designed to investigate whether a particular gene is upregulated or downregulated by exposure to each treatment. However, due to high levels of experimental noise, typical estimators of whether a gene is up- or down-regulated make many errors. In this paper, we make two contributions. Our first contribution is a new estimator of regulatory effect that makes use of Gaussian processes and factor analysis to leverage auxiliary information about similarities among treatments, such as the chemical similarity among the drugs used to perturb cells. The new estimator typically has lower variance than unregularized estimators, which do not use auxiliary information, but higher bias. To assess whether this new estimator improves accuracy (i.e., achieves a favorable trade-off between bias and variance), we cannot simply compute its error on held-out data, as "ground truth" about the effects of treatments is unavailable. Our second contribution is a novel data-splitting method to evaluate error rates. This data-splitting method produces valid error bounds using "sign-valid" estimators, which by definition have the correct sign more often than not. Using this data-splitting method, through a series of case studies we find that our new estimator, which leverages auxiliary information, can yield a three-fold reduction in type S error rate.

    Normalizing Flows for Knockoff-free Controlled Feature Selection

    Controlled feature selection aims to discover the features a response depends on while limiting the false discovery rate (FDR) to a predefined level. Recently, multiple deep-learning-based methods have been proposed to perform controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the FDR for two reasons. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect that remedies both of these problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. To circumvent the need to enforce the swap property, FlowSelect uses a novel MCMC-based procedure to calculate p-values for each feature directly. Asymptotically, FlowSelect computes valid p-values. Empirically, FlowSelect consistently controls the FDR on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also demonstrates greater power on these benchmarks. Additionally, FlowSelect correctly infers the genetic variants associated with specific soybean traits from GWAS data.
    Comment: 20 pages, 9 figures, 3 tables

    Approximate Inference for Constructing Astronomical Catalogs from Images

    We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies. These latent properties are themselves modeled as random. We compare two procedures for posterior inference. One procedure is based on Markov chain Monte Carlo (MCMC) while the other is based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys.
    Comment: accepted to the Annals of Applied Statistics

    Diffusion Models for Probabilistic Deconvolution of Galaxy Images

    Telescopes capture images with a particular point spread function (PSF). Inferring what an image would have looked like with a much sharper PSF, a problem known as PSF deconvolution, is ill-posed because PSF convolution is not an invertible transformation. Deep generative models are appealing for PSF deconvolution because they can infer a posterior distribution over candidate images that, if convolved with the PSF, could have generated the observation. However, classical deep generative models such as VAEs and GANs often provide inadequate sample diversity. As an alternative, we propose a classifier-free conditional diffusion model for PSF deconvolution of galaxy images. We demonstrate that this diffusion model captures a greater diversity of possible deconvolutions compared to a conditional VAE.
    Comment: Accepted to the ICML 2023 Workshop on Machine Learning for Astrophysics