Diffusion models for probabilistic programming
We propose Diffusion Model Variational Inference (DMVI), a novel method for
automated approximate inference in probabilistic programming languages (PPLs).
DMVI utilizes diffusion models as variational approximations to the true
posterior distribution by deriving a novel bound on the marginal likelihood
objective used in Bayesian modelling. DMVI is easy to implement, allows
hassle-free inference in PPLs without the drawbacks of, e.g., variational
inference using normalizing flows, and does not impose any constraints on the
underlying neural network model. We evaluate DMVI on a set of common Bayesian
models and show that its posterior inferences are in general more accurate than
those of contemporary methods used in PPLs while having a similar computational
cost and requiring less manual tuning.
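The core mechanic that DMVI builds on, fitting a variational approximation by maximizing a bound on the marginal likelihood, can be illustrated on a toy conjugate model. The sketch below uses a plain Gaussian variational family and the reparameterization trick in NumPy; the diffusion-model approximation and the paper's specific bound are replaced by these much simpler stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1). It is conjugate,
# so the exact posterior N(sum(y)/(n+1), 1/(n+1)) is available for checking.
y = rng.normal(2.0, 1.0, size=20)
n = y.size

# Variational family q(theta) = N(mu, sigma^2); DMVI uses a diffusion model
# here instead, which is far more flexible. We maximize a Monte Carlo
# estimate of the ELBO = E_q[log p(y, theta)] + entropy(q) with the
# reparameterization trick theta = mu + sigma * eps, eps ~ N(0, 1).
mu, log_sigma = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    eps = rng.normal(size=64)
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps
    # d/dtheta log p(y, theta) = -theta + sum_i (y_i - theta)
    dlogp = -theta + np.sum(y) - n * theta
    grad_mu = np.mean(dlogp)                       # chain rule: dtheta/dmu = 1
    grad_ls = np.mean(dlogp * eps * sigma) + 1.0   # +1 is the entropy gradient
    mu += lr * grad_mu
    log_sigma += lr * grad_ls

post_mean, post_sd = np.sum(y) / (n + 1), np.sqrt(1.0 / (n + 1))
print(mu, np.exp(log_sigma))        # converges to post_mean, post_sd
```

Because the toy model is conjugate, the fitted mean and standard deviation can be checked against the exact posterior.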
Simulation-based inference using surjective sequential neural likelihood estimation
We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel
method for simulation-based inference in models where the evaluation of the
likelihood function is not tractable and only a simulator that can generate
synthetic data is available. SSNL fits a dimensionality-reducing surjective
normalizing flow model and uses it as a surrogate likelihood function which
allows for conventional Bayesian inference using either Markov chain Monte
Carlo methods or variational inference. By embedding the data in a
low-dimensional space, SSNL solves several issues previous likelihood-based
methods had when applied to high-dimensional data sets that, for instance,
contain non-informative data dimensions or lie along a lower-dimensional
manifold. We evaluate SSNL on a wide variety of experiments and show that it
generally outperforms contemporary methods used in simulation-based inference,
for instance, on a challenging real-world example from astrophysics which
models the magnetic field strength of the sun using a solar dynamo model.
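The pipeline the abstract describes, embedding the data in a low-dimensional space and using a learned surrogate likelihood inside a standard sampler, can be sketched on a toy problem. Here PCA stands in for the surjective normalizing flow and a linear-Gaussian regression stands in for the conditional neural density estimator; none of this is the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

dim, n_sims = 10, 2000
u = np.zeros(dim); u[0] = 1.0       # only one informative direction

def simulator(theta):
    # high-dimensional synthetic data; dims 1..9 carry no signal
    return theta * u + rng.normal(0.0, 0.5, size=dim)

# 1) pilot simulations from the prior theta ~ N(0, 2^2)
thetas = rng.normal(0.0, 2.0, size=n_sims)
xs = np.stack([simulator(t) for t in thetas])

# 2) dimensionality reduction: first principal component (SSNL learns a
# surjective normalizing flow instead)
_, _, vt = np.linalg.svd(xs - xs.mean(axis=0), full_matrices=False)
proj = vt[0]
z = xs @ proj

# 3) surrogate likelihood z | theta ~ N(a*theta + b, s2), fit by least
# squares (in place of a conditional neural density estimator)
A = np.stack([thetas, np.ones(n_sims)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, z, rcond=None)
s2 = np.mean((z - (a * thetas + b)) ** 2)

def log_post(theta, z_obs):
    # surrogate log-likelihood plus N(0, 2^2) log-prior, up to constants
    return -0.5 * (z_obs - (a * theta + b)) ** 2 / s2 - 0.5 * theta**2 / 4.0

# 4) Metropolis-Hastings on theta against the surrogate posterior
theta_true = 1.5
z_obs = simulator(theta_true) @ proj
cur, cur_lp = 0.0, log_post(0.0, z_obs)
samples = []
for _ in range(20000):
    prop = cur + rng.normal(0.0, 0.5)
    prop_lp = log_post(prop, z_obs)
    if np.log(rng.uniform()) < prop_lp - cur_lp:
        cur, cur_lp = prop, prop_lp
    samples.append(cur)

post = np.array(samples[5000:])
print(post.mean(), post.std())
```

The chain concentrates near the true parameter even though the sampler never touches the ten-dimensional data directly, which is the point of the surrogate construction.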
PyBDA: a command line tool for automated analysis of big biological data sets
Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs on a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
Structured hierarchical models for probabilistic inference from perturbation screening data
Genetic perturbation screening is an experimental method in biology for studying cause-and-effect relationships between biological entities. However, knocking out or knocking down genes is a highly error-prone process, which complicates estimation of the effect sizes of the interventions. Here, we introduce a family of generative models, called the structured hierarchical model (SHM), for probabilistic inference of causal effects from perturbation screens. SHMs utilize classical hierarchical models to represent heterogeneous data and combine them with categorical Markov random fields to encode biological prior information over functionally related biological entities. The random field induces a clustering of functionally related genes, which informs inference of the parameters of the hierarchical model. The SHM is designed for extremely noisy data sets for which the true data-generating process is difficult to model due to a lack of domain knowledge or high stochasticity of the interventions. We apply the SHM to a pan-cancer genetic perturbation screen to identify genes that restrict the growth of an entire group of cancer cell lines, and show that incorporating prior knowledge in the form of a graph improves inference of parameters.
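The hierarchical-pooling idea behind the SHM can be illustrated with a small simulation. In this sketch the categorical Markov random field over a biological graph is replaced by known cluster labels, so only the partial-pooling effect of the hierarchy on noisy screen data is shown; all sizes and variances are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy version of the SHM layering: cluster -> gene effect -> noisy replicates.
n_clusters, genes_per_cluster, n_reps = 3, 20, 4
cluster_means = rng.normal(0.0, 2.0, size=n_clusters)   # cluster-level effects
labels = np.repeat(np.arange(n_clusters), genes_per_cluster)
gene_effects = cluster_means[labels] + rng.normal(0.0, 0.3, size=labels.size)
# very noisy screen read-outs, as in perturbation data
data = gene_effects[:, None] + rng.normal(0.0, 2.0, size=(labels.size, n_reps))

# No pooling: per-gene sample means
naive = data.mean(axis=1)

# Hierarchical shrinkage toward the cluster mean (closed form for known
# variances: precision-weighted combination of gene mean and cluster mean)
tau2, sigma2 = 0.3**2, 2.0**2 / n_reps
cluster_hat = np.array([naive[labels == k].mean() for k in range(n_clusters)])
w = (1 / sigma2) / (1 / sigma2 + 1 / tau2)
pooled = w * naive + (1 - w) * cluster_hat[labels]

err_naive = np.mean((naive - gene_effects) ** 2)
err_pooled = np.mean((pooled - gene_effects) ** 2)
print(err_naive, err_pooled)   # pooling reduces the error on noisy data
```

The shrinkage weight here is the textbook normal-normal posterior weight; the SHM instead learns the clustering itself through the random field rather than assuming the labels.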
NetReg: Network-regularized linear models for biological association studies
Summary: Modelling biological associations or dependencies using linear regression is often complicated when the analyzed data sets are high-dimensional and fewer observations than variables are available (n ≪ p). For genomic data sets, penalized regression methods have been applied to address this issue. Recently proposed regression models utilize prior knowledge on dependencies, e.g. in the form of graphs, arguing that this information leads to more reliable estimates of regression coefficients. However, none of the proposed models for multivariate genomic response variables have been implemented as a computationally efficient, freely available library. In this paper we propose netReg, a package for graph-penalized regression models that use large networks and thousands of variables. netReg incorporates a priori generated biological graph information into linear models, yielding sparse or smooth solutions for regression coefficients.
Availability and implementation: netReg is implemented as both an R package and a C++ command line tool. The main computations are done in C++, where we use Armadillo for fast matrix calculations and Dlib for optimization. The R package is freely available on Bioconductor: https://bioconductor.org/packages/netReg. The command line tool can be installed from the Bioconda conda channel. Installation details, issue reports, development versions, documentation and tutorials for the R and C++ versions, as well as the R package vignette, can be found on GitHub: https://dirmeier.github.io/netReg/. The GitHub page also contains code for benchmarking and the example datasets used in this paper.
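The model class netReg targets can be sketched from scratch: a least-squares loss with a lasso penalty plus a graph-Laplacian smoothness penalty, solved here by proximal gradient descent. This illustrates the objective only, not the package's API; the chain graph and penalty weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Objective (a sketch of the graph-penalized model class):
#   min_b  ||y - X b||^2 / (2n) + lam1 * ||b||_1 + lam2 * b' L b
# where L is the Laplacian of a prior graph over the coefficients.
n, p = 100, 10

# made-up prior graph: a chain 0-1-...-9, encouraging neighbouring
# coefficients to take similar values
A = np.zeros((p, p))
for i in range(p - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

X = rng.normal(size=(n, p))
b_true = np.array([1.0, 1.0, 1.0, 1.0, 0, 0, 0, 0, 0, 0])
y = X @ b_true + rng.normal(0.0, 0.5, size=n)

# proximal gradient (ISTA): gradient step on the smooth part, then
# soft-thresholding for the l1 part
lam1, lam2 = 0.05, 0.05
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n + 2 * lam2 * L).max()
b = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (X @ b - y) / n + 2 * lam2 * (L @ b)
    b = b - step * grad
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)

print(np.round(b, 2))   # roughly sparse-and-smooth recovery of b_true
```

The lasso term zeroes out the uninformative coefficients while the Laplacian term pulls graph neighbours toward each other, which is the "sparse or smooth solutions" trade-off the abstract refers to.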
netReg: network-regularized linear models for biological association studies
Dirmeier S, Fuchs C, Mueller NS, Theis FJ. netReg: network-regularized linear models for biological association studies. Bioinformatics. 2017;34(5):896-898.
Evaluating the Robustness of Deep Learning Models for Mobility Prediction Through Causal Interventions
Changes in the characteristics of mobility data can significantly influence the predictive performance of deep learning models. However, there is still a lack of understanding of the degree of these impacts and of the robustness of deep learning models to variability in these characteristics. This hinders the development of benchmark datasets for evaluating different mobility prediction models. In this study, we use a causal intervention approach to evaluate the robustness of deep learning models under different interventions on mobility data characteristics, using both traffic forecasting and individual next-location prediction as case studies.
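The intervention protocol can be sketched in a few lines: fit a predictor, intervene on one characteristic of the test data while keeping the structural equation fixed, and compare errors before and after. The linear model and features below are toy stand-ins for the paper's deep learning models and mobility datasets.

```python
import numpy as np

rng = np.random.default_rng(4)

# Structural equation for a toy "traffic flow" outcome; the intervention
# will change one characteristic (the spread of the speed feature) while
# keeping the equation itself fixed.
def generate(n, speed_sd):
    speed = rng.normal(30.0, speed_sd, size=n)   # e.g. average travel speed
    hour = rng.uniform(0.0, 24.0, size=n)
    flow = 2.0 * speed - 0.5 * hour + rng.normal(0.0, 3.0, size=n)
    X = np.stack([speed, hour, np.ones(n)], axis=1)
    return X, flow

# train a simple predictor on observational data
X_tr, y_tr = generate(1000, speed_sd=5.0)
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def mse(X, y):
    return np.mean((X @ w - y) ** 2)

# baseline test error vs. error after an intervention that doubles the
# spread of the speed feature
base = mse(*generate(1000, speed_sd=5.0))
intervened = mse(*generate(1000, speed_sd=10.0))

print(base, intervened)   # close values indicate robustness to this shift
```

A large gap between the two errors would flag sensitivity to that data characteristic, which is the kind of robustness signal the study extracts for its deep learning models.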