Diffusion models for probabilistic programming
We propose Diffusion Model Variational Inference (DMVI), a novel method for
automated approximate inference in probabilistic programming languages (PPLs).
DMVI utilizes diffusion models as variational approximations to the true
posterior distribution by deriving a novel bound on the marginal likelihood
objective used in Bayesian modelling. DMVI is easy to implement, allows
hassle-free inference in PPLs without the drawbacks of, e.g., variational
inference using normalizing flows, and does not impose any constraints on the
underlying neural network model. We evaluate DMVI on a set of common Bayesian
models and show that its posterior inferences are in general more accurate than
those of contemporary methods used in PPLs while having a similar computational
cost and requiring less manual tuning.
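The core mechanic that DMVI builds on, fitting a variational approximation by maximizing a bound on the marginal likelihood, can be illustrated on a toy conjugate model. The sketch below uses a plain Gaussian variational family and the reparameterization trick in NumPy; the diffusion-model approximation and the paper's specific bound are replaced by these much simpler stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1), y_i | theta ~ N(theta, 1). It is conjugate,
# so the exact posterior N(sum(y)/(n+1), 1/(n+1)) is available for checking.
y = rng.normal(2.0, 1.0, size=20)
n = y.size

# Variational family q(theta) = N(mu, sigma^2); DMVI uses a diffusion model
# here instead, which is far more flexible. We maximize a Monte Carlo
# estimate of the ELBO = E_q[log p(y, theta)] + entropy(q) with the
# reparameterization trick theta = mu + sigma * eps, eps ~ N(0, 1).
mu, log_sigma = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    eps = rng.normal(size=64)
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps
    # d/dtheta log p(y, theta) = -theta + sum_i (y_i - theta)
    dlogp = -theta + np.sum(y) - n * theta
    grad_mu = np.mean(dlogp)                       # chain rule: dtheta/dmu = 1
    grad_ls = np.mean(dlogp * eps * sigma) + 1.0   # +1 is the entropy gradient
    mu += lr * grad_mu
    log_sigma += lr * grad_ls

post_mean, post_sd = np.sum(y) / (n + 1), np.sqrt(1.0 / (n + 1))
print(mu, np.exp(log_sigma))        # converges to post_mean, post_sd
```

Because the toy model is conjugate, the fitted mean and standard deviation can be checked against the exact posterior.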
Simulation-based inference using surjective sequential neural likelihood estimation
We present Surjective Sequential Neural Likelihood (SSNL) estimation, a novel
method for simulation-based inference in models where the evaluation of the
likelihood function is not tractable and only a simulator that can generate
synthetic data is available. SSNL fits a dimensionality-reducing surjective
normalizing flow model and uses it as a surrogate likelihood function which
allows for conventional Bayesian inference using either Markov chain Monte
Carlo methods or variational inference. By embedding the data in a
low-dimensional space, SSNL solves several issues previous likelihood-based
methods had when applied to high-dimensional data sets that, for instance,
contain non-informative data dimensions or lie along a lower-dimensional
manifold. We evaluate SSNL on a wide variety of experiments and show that it
generally outperforms contemporary methods used in simulation-based inference,
for instance, on a challenging real-world example from astrophysics which
models the magnetic field strength of the sun using a solar dynamo model.
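The pipeline the abstract describes, embedding the data in a low-dimensional space and using a learned surrogate likelihood inside a standard sampler, can be sketched on a toy problem. Here PCA stands in for the surjective normalizing flow and a linear-Gaussian regression stands in for the conditional neural density estimator; none of this is the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

dim, n_sims = 10, 2000
u = np.zeros(dim); u[0] = 1.0       # only one informative direction

def simulator(theta):
    # high-dimensional synthetic data; dims 1..9 carry no signal
    return theta * u + rng.normal(0.0, 0.5, size=dim)

# 1) pilot simulations from the prior theta ~ N(0, 2^2)
thetas = rng.normal(0.0, 2.0, size=n_sims)
xs = np.stack([simulator(t) for t in thetas])

# 2) dimensionality reduction: first principal component (SSNL learns a
# surjective normalizing flow instead)
_, _, vt = np.linalg.svd(xs - xs.mean(axis=0), full_matrices=False)
proj = vt[0]
z = xs @ proj

# 3) surrogate likelihood z | theta ~ N(a*theta + b, s2), fit by least
# squares (in place of a conditional neural density estimator)
A = np.stack([thetas, np.ones(n_sims)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, z, rcond=None)
s2 = np.mean((z - (a * thetas + b)) ** 2)

def log_post(theta, z_obs):
    # surrogate log-likelihood plus N(0, 2^2) log-prior, up to constants
    return -0.5 * (z_obs - (a * theta + b)) ** 2 / s2 - 0.5 * theta**2 / 4.0

# 4) Metropolis-Hastings on theta against the surrogate posterior
theta_true = 1.5
z_obs = simulator(theta_true) @ proj
cur, cur_lp = 0.0, log_post(0.0, z_obs)
samples = []
for _ in range(20000):
    prop = cur + rng.normal(0.0, 0.5)
    prop_lp = log_post(prop, z_obs)
    if np.log(rng.uniform()) < prop_lp - cur_lp:
        cur, cur_lp = prop, prop_lp
    samples.append(cur)

post = np.array(samples[5000:])
print(post.mean(), post.std())
```

The chain concentrates near the true parameter even though the sampler never touches the ten-dimensional data directly, which is the point of the surrogate construction.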
PyBDA: a command line tool for automated analysis of big biological data sets
Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to a lack of accessible tools that scale to hundreds of millions of data points. We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs on a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely through simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
Structured hierarchical models for probabilistic inference from perturbation screening data
Genetic perturbation screening is an experimental method in biology for studying cause-and-effect relationships between biological entities. However, knocking out or knocking down genes is a highly error-prone process, which complicates estimation of the effect sizes of the interventions. Here, we introduce a family of generative models, called the structured hierarchical model (SHM), for probabilistic inference of causal effects from perturbation screens. SHMs utilize classical hierarchical models to represent heterogeneous data and combine them with categorical Markov random fields to encode biological prior information over functionally related biological entities. The random field induces a clustering of functionally related genes, which informs inference of the parameters of the hierarchical model. The SHM is designed for extremely noisy data sets for which the true data-generating process is difficult to model due to a lack of domain knowledge or high stochasticity of the interventions. We apply the SHM to a pan-cancer genetic perturbation screen to identify genes that restrict the growth of an entire group of cancer cell lines, and show that incorporating prior knowledge in the form of a graph improves inference of parameters.
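The hierarchical-pooling idea behind the SHM can be illustrated with a small simulation. In this sketch the categorical Markov random field over a biological graph is replaced by known cluster labels, so only the partial-pooling effect of the hierarchy on noisy screen data is shown; all sizes and variances are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy version of the SHM layering: cluster -> gene effect -> noisy replicates.
n_clusters, genes_per_cluster, n_reps = 3, 20, 4
cluster_means = rng.normal(0.0, 2.0, size=n_clusters)   # cluster-level effects
labels = np.repeat(np.arange(n_clusters), genes_per_cluster)
gene_effects = cluster_means[labels] + rng.normal(0.0, 0.3, size=labels.size)
# very noisy screen read-outs, as in perturbation data
data = gene_effects[:, None] + rng.normal(0.0, 2.0, size=(labels.size, n_reps))

# No pooling: per-gene sample means
naive = data.mean(axis=1)

# Hierarchical shrinkage toward the cluster mean (closed form for known
# variances: precision-weighted combination of gene mean and cluster mean)
tau2, sigma2 = 0.3**2, 2.0**2 / n_reps
cluster_hat = np.array([naive[labels == k].mean() for k in range(n_clusters)])
w = (1 / sigma2) / (1 / sigma2 + 1 / tau2)
pooled = w * naive + (1 - w) * cluster_hat[labels]

err_naive = np.mean((naive - gene_effects) ** 2)
err_pooled = np.mean((pooled - gene_effects) ** 2)
print(err_naive, err_pooled)   # pooling reduces the error on noisy data
```

The shrinkage weight here is the textbook normal-normal posterior weight; the SHM instead learns the clustering itself through the random field rather than assuming the labels.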
NetReg: Network-regularized linear models for biological association studies
Summary: Modelling biological associations or dependencies using linear regression is often complicated when the analyzed data sets are high-dimensional and fewer observations than variables are available (n ≪ p). For genomic data sets, penalized regression methods have been applied to address this issue. Recently proposed regression models utilize prior knowledge on dependencies, e.g. in the form of graphs, arguing that this information leads to more reliable estimates of regression coefficients. However, none of the proposed models for multivariate genomic response variables have been implemented as a computationally efficient, freely available library. In this paper we propose netReg, a package for graph-penalized regression models that use large networks and thousands of variables. netReg incorporates a priori generated biological graph information into linear models, yielding sparse or smooth solutions for regression coefficients.
Availability and implementation: netReg is implemented as both an R package and a C++ command line tool. The main computations are done in C++, where we use Armadillo for fast matrix calculations and Dlib for optimization. The R package is freely available on Bioconductor: https://bioconductor.org/packages/netReg. The command line tool can be installed from the Bioconda conda channel. Installation details, issue reports, development versions, documentation and tutorials for the R and C++ versions, as well as the R package vignette, can be found on GitHub: https://dirmeier.github.io/netReg/. The GitHub page also contains code for benchmarking and the example datasets used in this paper.
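The model class netReg targets can be sketched from scratch: a least-squares loss with a lasso penalty plus a graph-Laplacian smoothness penalty, solved here by proximal gradient descent. This illustrates the objective only, not the package's API; the chain graph and penalty weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Objective (a sketch of the graph-penalized model class):
#   min_b  ||y - X b||^2 / (2n) + lam1 * ||b||_1 + lam2 * b' L b
# where L is the Laplacian of a prior graph over the coefficients.
n, p = 100, 10

# made-up prior graph: a chain 0-1-...-9, encouraging neighbouring
# coefficients to take similar values
A = np.zeros((p, p))
for i in range(p - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

X = rng.normal(size=(n, p))
b_true = np.array([1.0, 1.0, 1.0, 1.0, 0, 0, 0, 0, 0, 0])
y = X @ b_true + rng.normal(0.0, 0.5, size=n)

# proximal gradient (ISTA): gradient step on the smooth part, then
# soft-thresholding for the l1 part
lam1, lam2 = 0.05, 0.05
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n + 2 * lam2 * L).max()
b = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (X @ b - y) / n + 2 * lam2 * (L @ b)
    b = b - step * grad
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam1, 0.0)

print(np.round(b, 2))   # roughly sparse-and-smooth recovery of b_true
```

The lasso term zeroes out the uninformative coefficients while the Laplacian term pulls graph neighbours toward each other, which is the "sparse or smooth solutions" trade-off the abstract refers to.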
netReg: network-regularized linear models for biological association studies
Dirmeier S, Fuchs C, Mueller NS, Theis FJ. netReg: network-regularized linear models for biological association studies. Bioinformatics. 2017;34(5):896-898.
Evaluating the Robustness of Deep Learning Models for Mobility Prediction Through Causal Interventions
Changes in the characteristics of mobility data can significantly influence the predictive performance of deep learning models. However, there is still a lack of understanding of the degree of these impacts and of the robustness of deep learning models to variability in these characteristics. This hinders the development of benchmark datasets for evaluating different mobility prediction models. In this study, we use a causal intervention approach to evaluate the robustness of deep learning models under different interventions on mobility data characteristics, using both traffic forecasting and individual next-location prediction as case studies.
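The intervention protocol can be sketched in a few lines: fit a predictor, intervene on one characteristic of the test data while keeping the structural equation fixed, and compare errors before and after. The linear model and features below are toy stand-ins for the paper's deep learning models and mobility datasets.

```python
import numpy as np

rng = np.random.default_rng(4)

# Structural equation for a toy "traffic flow" outcome; the intervention
# will change one characteristic (the spread of the speed feature) while
# keeping the equation itself fixed.
def generate(n, speed_sd):
    speed = rng.normal(30.0, speed_sd, size=n)   # e.g. average travel speed
    hour = rng.uniform(0.0, 24.0, size=n)
    flow = 2.0 * speed - 0.5 * hour + rng.normal(0.0, 3.0, size=n)
    X = np.stack([speed, hour, np.ones(n)], axis=1)
    return X, flow

# train a simple predictor on observational data
X_tr, y_tr = generate(1000, speed_sd=5.0)
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def mse(X, y):
    return np.mean((X @ w - y) ** 2)

# baseline test error vs. error after an intervention that doubles the
# spread of the speed feature
base = mse(*generate(1000, speed_sd=5.0))
intervened = mse(*generate(1000, speed_sd=10.0))

print(base, intervened)   # close values indicate robustness to this shift
```

A large gap between the two errors would flag sensitivity to that data characteristic, which is the kind of robustness signal the study extracts for its deep learning models.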