A normalization technique for next generation sequencing experiments
Next generation sequencing (NGS) is nowadays one of the key technologies in biology. Its cost effectiveness and capability of detecting even the smallest variations in the genome make it increasingly popular. For studies aiming at genome assembly, differences in read count statistics do not affect the outcome. However, these differences bias the outcome if the goal is to identify structural DNA characteristics such as copy number variations (CNVs). Thus, a normalization step must remove such random read count variations so that read counts from different experiments become comparable. In particular, after normalization the commonly used assumption of a Poisson read count distribution in windows along the chromosomes is better justified. Strong deviations of read counts from the estimated Poisson mean indicate CNVs.
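The normalization-then-deviation-test idea above can be sketched in a few lines. This is a minimal illustration, not the method from the abstract: the function names, the library-size scaling rule, and the threshold k are assumptions for the sketch.

```python
import numpy as np

def normalize_counts(counts):
    """Scale each experiment (row of windowed read counts) to the mean
    library size so that counts become comparable across experiments.
    A minimal sketch; the normalization in the abstract may differ."""
    lib_sizes = counts.sum(axis=1, keepdims=True)
    return counts * (lib_sizes.mean() / lib_sizes)

def poisson_outlier_windows(norm_counts, k=3.0):
    """Flag windows whose normalized counts deviate from the estimated
    Poisson mean by more than k standard deviations (sqrt of the mean,
    since a Poisson's variance equals its mean). k is an illustrative choice."""
    lam = norm_counts.mean(axis=0)   # per-window Poisson mean estimate
    dev = np.abs(norm_counts - lam) / np.sqrt(np.maximum(lam, 1e-9))
    return (dev > k).any(axis=0)     # candidate CNV windows
```

After normalization, a window whose count in one experiment sits far from the pooled Poisson mean becomes a CNV candidate.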
Principled Weight Initialisation for Input-Convex Neural Networks
Input-Convex Neural Networks (ICNNs) are networks that guarantee convexity in
their input-output mapping. These networks have been successfully applied for
energy-based modelling, optimal transport problems and learning invariances.
The convexity of ICNNs is achieved by using non-decreasing convex activation
functions and non-negative weights. Because of these peculiarities, previous
initialisation strategies, which implicitly assume centred weights, are not
effective for ICNNs. By studying signal propagation through layers with
non-negative weights, we are able to derive a principled weight initialisation
for ICNNs. Concretely, we generalise signal propagation theory by removing the
assumption that weights are sampled from a centred distribution. In a set of
experiments, we demonstrate that our principled initialisation effectively
accelerates learning in ICNNs and leads to better generalisation. Moreover, we
find that, in contrast to common belief, ICNNs can be trained without
skip-connections when initialised correctly. Finally, we apply ICNNs to a
real-world drug discovery task and show that they allow for more effective
molecular latent space exploration.
Comment: Presented at NeurIPS 202
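To illustrate why initialisation schemes that implicitly assume centred weights break down for non-negative weights, the following sketch compares a naive |He| initialisation with a mean-corrected rescaling. The rescaling rule (make fan_in times the mean weight equal to one) is an illustrative assumption for this sketch, not the principled derivation from the paper.

```python
import numpy as np

def forward(x, W, layers=5):
    """Stack of ICNN-style layers: non-negative weights, ReLU activation."""
    for _ in range(layers):
        x = np.maximum(x @ W, 0.0)
    return x

rng = np.random.default_rng(0)
fan_in = 256
x = np.abs(rng.normal(size=(100, fan_in)))  # positive inputs, as after a ReLU

# Naive: |He init|. These weights have a non-zero mean of about
# sqrt(4 / (pi * fan_in)), so the activation magnitude grows by a factor
# of roughly fan_in * E[w] per layer and the signal explodes.
W_naive = np.abs(rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_in)))

# Illustrative correction (an assumption, not the paper's derivation):
# rescale so that fan_in * E[w] = 1, keeping the activation mean stable.
W_scaled = W_naive / (fan_in * W_naive.mean())
```

With W_naive the mean activation grows by more than an order of magnitude per layer; with W_scaled it stays on the order of the input.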
Fréchet ChemNet Distance: A metric for generative models for molecules in drug discovery
The new wave of successful generative models in machine learning has
increased the interest in deep learning driven de novo drug design. However,
assessing the performance of such generative models is notoriously difficult.
Metrics that are typically used to assess the performance of such generative
models are the percentage of chemically valid molecules or the similarity to
real molecules in terms of particular descriptors, such as the partition
coefficient (logP) or druglikeness. However, method comparison is difficult
because of the inconsistent use of evaluation metrics, the necessity for
multiple metrics, and the fact that some of these measures can easily be
tricked by simple rule-based systems. We propose a novel distance measure
between two sets of molecules, called Fréchet ChemNet distance (FCD), that
can be used as an evaluation metric for generative models. The FCD is similar
to a recently established performance metric for comparing image generation
methods, the Fréchet Inception Distance (FID). Whereas the FID uses one of
the hidden layers of InceptionNet, the FCD utilizes the penultimate layer of a
deep neural network called ChemNet, which was trained to predict drug
activities. Thus, the FCD metric takes into account chemically and biologically
relevant information about molecules, and also measures the diversity of the
set via the distribution of generated molecules. The FCD's advantage over
previous metrics is that it can detect whether generated molecules are a)
diverse and have b) chemical and c) biological properties similar to those of
real molecules. We
further provide an easy-to-use implementation that only requires the SMILES
representation of the generated molecules as input to calculate the FCD.
Implementations are available at: https://www.github.com/bioinf-jku/FCD
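The distance underlying the FCD is the Fréchet distance between two Gaussians fitted to activation sets. The sketch below computes that formula on arbitrary feature matrices with numpy only; the actual FCD feeds ChemNet penultimate-layer activations into it, for which the reference implementation at the URL above should be used.

```python
import numpy as np

def frechet_distance(act1, act2):
    """Fréchet distance between Gaussians fitted to two activation sets
    (rows = molecules, columns = activation units):
    d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))"""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    s1 = np.cov(act1, rowvar=False)
    s2 = np.cov(act2, rowvar=False)
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the (real,
    # non-negative) eigenvalues of S1 @ S2, avoiding an explicit matrix sqrt
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu1 - mu2) ** 2).sum()
                 + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)
```

Two identical activation sets give distance zero; shifting one set by a constant vector adds exactly the squared norm of that shift.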
Identifying Copy Number Variations based on Next Generation Sequencing Data by a Mixture of Poisson Model
Next generation sequencing (NGS) technologies have profoundly impacted biological research and are becoming more and more popular due to their cost effectiveness and speed. NGS can be utilized to identify DNA structural variants, namely copy number variations (CNVs), which have shown associations with diseases such as HIV susceptibility, type II diabetes, or cancer.

First approaches to detect CNVs in NGS data exist; most of them call a CNV upon a significant difference in read counts between neighboring windows along the chromosome. However, these methods suffer from systematic variations of the underlying read count distributions along the chromosome due to biological and technical noise. In contrast to these global methods, we locally model the read count distribution characteristics by a mixture of Poissons, which allows us to incorporate a linear dependence between copy numbers and read counts. Model selection is performed in a Bayesian framework by maximizing the posterior through an EM algorithm. We define a CNV call as a deviation of the Poisson mixture parameters from the null hypothesis represented by the prior, which models a constant copy number across the samples. A CNV call therefore requires sufficient information in the data to push the model away from the null hypothesis given by the prior.
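The E- and M-steps of such a mixture fit can be illustrated with plain maximum-likelihood EM for a two-component Poisson mixture. This is only a sketch of the machinery: the model in the abstract additionally places a prior on the parameters (MAP-EM) and ties the component means linearly to integer copy numbers.

```python
import numpy as np

def em_poisson_mixture(x, n_iter=200):
    """Maximum-likelihood EM for a two-component Poisson mixture over
    read counts x. Returns mixing weights pi and component means lam."""
    x = np.asarray(x, dtype=float)
    lam = np.array([0.5, 1.5]) * x.mean()   # initial component means
    pi = np.array([0.5, 0.5])               # initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities from Poisson log-likelihoods
        # (the log k! term cancels across components and is omitted)
        logp = x[:, None] * np.log(lam) - lam + np.log(pi)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and Poisson means
        nk = r.sum(axis=0)
        pi = nk / len(x)
        lam = (r * x[:, None]).sum(axis=0) / nk
    return pi, lam
```

On counts drawn from two well-separated Poisson components, the recovered means converge close to the true rates.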

We test our approach on the HapMap cohort, where we rediscover previously identified CNVs, which validates our approach. On a tumor genome data set, we are able to considerably increase the detection rate while reducing false discoveries.

Quantification of Uncertainty with Adversarial Models
Quantifying uncertainty is important for actionable predictions in real-world
applications. A crucial part of predictive uncertainty quantification is the
estimation of epistemic uncertainty, which is defined as an integral of the
product between a divergence function and the posterior. Current methods such
as Deep Ensembles or MC dropout underperform at estimating the epistemic
uncertainty, since they primarily consider the posterior when sampling models.
We suggest Quantification of Uncertainty with Adversarial Models (QUAM) to
better estimate the epistemic uncertainty. QUAM identifies regions where the
whole product under the integral is large, not just the posterior.
Consequently, QUAM has lower approximation error of the epistemic uncertainty
compared to previous methods. Models for which the product is large correspond
to adversarial models (not adversarial examples!). Adversarial models have both
a high posterior as well as a high divergence between their predictions and
that of a reference model. Our experiments show that QUAM excels in capturing
epistemic uncertainty for deep learning models and outperforms previous methods
on challenging tasks in the vision domain.
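The quantity being estimated can be illustrated with a simple posterior-sampling (ensemble-style) approximation, where the divergence is the KL between each sampled model's class-probability vector and the mean prediction. This shows only the integral being approximated, not QUAM's adversarial search for models where posterior times divergence is jointly large; the function name and shapes are assumptions for the sketch.

```python
import numpy as np

def epistemic_uncertainty(member_probs):
    """Mean KL divergence between each sampled model's predictive
    distribution (rows of member_probs) and the mean prediction, i.e. the
    mutual-information form of epistemic uncertainty. Deep Ensembles and
    MC dropout estimate this by sampling models from the posterior only."""
    p_bar = member_probs.mean(axis=0)   # reference (mean) prediction
    kl = (member_probs * np.log(member_probs / p_bar)).sum(axis=-1)
    return kl.mean(axis=0)              # average divergence over models
```

When all sampled models agree the estimate is zero; disagreement between models yields positive epistemic uncertainty.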
VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification
Being able to identify regions within or around proteins, to which ligands
can potentially bind, is an essential step to develop new drugs. Binding site
identification methods can now profit from the availability of large amounts of
3D structures in protein structure databases or from AlphaFold predictions.
Current binding site identification methods heavily rely on graph neural
networks (GNNs), usually designed to output E(3)-equivariant predictions. Such
methods turned out to be very beneficial for physics-related tasks like binding
energy or motion trajectory prediction. However, the performance of GNNs at
binding site identification is still limited potentially due to the lack of
dedicated nodes that model hidden geometric entities, such as binding pockets.
In this work, we extend E(n)-Equivariant Graph Neural Networks (EGNNs) by
adding virtual nodes and applying an extended message passing scheme. The
virtual nodes in these graphs are dedicated quantities to learn representations
of binding sites, which leads to improved predictive performance. In our
experiments, we show that our proposed method VN-EGNN sets a new
state-of-the-art at locating binding site centers on COACH420, HOLO4K and
PDBbind2020.
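A minimal numpy sketch of one E(n)-equivariant message-passing step with a virtual node placed at the centroid (an equivariant choice) is given below. The fixed random linear maps stand in for learned MLPs, and the single shared virtual node is a simplification of VN-EGNN's extended message-passing scheme; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension
# fixed random linear maps standing in for the learned MLPs of an EGNN
W_e = 0.1 * rng.normal(size=(2 * d + 1, d))   # edge/message map
W_h = 0.1 * rng.normal(size=(2 * d, d))       # node-feature map
w_x = 0.1 * rng.normal(size=(d, 1))           # coordinate-weight map

def egnn_layer(h, x):
    """One E(n)-equivariant message-passing step on a fully connected graph:
    messages depend only on invariants (features, squared distances), and
    coordinates are updated by weighted sums of difference vectors."""
    n, dh = h.shape
    diff = x[:, None, :] - x[None, :, :]                  # (n, n, 3)
    d2 = (diff ** 2).sum(axis=-1, keepdims=True)          # (n, n, 1)
    pair = np.concatenate([np.broadcast_to(h[:, None, :], (n, n, dh)),
                           np.broadcast_to(h[None, :, :], (n, n, dh)),
                           d2], axis=-1)
    m = np.tanh(pair @ W_e) * (1.0 - np.eye(n))[..., None]  # mask self-messages
    x_new = x + (diff * (m @ w_x)).sum(axis=1) / (n - 1)
    h_new = h + np.tanh(np.concatenate([h, m.sum(axis=1)], axis=-1) @ W_h)
    return h_new, x_new

def vn_egnn_layer(h, x):
    """Append a virtual node at the centroid (itself an E(3)-equivariant
    position), run the layer, and read the virtual node's representation
    back out as a candidate binding-site summary."""
    h_aug = np.vstack([h, np.zeros((1, h.shape[1]))])
    x_aug = np.vstack([x, x.mean(axis=0, keepdims=True)])
    h_new, x_new = egnn_layer(h_aug, x_aug)
    return h_new[:-1], x_new[:-1], h_new[-1]
```

Rotating or translating the input coordinates rotates or translates the output coordinates accordingly, while features and the virtual-node representation are invariant.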