A view of Estimation of Distribution Algorithms through the lens of Expectation-Maximization
We show that a large class of Estimation of Distribution Algorithms,
including, but not limited to, Covariance Matrix Adaptation, can be written
as a Monte Carlo Expectation-Maximization algorithm, and as exact EM in the
limit of infinite samples. Because EM sits on a rigorous statistical
foundation and has been thoroughly analyzed, this connection provides a new,
coherent framework with which to reason about EDAs.
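To make the correspondence concrete, here is a minimal Python sketch (our
illustration, not the paper's exact construction) of one Gaussian EDA step
written as Monte Carlo EM: the E-step weights sampled candidates by fitness,
and the M-step refits the search distribution by weighted maximum
likelihood. The softmax weighting and temperature are assumed choices.

    import numpy as np

    def gaussian_eda_step(mean, cov, fitness, n_samples=200, temperature=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        x = rng.multivariate_normal(mean, cov, size=n_samples)  # sample a population
        f = np.array([fitness(xi) for xi in x])
        w = np.exp((f - f.max()) / temperature)                 # E-step: fitness-based weights
        w /= w.sum()
        new_mean = w @ x                                        # M-step: weighted MLE refit
        centered = x - new_mean
        new_cov = (w[:, None] * centered).T @ centered + 1e-6 * np.eye(len(mean))
        return new_mean, new_cov

    # Usage: iterate the EM-style update to maximize a toy objective.
    mean, cov = np.zeros(2), np.eye(2)
    for _ in range(50):
        mean, cov = gaussian_eda_step(mean, cov, lambda z: -np.sum((z - 3.0) ** 2))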
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution.
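As a hedged illustration of the loop the review describes, the sketch below
trains a generic surrogate sequence-function model on all measured variants
and uses it to select sequences predicted to be improved for the next round.
The one-hot encoding, single-site mutation proposer, and choice of a random
forest as the surrogate are our assumptions, not recommendations from the
review itself.

    import random
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def one_hot(seq):
        # Flattened one-hot encoding of an amino-acid sequence.
        x = np.zeros((len(seq), len(AMINO_ACIDS)))
        for i, aa in enumerate(seq):
            x[i, AMINO_ACIDS.index(aa)] = 1.0
        return x.ravel()

    def propose_mutants(parent, n):
        # Random single-site mutants of a parent sequence (a simple proposer).
        out = []
        for _ in range(n):
            s = list(parent)
            s[random.randrange(len(s))] = random.choice(AMINO_ACIDS)
            out.append("".join(s))
        return out

    def ml_guided_round(measured_seqs, measured_fitness, parent, n_candidates=1000, batch=10):
        # Learn from information in all measured variants ...
        model = RandomForestRegressor(n_estimators=200)
        model.fit(np.stack([one_hot(s) for s in measured_seqs]), measured_fitness)
        # ... then select the sequences predicted most likely to be improved.
        candidates = propose_mutants(parent, n_candidates)
        preds = model.predict(np.stack([one_hot(s) for s in candidates]))
        order = np.argsort(preds)[::-1]
        return [candidates[i] for i in order[:batch]]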
Fast differentiable DNA and protein sequence optimization for molecular design
Designing DNA and protein sequences with improved function has the potential
to greatly accelerate synthetic biology. Machine learning models that
accurately predict biological fitness from sequence are becoming a powerful
tool for molecular design. Activation maximization offers a simple design
strategy for differentiable models: one-hot coded sequences are first
approximated by a continuous representation which is then iteratively optimized
with respect to the predictor oracle by gradient ascent. While elegant, this
method suffers from vanishing gradients and may cause predictor pathologies
leading to poor convergence. Here, we build on a previously proposed
straight-through approximation method to optimize through discrete sequence
samples. By normalizing nucleotide logits across positions and introducing an
adaptive entropy variable, we remove bottlenecks arising from overly large or
skewed sampling parameters. The resulting algorithm, which we call Fast
SeqProp, achieves up to 100-fold faster convergence compared to previous
versions of activation maximization and finds improved fitness optima for many
applications. We demonstrate Fast SeqProp by designing DNA and protein
sequences for six deep learning predictors, including a protein structure
predictor.
Comment: All code available at http://www.github.com/johli/seqprop; Moved
example sequences from Suppl to new Figure 2, Added new benchmark comparison
to Section 4.3, Moved some technical comparisons to Suppl, Added new Methods
section.
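The authors' implementation lives at the repository linked above; as a loose
PyTorch sketch of the ingredients named in the abstract, the following
optimizes through discrete sequence samples with a straight-through
estimator, normalizes logits across positions, and uses a learnable scale
standing in for the adaptive entropy variable. The exact normalization,
entropy parameterization, and the `predictor` interface are assumptions, not
Fast SeqProp itself.

    import torch

    def straight_through_sample(logits):
        # Forward pass: a discrete one-hot sample; backward pass: the softmax
        # gradient (the straight-through estimator).
        probs = torch.softmax(logits, dim=-1)
        idx = torch.multinomial(probs, 1).squeeze(-1)
        one_hot = torch.nn.functional.one_hot(idx, logits.shape[-1]).float()
        return one_hot + probs - probs.detach()

    def optimize_sequence(predictor, length, n_channels=4, steps=1000, lr=0.1):
        logits = torch.zeros(length, n_channels, requires_grad=True)
        log_scale = torch.zeros(1, requires_grad=True)  # stand-in for the adaptive entropy variable
        opt = torch.optim.Adam([logits, log_scale], lr=lr)
        for _ in range(steps):
            # Normalize logits across positions so no position's sampling
            # parameters become overly large or skewed.
            norm = (logits - logits.mean(dim=0)) / (logits.std(dim=0) + 1e-6)
            x = straight_through_sample(norm * torch.exp(log_scale))
            loss = -predictor(x.unsqueeze(0)).mean()    # ascend the oracle's prediction
            opt.zero_grad()
            loss.backward()
            opt.step()
        return logits.argmax(dim=-1)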
Conservative objective models are a special kind of contrastive divergence-based energy model
In this work we theoretically show that conservative objective models (COMs)
for offline model-based optimisation (MBO) are a special kind of contrastive
divergence-based energy model, one where the energy function represents both
the unconditional probability of the input and the conditional probability of
the reward variable. While the initial formulation only samples modes from its
learned distribution, we propose a simple fix that replaces its gradient ascent
sampler with a Langevin MCMC sampler. This gives rise to a special
probabilistic model where the probability of sampling an input is proportional
to its predicted reward. Lastly, we show that better samples can be obtained if
the model is decoupled so that the unconditional and conditional probabilities
are modelled separately
Causal Graphs Underlying Generative Models: Path to Learning with Limited Data
Training generative models that capture rich semantics of the data and
interpreting the latent representations encoded by such models are very
important problems in unsupervised learning. In this work, we provide a simple
algorithm that relies on perturbation experiments on latent codes of a
pre-trained generative autoencoder to uncover a causal graph that is implied by
the generative model. We leverage pre-trained attribute classifiers and perform
perturbation experiments to check for influence of a given latent variable on a
subset of attributes. Given this, we show that one can fit an effective causal
graph that models a structural equation model between latent codes taken as
exogenous variables and attributes taken as observed variables. One
interesting aspect is that a single latent variable controls multiple
overlapping subsets of attributes, unlike conventional approaches that try
to impose full independence. Using a pre-trained RNN-based generative
autoencoder trained on a dataset of peptide sequences, we demonstrate that
the causal graph our algorithm learns between attributes and latent codes
can be used to predict a specific property for unseen sequences. We compare
prediction models trained on either all available attributes or only those
in the Markov blanket, and empirically show that in both the unsupervised
and supervised regimes the predictor that relies on Markov-blanket
attributes typically generalizes better for out-of-distribution sequences.
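A hedged sketch of the perturbation procedure described above: shift each
latent coordinate of a pretrained autoencoder, decode, and record which
attribute classifiers change appreciably, yielding a candidate
latent-to-attribute edge set. The `decoder` and `attribute_clfs` interfaces
and the `delta`/`threshold` values are assumptions, not the paper's API.

    import numpy as np

    def latent_attribute_edges(decoder, attribute_clfs, z_samples, delta=1.0, threshold=0.1):
        # Baseline attribute scores for unperturbed decodings.
        base = {name: clf(decoder(z_samples)) for name, clf in attribute_clfs.items()}
        edges = {}
        for j in range(z_samples.shape[1]):
            z_pert = z_samples.copy()
            z_pert[:, j] += delta                      # perturbation experiment on latent j
            influenced = set()
            for name, clf in attribute_clfs.items():
                # Mark an edge when perturbing latent j shifts this attribute.
                if np.abs(clf(decoder(z_pert)) - base[name]).mean() > threshold:
                    influenced.add(name)
            edges[j] = influenced
        return edges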