EigenFold: Generative Protein Structure Prediction with Diffusion Models
Protein structure prediction has reached revolutionary levels of accuracy on
single structures, yet distributional modeling paradigms are needed to capture
the conformational ensembles and flexibility that underlie biological function.
Towards this goal, we develop EigenFold, a diffusion generative modeling
framework for sampling a distribution of structures from a given protein
sequence. We define a diffusion process that models the structure as a system
of harmonic oscillators and which naturally induces a cascading-resolution
generative process along the eigenmodes of the system. On recent CAMEO targets,
EigenFold achieves a median TMScore of 0.84 while, relative to existing
methods, providing a more comprehensive picture of model uncertainty via the
ensemble of sampled structures. We then assess EigenFold's ability to
model and predict conformational heterogeneity for fold-switching proteins and
ligand-induced conformational change. Code is available at
https://github.com/bjing2016/EigenFold.
Comment: ICLR MLDD workshop 2023
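The harmonic-oscillator diffusion can be illustrated with a small numerical sketch. Everything below is illustrative rather than EigenFold's actual implementation: the chain Laplacian, the Ornstein-Uhlenbeck rates, and the name `harmonic_forward_diffusion` are assumptions. The point it shows is that each eigenmode evolves independently, with high-frequency modes (large eigenvalues) equilibrating fastest, which is what induces a cascading-resolution process when the diffusion is reversed.

```python
import numpy as np

def harmonic_forward_diffusion(x, L, t, rng):
    # Eigendecompose the harmonic coupling matrix (here a chain-graph Laplacian).
    lam, Q = np.linalg.eigh(L)
    lam = np.maximum(lam, 1e-8)            # regularize the zero (translation) mode
    x_modes = Q.T @ x                      # project coordinates onto eigenmodes
    # Each mode follows an independent Ornstein-Uhlenbeck process:
    # its mean decays as exp(-lam*t), so fine (large-lam) modes noise out first.
    mean = np.exp(-lam * t)[:, None] * x_modes
    var = (1.0 - np.exp(-2.0 * lam * t)) / lam
    noisy = mean + np.sqrt(var)[:, None] * rng.standard_normal(x_modes.shape)
    return Q @ noisy                       # back to Cartesian coordinates

# Toy example: a 10-residue chain with unit springs between neighbors.
n = 10
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
x0 = np.cumsum(np.ones((n, 3)), axis=0)    # extended-chain coordinates
rng = np.random.default_rng(0)
xt = harmonic_forward_diffusion(x0, L, t=0.5, rng=rng)
```

In the reverse (generative) direction the ordering flips: coarse global modes are resolved first and fine structural detail last.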
Towards Joint Sequence-Structure Generation of Nucleic Acid and Protein Complexes with SE(3)-Discrete Diffusion
Generative models of macromolecules carry abundant and impactful implications
for industrial and biomedical efforts in protein engineering. However, existing
methods are currently limited to modeling protein structures or sequences,
independently or jointly, without regard to the interactions that commonly
occur between proteins and other macromolecules. In this work, we introduce
MMDiff, a generative model that jointly designs sequences and structures of
nucleic acid and protein complexes, independently or in complex, using joint
SE(3)-discrete diffusion noise. Such a model has important implications for
emerging areas of macromolecular design including structure-based transcription
factor design and design of noncoding RNA sequences. We demonstrate the utility
of MMDiff through a rigorous new design benchmark for macromolecular complex
generation that we introduce in this work. Our results demonstrate that MMDiff
is able to successfully generate micro-RNA and single-stranded DNA molecules
while being modestly capable of jointly modeling DNA and RNA molecules in
interaction with multi-chain protein complexes. Source code:
https://github.com/Profluent-Internships/MMDiff.
Comment: 15 pages, 11 figures, presented at the NeurIPS 2023 Machine Learning
in Structural Biology (MLSB) workshop. Code available at
https://github.com/Profluent-Internships/MMDiff
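The discrete half of such a joint diffusion can be sketched independently of the SE(3) part. The snippet below is a generic absorbing-state corruption of a mixed protein/nucleic-acid token sequence; the joint vocabulary, mask token, and linear schedule are illustrative assumptions, not MMDiff's actual noise process.

```python
import numpy as np

# Hypothetical joint vocabulary: 20 amino acids + 4 nucleotides + a mask token.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY") + list("acgu") + ["#"]
MASK = len(VOCAB) - 1

def corrupt_sequence(tokens, t, rng):
    """Absorbing-state discrete forward diffusion: each position independently
    jumps to the mask token with probability t (t in [0, 1])."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < t
    return np.where(mask, MASK, tokens)

# Toy complex: a protein fragment and an RNA fragment in one token array.
rng = np.random.default_rng(1)
protein = [VOCAB.index(c) for c in "MKTAYIAK"]
rna = [VOCAB.index(c) for c in "acgugcua"]
seq = np.array(protein + rna)
half_corrupted = corrupt_sequence(seq, t=0.5, rng=rng)
fully_corrupted = corrupt_sequence(seq, t=1.0, rng=rng)
```

A reverse model would then denoise sequence tokens jointly with SE(3) frames; only the forward corruption is sketched here.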
Growing ecosystem of deep learning methods for modeling protein–protein interactions
Numerous cellular functions rely on protein–protein interactions. Efforts to
comprehensively characterize them remain challenged, however, by the diversity
of molecular recognition mechanisms employed within
the proteome. Deep learning has emerged as a promising approach for tackling
this problem by exploiting both experimental data and basic biophysical
knowledge about protein interactions. Here, we review the growing ecosystem of
deep learning methods for modeling protein interactions, highlighting the
diversity of these biophysically-informed models and their respective
trade-offs. We discuss recent successes in using representation learning to
capture complex features pertinent to predicting protein interactions and
interaction sites, geometric deep learning to reason over protein structures
and predict complex structures, and generative modeling to design de novo
protein assemblies. We also outline some of the outstanding challenges and
promising new directions. Opportunities abound to discover novel interactions,
elucidate their physical mechanisms, and engineer binders to modulate their
functions using deep learning and, ultimately, unravel how protein interactions
orchestrate complex cellular behaviors.
Comment: 19 pages, added model names to discussion
Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure
Diffusion generative models have emerged as a powerful framework for
addressing problems in structural biology and structure-based drug design.
These models operate directly on 3D molecular structures. Due to the
unfavorable scaling of graph neural networks (GNNs) with graph size as well as
the relatively slow inference speeds inherent to diffusion models, many
existing molecular diffusion models rely on coarse-grained representations of
protein structure to make training and inference feasible. However, such
coarse-grained representations discard essential information for modeling
molecular interactions and impair the quality of generated structures. In this
work, we present a novel GNN-based architecture for learning latent
representations of molecular structure. When trained end-to-end with a
diffusion model for de novo ligand design, our model achieves comparable
performance to one with an all-atom protein representation while exhibiting a
3-fold reduction in inference time.
Comment: This paper appeared as a spotlight paper at the NeurIPS 2023
Generative AI and Biology Workshop
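The coarse-graining idea behind the latent representation can be made concrete: collapse an all-atom graph into one latent node per residue. The paper's encoder is a learned GNN trained end-to-end with the diffusion model; the mean-pooling below is only a stand-in with a similar interface, and `pool_atoms_to_residues` and its arguments are hypothetical names.

```python
import numpy as np

def pool_atoms_to_residues(atom_feats, atom_pos, residue_index, n_res):
    """Coarse-grain an all-atom structure into one latent node per residue by
    mean-pooling atom features and positions (a trivial stand-in for a learned
    GNN encoder that would produce richer latent node embeddings)."""
    d = atom_feats.shape[1]
    feats = np.zeros((n_res, d))
    pos = np.zeros((n_res, 3))
    counts = np.zeros(n_res)
    for f, p, r in zip(atom_feats, atom_pos, residue_index):
        feats[r] += f
        pos[r] += p
        counts[r] += 1
    counts = np.maximum(counts, 1.0)
    return feats / counts[:, None], pos / counts[:, None]

# Toy usage: three atoms belonging to two residues, two features per atom.
atom_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
atom_pos = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
res_feats, res_pos = pool_atoms_to_residues(atom_feats, atom_pos, [0, 0, 1], n_res=2)
```

The downstream diffusion model then operates on the much smaller residue-level graph, which is where the inference-time savings come from.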
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution
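The core loop — fit a sequence-function model on all measured variants, then select the candidates predicted to be improved for the next round — can be sketched with one-hot features and ridge regression. The encoding, the closed-form ridge model, and the toy fitness values below are illustrative choices, not a prescription from the review.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seqs):
    """One-hot encode equal-length protein sequences, one block per position."""
    idx = {a: i for i, a in enumerate(ALPHABET)}
    X = np.zeros((len(seqs), len(seqs[0]) * len(ALPHABET)))
    for n, s in enumerate(seqs):
        for j, a in enumerate(s):
            X[n, j * len(ALPHABET) + idx[a]] = 1.0
    return X

def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression: a simple sequence-function model.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def propose_round(measured_seqs, measured_y, candidates, k=2):
    """Fit on all measured variants, then pick the k candidates with the
    highest predicted function for the next round of experiments."""
    w = fit_ridge(one_hot(measured_seqs), np.asarray(measured_y))
    preds = one_hot(candidates) @ w
    order = np.argsort(preds)[::-1]
    return [candidates[i] for i in order[:k]]

# Toy usage: 4-residue variants with made-up fitness measurements.
measured = ["ACDE", "ACDF", "GCDE", "ACHE"]
fitness = [1.0, 1.4, 0.6, 1.1]
pool = ["ACDF", "GCHE", "ACHF", "GCDF"]
next_round = propose_round(measured, fitness, pool, k=2)
```

Each experimental round appends the new measurements and refits, so the model learns from every variant measured so far.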
A generative model for protein contact networks
In this paper we present a generative model for protein contact networks. The
soundness of the proposed model is investigated by focusing primarily on
mesoscopic properties elaborated from the spectra of the graph Laplacian. To
complement the analysis, we study also classical topological descriptors, such
as statistics of the shortest paths and the important feature of modularity.
Our experiments show that the proposed model results in a considerable
improvement with respect to two suitably chosen generative mechanisms,
mimicking with better approximation real protein contact networks in terms of
diffusion properties elaborated from the Laplacian spectra. However, like the
other models considered, it does not reproduce the shortest-path structure with
sufficient accuracy. To compensate for this drawback, we designed a second step
involving a targeted edge reconfiguration process. The ensemble of reconfigured
networks shows statistically significant improvements.
As a byproduct of our study, we demonstrate that modularity, a well-known
property of proteins, does not entirely explain the actual network architecture
characterizing protein contact networks. In fact, we conclude that modularity,
intended as a quantification of an underlying community structure, should be
considered as an emergent property of the structural organization of proteins.
Interestingly, such a property is suitably optimized in protein contact
networks together with the feature of path efficiency.
Comment: 18 pages, 67 references
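The diffusion-based descriptors elaborated from the Laplacian spectrum can be computed directly from a contact map. A minimal sketch, assuming the unnormalized Laplacian L = D - A and using the heat-kernel trace as one example of a mesoscopic descriptor that could be compared between real and generated networks:

```python
import numpy as np

def laplacian_spectrum(A):
    """Eigenvalues of the unnormalized graph Laplacian L = D - A,
    in ascending order, for a symmetric adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigvalsh(L)

def heat_trace(spectrum, t):
    """Heat-kernel trace sum_i exp(-t * lambda_i): a diffusion-based
    mesoscopic descriptor derived from the Laplacian spectrum."""
    return np.exp(-t * spectrum).sum()

# Toy contact network: a ring of 6 residues plus one long-range contact.
A = np.zeros((6, 6))
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
A[0, 3] = A[3, 0] = 1.0
spec = laplacian_spectrum(A)
```

Comparing such spectral descriptors between generated and real contact networks is the kind of soundness check the abstract describes; shortest-path statistics and modularity would be computed separately.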
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Statistical models for families of evolutionary related proteins have
recently gained interest: in particular pairwise Potts models, as those
inferred by the Direct-Coupling Analysis, have been able to extract information
about the three-dimensional structure of folded proteins, and about the effect
of amino-acid substitutions in proteins. These models are typically required
to reproduce the one- and two-point statistics of the amino-acid usage in a
protein family, {\em i.e.}~to capture the so-called residue conservation and
covariation statistics of proteins of common evolutionary origin. Pairwise
Potts models are the maximum-entropy models achieving this. While being
successful, these models depend on a huge number of parameters introduced {\em
ad hoc}, which have to be estimated from a finite amount of data and whose
biophysical interpretation remains unclear. Here we propose an approach to
parameter reduction, which is based on selecting collective sequence motifs. It
naturally leads to the formulation of statistical sequence models in terms of
Hopfield-Potts models. These models can be accurately inferred using a mapping
to restricted Boltzmann machines and persistent contrastive divergence. We show
that, when applied to protein data, as few as 20-40 patterns are sufficient to
obtain statistically close-to-generative models. The Hopfield patterns form
interpretable sequence motifs and may be used to cluster amino-acid
sequences into functional sub-families. However, the distributed collective
nature of these motifs intrinsically limits the ability of Hopfield-Potts
models in predicting contact maps, showing the necessity of developing models
going beyond the Hopfield-Potts models discussed here.
Comment: 26 pages, 16 figures, to appear in PR
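The parameter reduction can be made concrete: instead of the O(N^2 q^2) pairwise couplings of a full Potts model, a Hopfield-Potts model scores a sequence with single-site fields plus a small number of patterns. The sketch below uses random fields and attractive patterns purely for illustration; in practice these would be inferred via the mapping to restricted Boltzmann machines and persistent contrastive divergence.

```python
import numpy as np

q = 21          # amino-acid alphabet size (20 residues + gap)
N = 5           # sequence length
P = 3           # number of Hopfield patterns (the paper finds 20-40 suffice)

rng = np.random.default_rng(2)
h = rng.normal(size=(N, q))          # single-site fields h_i(a)
xi = rng.normal(size=(P, N, q))      # Hopfield patterns xi^mu_i(a)

def energy(seq):
    """Hopfield-Potts energy (attractive patterns only, for illustration):
    fields plus a low-rank pattern term, replacing full pairwise couplings
    with P x N x q pattern parameters."""
    field_term = sum(h[i, a] for i, a in enumerate(seq))
    overlaps = np.array([sum(xi[mu, i, a] for i, a in enumerate(seq))
                         for mu in range(P)])
    return -field_term - (overlaps ** 2).sum() / (2 * N)

seq = rng.integers(0, q, size=N)
E = energy(seq)
```

Because each pattern contributes through its overlap with the whole sequence, the motifs are collective and interpretable, which is also why, as the abstract notes, they localize contacts poorly.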