4 research outputs found
PDBench: Evaluating Computational Methods for Protein Sequence Design
Proteins perform critical processes in all living systems: converting solar
energy into chemical energy, replicating DNA, forming the basis of highly
performant materials, sensing and much more. While an incredible range of functionality
has been sampled in nature, it accounts for a tiny fraction of the possible
protein universe. If we could tap into this pool of unexplored protein
structures, we could search for novel proteins with useful properties that we
could apply to tackle the environmental and medical challenges facing humanity.
This is the purpose of protein design.
Sequence design is an important aspect of protein design, and many successful
methods to do this have been developed. Recently, deep-learning methods that
frame it as a classification problem have emerged as a powerful approach.
Beyond their reported improvement in performance, their primary advantage over
physics-based methods is that the computational burden is shifted from the user
to the developers, thereby increasing accessibility to the design method.
Despite this trend, the tools for assessment and comparison of such models
remain quite generic. The goal of this paper is both to address the timely
problem of evaluation and to shine a spotlight, within the Machine Learning
community, on specific assessment criteria that will accelerate impact.
We present a carefully curated benchmark set of proteins and propose a number
of standard tests to assess the performance of deep-learning-based methods. Our
robust benchmark provides biological insight into the behaviour of design
methods, which is essential for evaluating their performance and utility. We
compare five existing models with two novel models for sequence prediction.
Finally, we test the designs produced by these models with AlphaFold2, a
state-of-the-art structure-prediction algorithm, to determine if they are
likely to fold into the intended 3D shapes.
Comment: 9 pages, 5 figures
CC+: A Searchable Database of Validated Coiled Coils in PDB Structures and AlphaFold2 Models
α-Helical coiled coils are common tertiary and quaternary elements of protein structure. In coiled coils, two or more α helices wrap around each other to form bundles. This apparently simple structural motif can generate many architectures and topologies. Coiled coil-forming sequences can be predicted from heptad repeats of hydrophobic and polar residues, hpphppp, although this is not always reliable. Alternatively, coiled-coil structures can be identified using the program SOCKET, which finds knobs-into-holes (KIH) packing between side chains of neighboring helices. SOCKET also classifies coiled-coil architecture and topology, thus allowing sequence-to-structure relationships to be garnered. In 2009, we used SOCKET to create a relational database of coiled-coil structures, CC+, from the RCSB Protein Data Bank (PDB). Here, we report an update of CC+ following an update of SOCKET (to Socket2) and the recent explosion of structural data and the success of AlphaFold2 in predicting protein structures from genome sequences. With the most-stringent SOCKET parameters, CC+ contains ≈12,000 coiled-coil assemblies from experimentally determined structures, and ≈120,000 potential coiled-coil structures within single-chain models predicted by AlphaFold2 across 48 proteomes. CC+ allows these and other less-stringently defined coiled coils to be searched at various levels of structure, sequence, and side-chain interactions. The identified coiled coils can be viewed directly from CC+ using the Socket2 application, and their associated data can be downloaded for further analyses. CC+ is available freely at http://coiledcoils.chm.bris.ac.uk/CCPlus/Home.html. It will be updated automatically. We envisage that CC+ could be used to understand coiled-coil assemblies and their sequence-to-structure relationships, and to aid protein design and engineering.
Rationally seeded computational protein design of α-helical barrels
Computational protein design is advancing rapidly. Here we describe efficient routes starting from validated parallel and antiparallel peptide assemblies to design two families of α-helical barrel proteins with central channels that bind small molecules. Computational designs are seeded by the sequences and structures of defined de novo oligomeric barrel-forming peptides, and adjacent helices are connected by loop building. For targets with antiparallel helices, short loops are sufficient. However, targets with parallel helices require longer connectors; namely, an outer layer of helix-turn-helix-turn-helix motifs that are packed onto the barrels. Throughout these computational pipelines, residues that define open states of the barrels are maintained. This minimizes sequence sampling, accelerating the design process. For each of six targets, just two to six synthetic genes are made for expression in Escherichia coli. On average, 70% of these genes express to give soluble monomeric proteins that are fully characterized, including high-resolution structures for most targets that match the design models with high accuracy.