5,494 research outputs found

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Get PDF
    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn Asp, Phe Tyr, Lys Arg, Gln Glu, Ile Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr, as structurally idiosyncratic residues. We also found a high average correlation (\overline{R} R = 0.85) between thirty amino acid mutability scales and the mutational inertia (I X ), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally-efficient principles: (a) amino acids interchangeability privileges their secondary structural similarity, and (b) the amino acid mutability depends directly on its biosynthetic energy cost, and inversely with its frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins

    Full text link
    We report a flexible language-model based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural proteins, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform 8 distinct tasks, with available datasets it can be extended to solve additional problems. In a broader sense, this work illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level utf8 characters that define the nature of the physical system at hand) to complex output. This materiomic scheme captures complex emergent relationships between universal building block and resulting properties via a synergizing learning capacity to express a set of potentialities embedded in the knowledge used in training, via the interplay of universality and diversity

    Highly Accurate Fragment Library for Protein Fold Recognition

    Get PDF
    Proteins play a crucial role in living organisms as they perform many vital tasks in every living cell. Knowledge of protein folding has a deep impact on understanding the heterogeneity and molecular functions of proteins. Such information leads to crucial advances in drug design and disease understanding. Fold recognition is a key step in the protein structure discovery process, especially when traditional computational methods fail to yield convincing structural homologies. In this work, we present a new protein fold recognition approach using machine learning and data mining methodologies. First, we identify a protein structural fragment library (Frag-K) composed of a set of backbone fragments ranging from 4 to 20 residues as the structural “keywords” that can effectively distinguish between major protein folds. We firstly apply randomized spectral clustering and random forest algorithms to construct representative and sensitive protein fragment libraries from a large-scale of high-quality, non-homologous protein structures available in PDB. We analyze the impacts of clustering cut-offs on the performance of the fragment libraries. Then, the Frag-K fragments are employed as structural features to classify protein structures in major protein folds defined by SCOP (Structural Classification of Proteins). Our results show that a structural dictionary with ~400 4- to 20-residue Frag-K fragments is capable of classifying major SCOP folds with high accuracy. Then, based on Frag-k, we design a novel deep learning architecture, so-called DeepFrag-k, which identifies fold discriminative features to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multimodal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and then the second stage uses a deep convolution neural network (CNN) to classify the fragment vectors into the corresponding folds. Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition

    DISCRETIZED GEOMETRIC APPROACHES TO THE ANALYSIS OF PROTEIN STRUCTURES

    Get PDF
    Proteins play crucial roles in a variety of biological processes. While we know that their amino acid sequence determines their structure, which in turn determines their function, we do not know why particular sequences fold into particular structures. My work focuses on discretized geometric descriptions of protein structure—conceptualizing native structure space as composed of mostly discrete, geometrically defined fragments—to better understand the patterns underlying why particular sequence elements correspond to particular structure elements. This discretized geometric approach is applied to multiple levels of protein structure, from conceptualizing contacts between residues as interactions between discrete structural elements to treating protein structures as an assembly of discrete fragments. My earlier work focused on better understanding inter-residue contacts and estimating their energies statistically. By scoring structures with energies derived from a stricter notion of contact, I show that native protein structures can be identified out of a set of decoy structures more often than when using energies derived from traditional definitions of contact and how this has implications for the evaluation of predictions that rely on structurally defined contacts for validation. Demonstrating how useful simple geometric descriptors of structure can be, I then show that these energies identify native structures on par with well-validated, detailed, atomistic energy functions. Moving to a higher level of structure, in my later work I demonstrate that discretized, geometrically defined structural fragments make good objects for the interactive assembly of protein backbones and present a software application which lets users do so. Finally, I use these fragments to generate structure-conditioned statistical energies, generalizing the classic idea of contact energies by incorporating specific structural context, enabling these energies to reflect the interaction geometries they come from. These structure-conditioned energies contain more information about native sequence preferences, correlate more highly with experimentally determined energies, and show that pairwise sequence preferences are tightly coupled to their structural context. Considered jointly, these projects highlight the degree to which protein structures and the interactions they comprise can be understood as geometric elements coming together in finely tuned ways

    DeepFrag-k: A Fragment-Based Deep Learning Approach for Protein Fold Recognition

    Get PDF
    Background: One of the most essential problems in structural bioinformatics is protein fold recognition. In this paper, we design a novel deep learning architecture, so-called DeepFrag-k, which identifies fold discriminative features at fragment level to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multi-modal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and then the second stage uses a deep convolutional neural network (CNN) to classify the fragment vector into the corresponding fold. Results: Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition. Conclusions: There is a set of fragments that can serve as structural “keywords” distinguishing between major protein folds. The deep learning architecture in DeepFrag-k is able to accurately identify these fragments as structure features to improve protein fold recognition

    Allosteric Regulation at the Crossroads of New Technologies: Multiscale Modeling, Networks, and Machine Learning

    Get PDF
    Allosteric regulation is a common mechanism employed by complex biomolecular systems for regulation of activity and adaptability in the cellular environment, serving as an effective molecular tool for cellular communication. As an intrinsic but elusive property, allostery is a ubiquitous phenomenon where binding or disturbing of a distal site in a protein can functionally control its activity and is considered as the “second secret of life.” The fundamental biological importance and complexity of these processes require a multi-faceted platform of synergistically integrated approaches for prediction and characterization of allosteric functional states, atomistic reconstruction of allosteric regulatory mechanisms and discovery of allosteric modulators. The unifying theme and overarching goal of allosteric regulation studies in recent years have been integration between emerging experiment and computational approaches and technologies to advance quantitative characterization of allosteric mechanisms in proteins. Despite significant advances, the quantitative characterization and reliable prediction of functional allosteric states, interactions, and mechanisms continue to present highly challenging problems in the field. In this review, we discuss simulation-based multiscale approaches, experiment-informed Markovian models, and network modeling of allostery and information-theoretical approaches that can describe the thermodynamics and hierarchy allosteric states and the molecular basis of allosteric mechanisms. The wealth of structural and functional information along with diversity and complexity of allosteric mechanisms in therapeutically important protein families have provided a well-suited platform for development of data-driven research strategies. Data-centric integration of chemistry, biology and computer science using artificial intelligence technologies has gained a significant momentum and at the forefront of many cross-disciplinary efforts. We discuss new developments in the machine learning field and the emergence of deep learning and deep reinforcement learning applications in modeling of molecular mechanisms and allosteric proteins. The experiment-guided integrated approaches empowered by recent advances in multiscale modeling, network science, and machine learning can lead to more reliable prediction of allosteric regulatory mechanisms and discovery of allosteric modulators for therapeutically important protein targets
    • …
    corecore