10 research outputs found

    New experimental and theoretical tools for studying protein systems with elements of structural disorder

    Disordered proteins are a class of proteins that do not possess well-folded three-dimensional structures as their native conformations. Many eukaryotic proteins have been found to be fully disordered or to contain disordered regions. Disordered proteins usually display several characteristic properties, such as increased motional freedom and the resulting conformational heterogeneity. The elements of structural disorder are commonly involved in many important biological functions and are implicated in many diseases. Therefore, the study of disordered proteins has become one of the most important research topics in recent years. This thesis presents results from three different research projects; the common feature is that all systems being studied contain varying amounts of structural disorder. Most results have been obtained from experimental nuclear magnetic resonance (NMR) studies and molecular dynamics (MD) simulations. Both are among the most popular biophysical techniques for studying molecular dynamics. The first project investigates the relationship between domain cooperativity and residual dipolar coupling (RDC) parameters based on a series of two-domain chimera proteins with disordered linkers. Many eukaryotic proteins contain multiple domains, and their biological functions are closely related to the property of domain cooperativity, which is often regulated by the linker region. Therefore, it is necessary to develop suitable tools to characterize linker region properties in order to better understand the biological functions of multidomain proteins. The second project is about the development of NMR pulse sequences for studying disordered proteins. Two new NMR pulse sequences, PD-CPMG and CP-HISQC, have been developed. Both experiments are well suited for studying intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) under physiological conditions.
These two experiments provide higher precision in 15N R2 rate measurements and higher sensitivity in 1H-15N HSQC spectra, respectively. They also show many advantages over most other existing experiments for studying IDPs. The last project studies a protein-peptide encounter complex using the Crk-Sos model system. The ten-residue Sos peptide serves as a minimal model for disordered proteins. The encounter complex is an important type of intermediate state formed during many protein interactions. Such complexes are usually characterized by a large amount of motional freedom and conformational heterogeneity. Their properties are therefore considerably different from those of tight-binding complexes, which are more commonly studied. Although it is usually quite difficult to study encounter complexes using standard biophysical techniques, in this project we have successfully characterized the structural and dynamic properties of the Crk-Sos electrostatic encounter complex with a combination of MD simulations and experimental NMR approaches. The structural model based on MD trajectories shows directly that the Sos peptide in the encounter complex remains highly dynamic, sampling a large area on the surface of the Crk N-SH3 domain. Such a strategy can also be utilized for studying many other encounter complexes involving disordered proteins or peptide

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation.
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and one of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo

    Prediction of protein-protein interaction types using machine learning approaches

    Prediction and analysis of protein-protein interactions (PPIs) is an important problem in life science research because of the fundamental roles of PPIs in many biological processes in living cells. One of the important problems surrounding PPIs is the identification and prediction of different types of complexes, which are characterized by properties such as the type and number of proteins that interact, the stability of the proteins, and the duration of the interactions. This thesis focuses on studying the temporal and stability aspects of PPIs, mostly using structural data. We have addressed the problem of predicting obligate and non-obligate protein complexes, as well as the distinction between transient and permanent complexes, because of the importance of non-obligate and transient complexes as therapeutic targets for drug discovery and development. We have presented a computational model to predict protein interaction types using our proposed physicochemical features of desolvation and electrostatic energies, together with structural and sequence domain-based features. To achieve a comprehensive comparison and demonstrate the strength of our proposed features in predicting PPI types, we have also computed a wide range of previously used properties, including physical features of interface area, chemical features of hydrophobicity and amino acid composition, and physicochemical features of solvent-accessible surface area (SASA) and atomic contact vectors (ACV). After extracting the main features of the complexes, a variety of machine learning approaches have been used to predict PPI types. The prediction is performed via several state-of-the-art classification techniques, including linear dimensionality reduction (LDR), support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (k-NN).
Moreover, several feature selection algorithms including gain ratio (GR), information gain (IG), chi-square (Chi2) and minimum redundancy maximum relevance (mRMR) are applied to the available datasets to obtain more discriminative and relevant properties to distinguish between these two types of complexes. Our computational results on different datasets confirm that using our proposed physicochemical features of desolvation and electrostatic energies leads to significant improvements in prediction performance. Moreover, using structural and sequence domains of CATH and Pfam and carrying out biological analysis helps us to achieve better insight into obligate and non-obligate complexes and their interactions
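
As an illustration of one of the feature selection criteria named in this abstract, the following is a minimal sketch of information gain for a single discrete feature. It is not taken from the thesis: the function names, the toy labels and the toy feature values are all hypothetical, and continuous features such as energies would first need to be discretised.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the entropy within each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four complexes, two per class.
labels = ["obligate", "obligate", "non-obligate", "non-obligate"]
informative = ["high", "high", "low", "low"]    # separates the classes
uninformative = ["a", "b", "a", "b"]            # carries no class signal

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Features would then be ranked by their gain and the top-scoring ones kept; gain ratio additionally divides by the entropy of the feature itself.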

    Molecular Dynamics for Synthetic Biology

    Synthetic biology is the field concerned with the design, engineering, and construction of organisms and biomolecules. Biomolecules such as proteins are nature's nano-bots, and provide both a shortcut to the construction of nano-scale tools and insight into the design of abiotic nanotechnology. A fundamental technique in protein engineering is protein fusion, the concatenation of two proteins so that they form domains of a new protein. The resulting fusion protein generally retains both functions, especially when a linker sequence is introduced between the two domains to allow them to fold independently. Fusion proteins can have features absent from all of their components; for example, FRET biosensors are fusion proteins of two fluorescent proteins with a binding domain. When the binding domain forms a complex with a ligand, its dynamics translate the concentration of the ligand to the ratio of fluorescence intensities via FRET. Despite these successes, protein engineering remains laborious and expensive. Computer modelling has the potential to improve the situation by enabling some design work to occur virtually. Synthetic biologists commonly use fast, heuristic structure prediction tools like ROSETTA, I-TASSER and FoldX, despite their inaccuracy. By contrast, molecular dynamics with modern force fields has proven itself accurate, but sampling sufficiently to solve problems accurately and quickly enough to be relevant to experimenters remains challenging. In this thesis, I introduce molecular dynamics to a structural biology audience, and discuss the challenges and theory behind the technique. With this knowledge, I introduce synthetic biology through a review of fluorescent sensors. I then develop a simple computational tool, Rangefinder, for the design of one variety of these sensors, and demonstrate its ability to predict sensor performance experimentally. 
I demonstrate the importance of the choice of linker with yet another sensor whose performance depends critically thereon. In chapter 6, I investigate the structure of a conserved, repeating linker sequence connecting two domains of the malaria circumsporozoite protein. Finally, I develop a multi-scale enhanced sampling molecular dynamics approach to predicting the structure and dynamics of fusion proteins. It is my hope that this work contributes to the structural biology community's understanding of molecular dynamics and inspires new techniques developed for protein engineering

    Using Structural Bioinformatics to Model and Design Membrane Proteins

    Cells require membrane proteins for a wide spectrum of critical functions. Transmembrane proteins enable cells to communicate with their environment and to carry out catalysis, ion transport and scaffolding. The functional roles of membrane proteins are specified by their sequence composition and precise three-dimensional folding. The exact mechanisms driving the folding of membrane proteins are still not fully understood. Further, the association between membrane proteins occurs with pinpoint specificity. For example, there exist common sequence features within families of transmembrane receptors, yet there is little cross talk between families. Therefore, we ask how membrane proteins dial in their specificity and what factors are responsible for adoption of native structure. Advancements in membrane protein structure determination methods have been followed by a sharp increase in the number of three-dimensional structures. Structural bioinformatics has been utilized effectively to study water-soluble proteins. The field is now entering an era where structural bioinformatics can be applied to modeling membrane proteins without structure and to engineering novel membrane proteins. The transmembrane domains of membrane proteins were first categorized structurally. From this analysis, we are able to describe the ways in which membrane proteins fold and associate. We further derived sequence profiles for the commonly occurring structural motifs, enabling us to investigate the role of amino acids within the bilayer. Utilizing these tools, a transmembrane structural model was constructed of principal cell surface receptors (integrins). The structural model shed light on possible signaling mechanisms and led us to propose a novel membrane protein packing motif. In addition, novel scoring functions for membrane proteins were developed and applied to modeling membrane proteins. We derived the first all-atom membrane statistical potential and introduced the usage of exposed volume.
These potentials allowed modeling of complex interactions in membrane proteins, such as salt bridges. To understand the geometric preferences of salt bridges, we surveyed a structural database. We learned about large biases in salt bridge orientations that will be useful in modeling and design. Lastly, we combine these structural bioinformatic efforts, enabling us to model membrane proteins in ways which were previously inaccessible

    Predicting multidomain protein structure and function via co-evolved amino acids and application to polyketide synthases

    Proteins are an important building block of life, and they are responsible for many processes in living organisms. Therefore, understanding their functions and working mechanisms is vital for answering many questions about diseases and is a basis for the development of novel drugs. The three-dimensional (3D) structures of proteins determine their functions; therefore, the determination of the 3D structures of proteins has been studied widely. Although many experimental techniques have been developed to determine the structures of proteins, they have limitations, especially for large protein complexes. Protein structure can help in understanding protein function, as can looking at conserved residues, but typically time-consuming mutagenesis experiments combined with protein function assays are needed. As an alternative to the experimental methods, researchers have been working on developing computational approaches. While it is relatively easy to predict structures when the structure of a homologous protein is known, as it can be used as a template, the prediction of protein structures in the absence of a template is more challenging. For template-free predictions, coevolved amino acid residue pairs, predicted from the alignment of the homologous sequences, have provided promising improvements in the field. More recently, successful implementation of artificial neural networks, fed by the predicted coevolved residue pairs, has improved the accuracy of the predicted structures further. Although there are promising developments in the coevolution-based approaches, especially for the structure prediction of small/medium-sized proteins, more developments are needed for predicting protein structure, particularly of large protein complexes. Here, we show that the prediction of distances between residue pairs, via deep neural networks fed by predictions of coevolved residue pairs, improves the accuracy of structure prediction in small/medium-sized proteins.
The prediction of residue pair distances, using a similar approach, in two interacting domains also allows us to predict how two domains on the same chain interact with each other. Further, we show that prediction of coevolved residue groups, via statistical coupling analysis, allows us to determine functional boundaries of domains and diverged amino acid patterns in the sub-types of the domains in a multi-domain protein complex, a polyketide synthase. We found that using predicted distances, in addition to the predicted residue pairs in contact, allows us to generate structures closer to the experimental structures, and to select them as the final models in a straightforward approach. Additionally, we reveal that the distances of the residue pairs on interacting domain pairs can be predicted accurately, leading to the successful prediction of the structural interface between two interacting proteins when the interface surface is large and the sequence alignment is comprehensive enough. Finally, we found that functional domain boundaries, which are consistent with the experimental studies, can be determined. Also, some coevolved residue groups have distinct amino acid patterns in different domain sub-types, including the positions already known as the fingerprint motifs of the different sub-types. These approaches can be applied to predict the structures of individual domains and to predict how two domains interact with each other, which can be used to predict the structure of multi-domain proteins. The work on polyketides here demonstrates how these developments might be applied, since identifying domain boundaries and residues important for substrate specificity should aid in the design of novel polyketide synthases and thus of novel polyketides. This in itself is an important development given the commercial and medicinal importance of polyketides, but also opens the way to similar analysis on other multidomain proteins

    Mechanism of action of non-synonymous single nucleotide variations associated with α-carbonic anhydrases II, IV and VIII

    The carbonic anhydrase (CA) group of enzymes are zinc (Zn2+) metalloproteins responsible for the reversible hydration of CO2 to bicarbonate (BCT or HCO3−) and protons (H+) for the facilitation of acid-base balance and homeostasis within the body. Across all organisms, a minimum of six CA families exist, namely α (alpha), β (beta), γ (gamma), δ (delta), η (eta) and ζ (zeta). Some organisms have more than one family, with the exception of humans, which contain only the α family. The α-CA family comprises 16 isoforms (CA-I to CA-XV), including the CA-VIII, CA-X and CA-XI acatalytic isoforms. Of the catalytic isoforms, CA-II and CA-IV have among the fastest rates of reaction, and any disturbances to the function of these enzymes result in CA deficiencies and undesirable phenotypes. CA-II deficiencies result in osteopetrosis with renal tubular acidosis and cerebral calcification, whereas CA-IV deficiencies result in retinitis pigmentosa 17 (RP17). Phenotypic effects generally manifest as a result of poor protein folding and function due to the presence of non-synonymous single nucleotide variations (nsSNVs). Even within the acatalytic isoforms, such as CA-VIII, which allosterically regulates the affinity of inositol triphosphate (IP3) for the IP3 receptor type 1 (ITPR1) and regulates calcium (Ca2+) signalling, the presence of SNVs causes phenotypes such as cerebellar ataxia, mental retardation, and dysequilibrium syndrome 3 (CAMRQ3). Currently the majority of research into the CAs is focused on the inhibition of these proteins to achieve therapeutic effects in patients via the control of HCO3− production or reabsorption, as observed in glaucoma and diuretic medications. Little research has therefore been devoted to the identification of stabilising or activating compounds that could rescue protein function in the case of deficiencies.
The main aim of this research was to identify and characterise the effects of nsSNVs on the structure and function of CA-II, CA-IV and CA-VIII to set a foundation for rare disease studies into the CA group of proteins. Combined bioinformatics approaches divided into four main objectives were implemented. These included variant identification, sequence analysis and protein characterisation, force field (FF) parameter generation, molecular dynamics (MD) simulation and dynamic residue network analysis (DRN). Six variants for each of the CA-II, CA-IV and CA-VIII proteins with pathogenic annotations were identified from the HUMA and Ensembl databases. These included the pathogenic variants K18E, K18Q, H107Y, P236H, P236R and N252D for CA-II. CA-IV included the pathogenic R69H, R219C and R219S, and benign N86K, N177K and V234I variants. CA-VIII included pathogenic S100A, S100P, G162R and R237Q, and benign S100L and E109D variants. CA-II has been more extensively studied than CA-IV and CA-VIII; therefore, residues essential to its function and stability are known. To discover important residues and regions within the CA-IV and CA-VIII proteins, sequence and motif analysis was performed across the α-CA family, using CA-II as a reference. Sequence analysis identified multiple conserved residues between the two catalytic CA-II and CA-IV, and the acatalytic CA-VIII isoforms that were proposed to be essential for protein stability. With the exception of the benign N86K CA-IV variant, none of the other pathogenic or benign CA-II, CA-IV and CA-VIII SNVs were located at functionally or structurally important residues. Motif analysis identified 11 conserved and important motifs within the α-CA family. Several of the identified variants were located on these motifs including K18E, K18Q, H107Y and N252D (CA-II); N86K, R219C, R219S and V234I (CA-IV); and E109D, G162R and R237Q (CA-VIII).
As there were no X-ray crystal structures of the variant proteins, homology modelling was performed to calculate the protein structures for characterisation. In CA-VIII, the substitution of Ser for Pro at position 100 (variant S100P) resulted in destruction of the β-sheet that the SNV was located on. Little is known about the mechanism of interaction between CA-VIII and ITPR1, and the residues involved. SiteMap and CPORT were used to identify binding site amino acids for CA-VIII, and the results identified 38 potential residues. Traditional FFs are incapable of supporting MD simulations of metalloproteins. The AMBER ff14SB FF was extended and Zn2+ FF parameters calculated to add support for metalloprotein MD simulations. In the protein, Zn2+ was noted to have a charge less than +1. Variant effects on protein structure were then investigated using MD simulations. Root mean square deviation (RMSD) and radius of gyration (Rg) results indicated subtle SNV effects on the variant global structure in CA-II and CA-IV. However, with regard to CA-VIII, RMSD analysis highlighted that variant presence was associated with increases in the structural rigidity of the protein. Principal component analysis (PCA) in conjunction with free energy analysis was performed to observe variant effects on protein conformational sampling in 3D space. The binding of BCT to CA-II induced greater protein conformational sampling and was associated with higher free energy. In CA-IV and CA-VIII, PCA revealed key differences in the mechanism of action of pathogenic and benign SNVs. In CA-IV, wild-type (WT) and benign variant protein structures clustered into a single low-energy well, hinting at the presence of more stable structures. Pathogenic variants were associated with higher free energy, and the proteins sampled more conformations without settling into a low-energy well. PCA of CA-VIII indicated the opposite of CA-IV.
Pathogenic variants were clustered into low-energy wells, while the WT and benign variants showed greater conformational sampling. Dynamic cross correlation (DCC) analysis was performed using the MD-TASK suite to determine variant effects on residue movement. In the CA-II WT protein, BCT and CO2 were associated with anti-correlated and correlated residue movement, respectively, hinting at opposite mechanisms. In CA-IV and CA-VIII, variant presence resulted in a change in residue correlation compared to the WT proteins. DRN analysis was performed to investigate SNV effects on residue accessibility and communication. Results demonstrated that SNVs are associated with allosteric effects on the CA protein structures, and that the effects are located on the stability-assisting residues of the aromatic clusters and the active site of the proteins. CA-II studies discovered that Glu117 is the most important residue for communication, and variant presence results in a decrease in the usage of the residue. This effect was greatest in the CA-II H107Y SNV, and suggests that variants could have an effect on Zn2+ dissociation from the active site. Decreases in the usage of Zn2+ coordinating residues were also noted. Where this occurred, compensatory increases in the usage of other primary and secondary coordination residues were observed, which could possibly assist with the maintenance of Zn2+ within the active site. The CA-IV variants R69H and R219C highlighted potentially similar pathogenic mechanisms, whereas N86K and N177K hinted at potentially similar benign mechanisms. Within CA-VIII, variant presence was associated with changes in the accessibility of the N-terminal binding site residues. The benign CA-VIII variants highlighted possible compensatory mechanisms, whereby as one group of N-terminal residues loses accessibility, there is an increase in the accessibility of other binding site residues to possibly balance the effect.
Catalytically, the proton shuttle residue His64 in CA-II was found to occupy a novel conformation, named “faux in”, that brought the imidazole group even closer to the Zn2+ than the “in” conformation. Overall, compared to traditional MD simulations, the incorporation of DRN allowed more detailed investigations into the variant mechanisms of action. This highlights the importance of network analysis in the study of the effects of missense mutations on the structure and function of proteins. Investigation of diseases at the molecular level is essential in the identification of disease pathogenesis and assists with the development of specifically tailored and better treatment options, especially in the cases of genetically associated rare diseases
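
The two global structure measures named in this abstract, RMSD and Rg, can be sketched in a few lines. This is an illustrative sketch only, not the authors' pipeline: the function names and toy coordinates are hypothetical, coordinates are plain (x, y, z) tuples, and the RMSD here assumes the two frames have already been superimposed (MD analysis tools normally perform that fitting step first).

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two frames of equal length.

    Assumes the frames are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("frames must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration; equal masses if none are given."""
    if masses is None:
        masses = [1.0] * len(coords)
    total_mass = sum(masses)
    # Centre of mass along each axis.
    centre = [
        sum(m * atom[axis] for m, atom in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    weighted = sum(
        m * sum((atom[axis] - centre[axis]) ** 2 for axis in range(3))
        for m, atom in zip(masses, coords)
    )
    return math.sqrt(weighted / total_mass)


# Hypothetical two-atom "structure" and a rigidly shifted copy of it.
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(3.0, 4.0, 0.0), (5.0, 4.0, 0.0)]

deviation = rmsd(reference, shifted)
compactness = radius_of_gyration(reference)
```

Because the shifted copy is a pure translation, its Rg equals that of the reference, while the unfitted RMSD reports the size of the shift; this is why trajectories are aligned before RMSD is interpreted as structural change.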

    Sequence Determinants of the Individual and Collective Behaviour of Intrinsically Disordered Proteins

    Intrinsically disordered proteins and protein regions (IDPs) represent around thirty percent of the eukaryotic proteome. IDPs do not fold into a set three-dimensional structure, but instead exist in an ensemble of inter-converting states. Despite being disordered, IDPs are decidedly not random; well-defined - albeit transient - local and long-range interactions give rise to an ensemble with distinct statistical biases over many length-scales. Among a variety of cellular roles, IDPs drive and modulate the formation of phase separated intracellular condensates, non-stoichiometric assemblies of protein and nucleic acid that serve many functions. In this work, we have explored how the amino acid sequence of IDPs determines their conformational behaviour, and how sequence and single chain behaviour influence their collective behaviour in the context of phase separation. In part I, in a series of studies, we used simulation, theory, and statistical analysis coupled with a wide range of experimental approaches to uncover novel rules that further explore how primary sequence and local structure influence the global and local behaviour of disordered proteins, with direct implications for protein function and evolution. We found that amino acid sidechains counteract the intrinsic collapse of the peptide backbone, priming the backbone for interaction and providing a fully reconciliatory explanation for the mechanism of action associated with the denaturants urea and GdmCl. We discovered that proline can engender a conformational buffering effect in IDPs to counteract standard electrostatic effects, and that the patterning of those proline residues can be a crucial determinant of the conformational ensemble. We developed a series of tools for analysing primary sequences on a proteome-wide scale and used them to discover that different organisms can have substantially different average sequence properties.
Finally, we determined that for the normally folded protein NTL9, the unfolded state under folding conditions is relatively expanded but has well-defined native and non-native structural preferences. In part II, we identified a novel mode of phase separation in biology, and explored how this could be tuned through sequence design. We discovered that phase separated liquids can be many orders of magnitude more dilute than simple mean-field theories would predict, and developed an analytic framework to explain and understand this phenomenon. Finally, we designed, developed and implemented a novel lattice-based simulation engine (PIMMS) to provide sequence-specific insight into the determinants of conformational behaviour and phase separation. PIMMS allows us to accurately and rapidly generate sequence-specific conformational ensembles and run simulations of hundreds of polymers with the goal of allowing us to systematically elucidate the link between primary sequence and phase separation

    Developing a framework for semi-automated rule-based modelling for neuroscience research

    Dynamic modelling has significantly improved our understanding of the complex molecular mechanisms underpinning neurobiological processes. The detailed mechanistic insights these models offer depend on the availability of a diverse range of experimental observations. Despite the huge increase in biomolecular data generation from novel high-throughput technologies and extensive research in bioinformatics and dynamical modelling, efficient creation of accurate dynamical models remains highly challenging. To study this problem, three perspectives are considered: comparison of modelling methods, prioritisation of results and analysis of primary data sets. Firstly, I compare two models of the DARPP-32 signalling network: a classically defined model with ordinary differential equations (ODE) and its equivalent, defined using a novel rule-based (RB) paradigm. The RB model recapitulates the results of the ODE model, but offers a more expressive and flexible syntax that can efficiently handle the “combinatorial complexity” commonly found in signalling networks, and allows ready access to fine-grained details of the emerging system. RB modelling is particularly well suited to encoding protein-centred features such as domain information and post-translational modification sites. Secondly, I propose a new pipeline for prioritisation of molecular species that arise during model simulation using a recently developed algorithm based on multivariate mutual information (CorEx) coupled with global sensitivity analysis (GSA) using the RKappa package. To efficiently evaluate the importance of parameters, Hilbert-Schmidt Independence Criterion (HSIC)-based indices are aggregated into a weighted network that allows compact analysis of the model across conditions. Finally, I describe an approach for the development of disease-specific dynamical models using genes known to be associated with Attention Deficit Hyperactivity Disorder (ADHD) as an exemplar.
Candidate disease genes are mapped to a selection of datasets that are potentially relevant to the modelling process (e.g. interactions between proteins and domains, protein-domain and kinase-substrate mappings) and these are jointly analysed using network clustering and pathway enrichment analyses to evaluate their coverage and utility in developing rule-based models
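
The Hilbert-Schmidt Independence Criterion mentioned in this abstract measures statistical dependence between two variables through kernel matrices. The sketch below is a generic illustration rather than the thesis's method or the RKappa API: it uses the biased empirical estimator trace(KHLH)/n^2 with Gaussian kernels on scalar samples, and the function names, the bandwidth and the toy data are all assumptions (some texts normalise by 1/(n-1)^2 instead).

```python
import math


def gaussian_kernel_matrix(values, bandwidth=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for scalar samples."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in values]
        for a in values
    ]


def centre(matrix):
    """Double-centre a square matrix, i.e. compute H M H with H = I - 1/n."""
    n = len(matrix)
    row_means = [sum(row) / n for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand_mean
         for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, bandwidth=1.0):
    """Biased empirical HSIC estimate, trace(K H L H) / n^2."""
    n = len(x)
    k_centred = centre(gaussian_kernel_matrix(x, bandwidth))
    l_matrix = gaussian_kernel_matrix(y, bandwidth)
    # trace(HKH L) equals the elementwise sum because both are symmetric.
    return sum(
        k_centred[i][j] * l_matrix[i][j] for i in range(n) for j in range(n)
    ) / n ** 2


# Hypothetical samples: a parameter, an output that depends on it,
# and an output that ignores it entirely.
parameter = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dependent_output = [p ** 2 for p in parameter]
constant_output = [7.0] * len(parameter)

score_dependent = hsic(parameter, dependent_output)
score_constant = hsic(parameter, constant_output)
```

A sensitivity index would typically normalise such scores so that parameters can be compared on a common scale; that step is omitted here.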