Dynamics of domain coverage of the protein sequence universe
Background
The currently known protein sequence space consists of millions of sequences in public databases and is rapidly expanding. Assigning sequences to families leads to a better understanding of protein function and the nature of the protein universe. However, a large portion of the current protein space remains unassigned and is referred to as its "dark matter".
Results
Here we suggest that the true size of "dark matter" is much larger than stated by current definitions. We propose an approach to reducing the size of "dark matter" by identifying and subtracting regions in protein sequences that are not likely to contain any domain.
Conclusions
Recent improvements in computational domain modeling result in a decrease, albeit slowly, in the relative size of "dark matter"; however, its absolute size increases substantially with the growth of sequence data.
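The bookkeeping behind this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the 1-based inclusive interval convention, and the toy coordinates are all hypothetical. Residues are split into those covered by a domain assignment, those subtracted as unlikely to contain a domain, and the remaining "dark" residues.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_length(intervals):
    """Number of residues covered by the union of the intervals."""
    return sum(end - start + 1 for start, end in merge_intervals(intervals))


def dark_matter(length, domains, excluded=()):
    """Split a sequence of `length` residues into assigned, excluded and dark.

    `domains` are regions with a domain assignment; `excluded` are regions
    judged unlikely to contain any domain (hypothetical inputs). All
    intervals are assumed to lie within 1..length.
    """
    assigned = covered_length(domains)
    accounted = covered_length(list(domains) + list(excluded))
    return {
        "assigned": assigned,
        "excluded": accounted - assigned,  # excluded residues not already in a domain
        "dark": length - accounted,
    }


# Toy example: a 100-residue protein with two overlapping domain hits
# and one region subtracted as unlikely to hold a domain.
result = dark_matter(100, domains=[(1, 30), (25, 50)], excluded=[(45, 60)])
print(result)
```

Summing the "dark" counts over a database gives an absolute size, and dividing by the total residue count gives a relative size, which is how the two can move in opposite directions as the database grows.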
DOMMINO: a database of macromolecular interactions
With the growing number of experimentally resolved structures of macromolecular complexes, it becomes clear that the interactions that involve protein structures are mediated not only by the protein domains, but also by various non-structured regions, such as interdomain linkers, or terminal sequences. Here, we present DOMMINO (http://dommino.org), a comprehensive database of macromolecular interactions that includes the interactions between protein domains, interdomain linkers, N- and C-terminal regions and protein peptides. The database complements SCOP domain annotations with domain predictions by SUPERFAMILY and is automatically updated every week. The database interface is designed to provide the user with a three-stage pipeline to study macromolecular interactions: (i) a flexible search that can include a PDB ID, type of interaction, SCOP family of interacting proteins, organism name, interaction keyword and a minimal threshold on the number of contact pairs; (ii) visualization of subunit interaction network, where the user can investigate the types of interactions within a macromolecular assembly; and (iii) visualization of an interface structure between any pair of the interacting subunits, where the user can highlight several different types of residues within the interfaces as well as study the structure of the corresponding binary complex of subunits
Structural properties of the linkers connecting the N- and C-terminal domains in the MocR bacterial transcriptional regulators
Peptide inter-domain linkers are peptide segments covalently linking two adjacent domains within a protein. Linkers play a variety of structural and functional roles in naturally occurring proteins. In this work we analyze the sequence properties of the predicted linker regions of the bacterial transcriptional regulators belonging to the recently discovered MocR subfamily of the GntR regulators. Analyses were carried out on the MocR sequences taken from the phyla Actinobacteria, Firmicutes, Alpha-, Beta- and Gammaproteobacteria. The results suggest that MocR linkers display phylum-specific characteristics and unique features different from those already described for other classes of inter-domain linkers. Their average length is significantly higher: 31.8 ± 14.3 residues, reaching a maximum of about 150 residues. Compositional propensities displayed general and phylum-specific trends. Pro dominates in all linkers. Dyad propensity analysis indicates Pro-Pro as the most frequent amino acid pair in all linkers. Physicochemical properties of the linker regions were assessed using amino acid indices relative to different features: in general, MocR linkers are flexible, hydrophilic and display propensity for β-turn or coil conformations. Linker sequences are hypervariable: only similarities between MocR linkers from organisms related at the level of species or genus could be found with sequence searches. The results shed light on the properties of the linker regions of the new MocR subfamily of bacterial regulators and may provide knowledge-based rules for designing artificial linkers with desired properties. © 2016 The Author(s)
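As a rough illustration of the kind of statistics reported above (length as mean ± standard deviation, and the most frequent adjacent amino acid pair), here is a minimal Python sketch. The linker sequences are invented toy data, the function names are hypothetical, and raw dyad counts are used in place of the propensity measure of the study, whose exact definition is not given in the abstract.

```python
from collections import Counter
from statistics import mean, stdev


def dyad_counts(linkers):
    """Count adjacent residue pairs (dyads) over a set of linker sequences."""
    counts = Counter()
    for seq in linkers:
        # zip(seq, seq[1:]) yields every pair of neighbouring residues.
        counts.update(a + b for a, b in zip(seq, seq[1:]))
    return counts


def length_summary(linkers):
    """Return (mean, sample standard deviation) of linker lengths."""
    lengths = [len(seq) for seq in linkers]
    return mean(lengths), stdev(lengths)


# Toy linker sequences (invented for illustration only).
linkers = ["PPAPSE", "GPPTPP", "APPE"]

counts = dyad_counts(linkers)
avg, sd = length_summary(linkers)
print(counts.most_common(3))
print(f"length: {avg:.1f} +/- {sd:.1f} residues")
```

A propensity would normally divide these observed frequencies by a background expectation; which background to use is a design choice that this sketch deliberately leaves open.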
Learning from cadherin structures and sequences: affinity determinants and protein architecture
Cadherins are a family of cell-surface proteins mediating adhesion that are important in development and maintenance of tissues. The family is defined by the repeating cadherin domain (EC) in their extracellular region, but they are diverse in terms of protein size, architecture and cellular function. The best-understood subfamily is the type I classical cadherins, which are found in vertebrates and have five EC domains. Among the five different type I classical cadherins, the binding interactions are highly specific in their homo- and heterophilic binding affinities, though their sequences are very similar. As previously shown, E- and N-cadherins, two prototypic members of the subfamily, differ in their homophilic K_D by about an order of magnitude, while their heterophilic affinity is intermediate. To examine the source of the binding affinity differences among type I cadherins, we used crystal structures, analytical ultracentrifugation (AUC), surface plasmon resonance (SPR), and electron paramagnetic resonance (EPR) studies. Phylogenetic analysis and binding affinity behavior show that the type I cadherins can be further divided into two subgroups, with E- and N-cadherin representing each. In addition to the affinity differences in their wild-type binding through the strand-swapped interface, a second interface also shows an affinity difference between E- and N-cadherin. This X-dimer interface, which is a weakly binding kinetic intermediate in E-cadherin, has a much stronger affinity in N-cadherin: nearly as strong as N-cadherin wild-type binding. In the swapped and X-dimer interactions of E- and N-cadherin, differences in hydrophobic surface area can mostly account for the affinity difference. However, several mutants of N-cadherin have a K_D an order of magnitude stronger even than the wild-type N-cadherin. 
In these mutants, the source of the strong affinity seems to be entropic stabilization through an equilibrium between multiple conformations with similar energies. We thus have a molecular-level understanding of vertebrate classical cadherins, with a detailed understanding of their adhesive mechanism and their binding affinity determinants. However, the adhesive mechanisms of cadherins from invertebrates, which are structurally divergent yet function in similar roles, remain unknown. We present crystal structures of the predicted N-terminal region of Drosophila N-cadherin (DN-cadherin). Of the 16 total predicted EC domains, we have crystallized the EC1-3 and EC1-4 segments. While the linker regions for the EC1-EC2 and EC3-EC4 pairs display binding of three Ca^2+ ions similar to that in vertebrate cadherins, domains EC2 and EC3 are joined in a bent orientation by a novel, previously uncharacterized Ca^2+-free linker. Based on sequence analysis of the further ECs of DN-cadherin, we predict another such Ca^2+-free linker between EC7 and EC8. Biophysical analysis demonstrates that a construct containing the first nine predicted EC domains of DN-cadherin forms homodimers with affinity similar to vertebrate classical cadherins. Intriguingly, this segment contains both the crystallized and predicted Ca^2+-free linkers, suggesting a complex binding interface. Sequence analysis of the cadherin family reveals that similar Ca^2+-free linkers are widely distributed in the ectodomains of both vertebrate and invertebrate cadherins. In cases of long cadherins, there are frequently multiple Ca^2+-free linkers in a single protein chain. It thus appears that a combination of calcium-binding and calcium-free linkers can allow cadherins to form three-dimensional arrangements that are more complex than the extended, calcium-rigidified structures in classical cadherins. 
Discovery of the Ca^2+-free linker, together with the differing numbers and arrangements of ECs and other domain types, implies that the cadherin superfamily is more structurally diverse than previously thought. Because little is known about the function and even less about the structure of the majority of the superfamily, studying the linear architecture (i.e. the precise sequence of ECs and the characteristics of the interdomain linkers) at the scale of the superfamily would give significant new insights on the structure and function of less-understood cadherins. With this motivation, we have constructed a cadherin database with relevant information on two different scales: the protein and the domain. On the whole protein level, we represent the architecture of each cadherin by recording the arrangement of ECs, different linker types, and other (non-EC) domain types in the protein. On the individual EC level, based on the sequence, we record the domain characteristics that give rise to the different structural features at the protein level. We have annotated over 9,600 proteins from 560 organisms, containing over 69,000 ECs; and built an online interface to search and access this information. Our aim is to provide a tool for understanding the protein architecture, function, and relationships among cadherins, a structurally diverse protein family. Together, these studies examine the relationships between sequence, structure and function of cadherins at different scales. In the classical cadherin study, small changes of one or two residues can dramatically alter the dimer conformations and thus lead to large differences in binding affinity between highly related cadherins, or between wild-type and mutant proteins. These seemingly small mutations can result in even higher binding affinity with the effect of entropic stabilization by multiple conformations. 
In DN-cadherin, the absence of certain calcium-binding motifs in adjacent ECs leads to a new linker type and a new interdomain orientation. This, in turn, has major implications for the global shape, and possibly the binding mechanism, of the protein. The cadherin database aims to provide information at different structural levels in order to allow users to draw connections between primary sequence, domain structure and protein architecture, to ultimately learn about protein function.
The Role of Initiation Factor Dynamics in Translation Initiation
Like most biological polymerization reactions, ribosome-catalyzed protein synthesis, or translation, can be divided into initiation, elongation, and termination stages. Initiation is the rate-limiting stage of translation and a critical site for translational control of gene expression. Throughout all stages of protein synthesis, the ribosome is aided by essential protein co-factors known as translation factors. I have studied the role that two translation initiation factors, IF1 and IF3, play in the mechanism and regulation of translation initiation in Escherichia coli. Specifically, I have used single-molecule fluorescence resonance energy transfer (smFRET) as a primary tool for investigating how the dynamics of IF1 and IF3 regulate the accuracy with which the translational machinery selects an initiator transfer RNA (tRNA) and the correct messenger RNA (mRNA) start codon during the initiation stage of protein synthesis
Computational modelling of multidomain proteins with covarying residue pairs
The vast majority of known protein sequences have no solved three-dimensional structure at all, and the remaining ones usually have not been completely characterised, due to the limitations of experimental structural biology techniques. Structural genomics projects have helped increase the coverage of the protein structure universe, but most available structures still consist of either individual domains or sets of relatively small ones. This has prompted the development of computational methods for protein structure prediction, as well as for multidomain architecture modelling. One appealing idea to achieve this goal consists of detecting residue-residue contacts from multiple sequence alignments, under the assumption that they covary in order to maintain the local microenvironment and the overall stability of protein structures. After early limited success, this type of analysis has lately witnessed substantial progress, thanks to theoretical advances in disentangling genuine from spurious instances of correlation. Unsurprisingly, structural bioinformatics has promptly and successfully applied these improved tools to model globular and transmembrane proteins, along with guiding the assembly of protein complexes. However, the efficacy of these methods in the context of multidomain protein modelling has not yet been investigated. In this thesis state-of-the-art methods for predicting contacts from sequence data have been evaluated and used to build models of two-domain protein structures. Firstly, the ability of alternative methods to identify interdomain contacts was examined in a reference set of experimentally solved structures. Secondly, predicted contacts were employed to score docking models and select near-native solutions accordingly. Finally, predicted contacts were used to guide the assembly of individual domains in a multidomain modelling protocol
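To make the covariation idea concrete, the sketch below computes plain mutual information between two columns of a multiple sequence alignment. This is the simplest covariation score, not one of the state-of-the-art methods evaluated in the thesis, and it does not separate genuine from spurious correlation. The alignment is invented toy data and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences; i and j are
    0-based column indices. High values suggest the columns covary.
    """
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in pairs.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (n_ab / n) * log2(n_ab * n / (col_i[a] * col_j[b]))
    return mi


# Toy alignment: columns 0 and 1 change together, column 2 is
# conserved, and column 3 varies independently of column 0.
msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

mi_covarying = mutual_information(msa, 0, 1)
mi_independent = mutual_information(msa, 0, 3)
mi_conserved = mutual_information(msa, 0, 2)
print(mi_covarying, mi_independent, mi_conserved)
```

For interdomain contact prediction, column i would be taken from one domain and column j from the other, and the highest-scoring pairs would be treated as candidate contacts.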
Crystal structure of the ZP-N domain of ZP3 reveals the core fold of animal egg coats
Species-specific recognition between the egg extracellular matrix (zona pellucida) and sperm is the first, crucial step of mammalian fertilization. Zona pellucida filament components ZP3 and ZP2 act as sperm receptors, and mice lacking either of the corresponding genes produce oocytes without a zona pellucida and are completely infertile. Like their counterparts in the vitelline envelope of non-mammalian eggs and many other secreted eukaryotic proteins, zona pellucida subunits polymerize using a 'zona pellucida (ZP) domain' module, whose conserved amino-terminal part (ZP-N) was suggested to constitute a domain of its own. No atomic structure has been reported for ZP domain proteins, and there is no structural information on any conserved vertebrate protein that is essential for fertilization and directly involved in egg-sperm binding. Here we describe the 2.3 Ångström (Å) resolution structure of the ZP-N fragment of mouse primary sperm receptor ZP3. The ZP-N fold defines a new immunoglobulin superfamily subtype with a beta-sheet extension characterized by an E' strand and an invariant tyrosine residue implicated in polymerization. The structure strongly supports the presence of ZP-N repeats within the N-terminal region of ZP2 and other vertebrate zona pellucida/vitelline envelope proteins, with implications for overall egg coat architecture, the post-fertilization block to polyspermy and speciation. Moreover, it provides an important framework for understanding human diseases caused by mutations in ZP domain proteins and developing new methods of non-hormonal contraception.
Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression
Background: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to be able to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. Results: The DynDom database of protein domain movements comprises sequences annotated to indicate whether the amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, this data was used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though it is the most rigid amino acid. As hinge-bending regions have been previously shown to occur frequently at the terminal regions of the secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that improvement in performance occurs up to very long window lengths (eighty residues) indicating long-range correlations. 
Conclusion: In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that in the prediction of hinge-bending regions a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk
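The window-based setup described in this abstract can be sketched as follows. This shows only how per-residue examples and the class imbalance might be prepared; it does not implement Kernel Logistic Regression and is not the HingeSeek code. The sequence, labels, padding character and function names are all assumptions made for illustration.

```python
def residue_windows(sequence, labels, half_width, pad="-"):
    """Yield (window, label) for every residue in `sequence`.

    Each window is centred on one residue and spans `half_width`
    residues on either side; positions beyond the termini are filled
    with `pad`. `labels` holds 1 for hinge residues, 0 for intradomain.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    for k, label in enumerate(labels):
        yield padded[k:k + size], label


def positive_fraction(labels):
    """Fraction of residues labelled as hinge (the minority class)."""
    return sum(labels) / len(labels)


# Toy data (invented): an 8-residue sequence with two hinge residues.
sequence = "MKTPGAIL"
labels = [0, 0, 0, 1, 1, 0, 0, 0]

examples = list(residue_windows(sequence, labels, half_width=2))
for window, label in examples:
    print(window, label)
print("hinge fraction:", positive_fraction(labels))
```

Each window would then be encoded numerically and passed to a classifier; a larger `half_width` corresponds to the longer window lengths discussed in the abstract.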