2,125 research outputs found

    Classification of Protein Sequences using the Growing Self-Organizing Map

    Protein sequence analysis is an important task in bioinformatics. Classifying protein sequences into groups supports further analysis of the structures and roles of a particular group of proteins in biological processes. It also allows an unknown or newly found sequence to be identified by comparing it with protein groups that have already been studied. In this paper, we present the use of the growing self-organizing map (GSOM), an extended version of the self-organizing map (SOM), in classifying protein sequences. With its dynamic structure, GSOM facilitates the discovery of knowledge in a more natural way. This study focuses on two aspects: analysis of the effect of the spread factor parameter in the GSOM on node growth, and the identification of groupings and subgroupings at different levels of abstraction by using the spread factor.
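
As a rough illustration of how the spread factor controls node growth: the GSOM literature usually derives a growth threshold from the spread factor as GT = -D × ln(SF), where D is the input dimensionality, and a node grows once its accumulated error exceeds GT. The abstract does not state this formula, so the sketch below is an assumption about the standard GSOM formulation, and the dimensionality of 20 is a hypothetical value.

```python
import math


def growth_threshold(dimensions: int, spread_factor: float) -> float:
    """Growth threshold GT = -D * ln(SF), valid for 0 < SF < 1.

    Assumption: this is the formulation commonly given for GSOM,
    not something stated in the abstract above.
    """
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    if not 0.0 < spread_factor < 1.0:
        raise ValueError("spread_factor must lie strictly between 0 and 1")
    return -dimensions * math.log(spread_factor)


# Hypothetical 20-dimensional input vectors.
for sf in (0.1, 0.5, 0.9):
    print(f"SF={sf}: GT={growth_threshold(20, sf):.3f}")
```

A larger spread factor gives a smaller threshold, so nodes grow more readily and the map spreads out further, which is what allows finer subgroupings to appear.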

    Cluster identification and separation in the growing self-organizing map: Application in protein sequence classification

    The growing self-organizing map (GSOM) was introduced as an improvement to the self-organizing map (SOM) algorithm for clustering and knowledge discovery. Unlike the traditional SOM, GSOM has a dynamic structure that allows nodes to grow, reflecting the knowledge discovered from the input data as learning progresses. The spread factor parameter (SF) in GSOM can be used to control the spread of the map, giving an analyst the flexibility to examine the clusters at different granularities. Although GSOM has been applied in various areas and has proven effective in knowledge discovery tasks, no comprehensive study has been done on the effect of the spread factor value on cluster formation and separation. Therefore, the aim of this paper is to investigate the effect of the spread factor value on cluster separation in the GSOM. We used the simple k-means algorithm to identify clusters in the GSOM. Using the Davies-Bouldin index, clusters formed at different spread factor values are obtained and the resulting clusters are analyzed. In this work, we show that clusters become more separated when the spread factor value is increased. Hierarchical clusters can then be constructed by mapping the GSOM clusters at different spread factor values. © 2009 Springer-Verlag London Limited
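
The Davies-Bouldin index mentioned above can be sketched as follows. This is a minimal pure-Python version of the standard definition (mean over clusters of the worst-case ratio of within-cluster scatter to between-centroid distance), not the paper's code; the function names and the toy clusters are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower values mean tighter, better separated clusters.
    Requires at least two clusters with distinct centroids.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical toy data: the same two clusters, far apart and close together.
separated = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
crowded = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
print(davies_bouldin(separated), davies_bouldin(crowded))
```

Comparing the index across k-means partitions obtained at different spread factor values is one way to quantify the separation effect the abstract reports.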

    Sequence- and structure-based approaches to deciphering enzyme evolution in the Haloalkanoate Dehalogenase superfamily

    Understanding how changes in functional requirements of the cell select for changes in protein sequence and structure is a fundamental challenge in molecular evolution. This dissertation delineates some of the underlying evolutionary forces using as a model system, the Haloalkanoate Dehalogenase Superfamily (HADSF). HADSF members have unique cap-core architecture with the Rossmann-fold core domain accessorized by variable cap domain insertions (delineated by length, topology, and point of insertion). To identify the boundaries of variable domain insertions in protein sequences, I have developed a comprehensive computational strategy (CapPredictor or CP) using a novel sequence alignment algorithm in conjunction with a structure-guided sequence profile. Analysis of more than 40,000 HADSF sequences led to the following observations: (i) cap-type classes exhibit similar distributions across different phyla, indicating existence of all cap-types in the last universal common ancestor, and (ii) comparative analysis of the predicted cap-type and functional diversity indicated that cap-type does not dictate the divergence of substrate recognition and chemical pathway, and hence biological function. By analyzing a unique dataset of core- and cap-domain-only protein structures, I investigated the consequences of the accessory cap domain on the sequence-structure relationship of the core domain. The relationship between sequence and structure divergence in the core fold was shown to be monotonic and independent of the corresponding cap type. However, core domains with the same cap type bore a greater similarity than the core domains with different cap types, suggesting coevolution of the cap and core domains. Remarkably, a few degrees of freedom are needed to describe the structural diversity in the Rossmann fold accounting for the majority of the observed structural variance. 
Finally, I examined the location and role of conserved residue positions and co-evolving residue pairs in the core domain in the context of the cap domain. Positions critical for function were conserved while non-conserved positions mapped to highly mobile regions. Notably, we found exponential dependence of co-variance on inter-residue distance. Collectively, these novel algorithms and analyses contribute to an improved understanding of enzyme evolution, especially in the context of the use of domain insertions to expand substrate specificity and chemical mechanism

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects the underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor.
This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution
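
One common way to avoid relying on a single partitioning is to repeat the clustering and record how often each pair of objects lands in the same cluster. The sketch below builds such a co-association matrix; it illustrates the general idea only and is not claimed to be the thesis's method. The function name and the example label vectors are hypothetical.

```python
def co_association(partitions):
    """Fraction of clustering runs in which each pair of objects co-clusters.

    `partitions` is a list of label lists, one per run, all of equal length.
    Labels only need to be consistent within a run, not across runs.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same objects")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[c / runs for c in row] for row in counts]


# Hypothetical labels from three runs over four objects.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
for row in co_association(runs):
    print(row)
```

Pairs with a fraction near 1 co-cluster consistently, while pairs that co-cluster in only some runs are candidates for chance associations caused by noise.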

    Understanding the Structural and Functional Importance of Early Folding Residues in Protein Structures

    Proteins adopt three-dimensional structures which serve as a starting point to understand protein function and their evolutionary ancestry. It is unclear how proteins fold in vivo and how this process can be recreated in silico in order to predict protein structure from sequence. Contact maps describe whether two residues are in spatial proximity, and structures can be derived from this simplified representation. Coevolution or supervised machine learning techniques can compute contact maps from sequence; however, these approaches only predict sparse subsets of the actual contact map. It is shown that the composition of these subsets substantially influences the achievable reconstruction quality because most information in a contact map is redundant. No strategy has been proposed which identifies unique contacts for which no redundant backup exists. The StructureDistiller algorithm quantifies the structural relevance of individual contacts and identifies crucial contacts in protein structures. It is demonstrated that using this information the reconstruction performance on a sparse subset of a contact map is increased by 0.4 Å, which constitutes a substantial performance gain. The set of the most relevant contacts in a map is also more resilient to false positively predicted contacts: up to 6% of false positives are compensated before reconstruction quality matches a naive selection of contacts without any false positive contacts. This information is invaluable for the training of new structure prediction methods and provides insights into how robustness and information content of contact maps can be improved. In the literature, the relevance of two types of residues for in vivo folding has been described. Early folding residues initiate the folding process, whereas highly stable residues prevent spontaneous unfolding events. The structural relevance score proposed by this thesis is employed to characterize both types of residues.
Early folding residues form pivotal secondary structure elements, but their structural relevance is average. In contrast, highly stable residues exhibit significantly increased structural relevance. This implies that residues crucial for the folding process are not relevant for structural integrity and vice versa. The position of early folding residues is preserved over the course of evolution as demonstrated for two ancient regions shared by all aminoacyl-tRNA synthetases. One arrangement of folding initiation sites resembles an ancient and widely distributed structural packing motif and captures how reverberations of the earliest periods of life can still be observed in contemporary protein structures
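
To make the contact map representation concrete, the sketch below derives one from residue coordinates. It is a minimal illustration, not the StructureDistiller algorithm: the 8 Å distance cutoff, the minimum sequence separation of 3, and the toy coordinates are all assumptions chosen for the example.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return the set of residue index pairs (i, j), i < j, in contact.

    Two residues count as in contact when their representative atoms
    (for example C-alpha) lie within `cutoff` angstroms and they are at
    least `min_separation` positions apart in the sequence.
    Both defaults are assumed values, not taken from the thesis.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


# Hypothetical hairpin-like trace with 3.8 angstrom spacing between neighbors.
trace = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
print(sorted(contact_map(trace)))
```

A sparse subset of such pairs is what a predictor would supply, and the abstract's point is that which pairs are included matters more than how many.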

    Advancing systems biology of yeast through machine learning and comparative genomics

    Synthetic biology has played a pivotal role in accomplishing the production of high value commodities, pharmaceuticals, and bulk chemicals. Fueled by the breakthrough of synthetic biology and metabolic engineering, Saccharomyces cerevisiae and various other yeasts (such as Yarrowia lipolytica, Pichia pastoris) have been proven to be promising microbial cell factories and are frequently used in scientific studies. However, the cellular metabolism and physiological properties of most yeast species have not been characterized in detail. To address these knowledge gaps, this thesis aims to leverage the large amounts of data available for yeast species and use state-of-the-art machine learning techniques and comparative genomic analysis to gain a deeper insight into yeast traits and metabolism. In this thesis, machine learning was applied to various unresolved biological problems in yeasts, i.e., gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality based on sequence features and evolutionary features. It was demonstrated that essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a high-quality deep learning model, DLKcat, was developed to predict kcat values by combining a graph neural network for substrates and a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast/fungi species, enzyme-constrained models were reconstructed and used to further elucidate the cellular metabolism on a large scale. Lastly, a random forest algorithm was adopted to analyze feature importance for protein production; it was found that post-translational modifications (PTMs) have a relatively higher impact on protein production than amino acid composition.
In comparative genomics, a comprehensive toolbox HGTphyloDetect was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on some yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolution analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms for substrate utilization were further probed across large-scale yeast species

    Quality Control Mechanisms of Molecular Chaperones in the Folding and Degradation of Client Proteins

    Molecular chaperones are essential proteins that assist in the folding of substrate ‘client’ proteins to adopt their functionally active three-dimensional structures. The process of protein folding in the cell occurs in a highly concentrated crowded cellular environment among various other macromolecules and amidst various cell stresses which result in issues of aberrant protein folding into toxic species and aggregates. Thus, to counteract these stressors, cells have evolved a complex network of chaperone proteins to maintain protein homeostasis, or proteostasis. Hsp70 is an essential molecular chaperone that acts on clients important for a wide variety of cellular functions. Hsp70 can facilitate refolding of clients to regain their function. However, it can also target client proteins to proteasomal degradation. Turnover of aberrantly folded or aggregation prone proteins such as tau implicates Hsp70 in various pathologies including neurodegenerative diseases. Another class of protein chaperones, termed ‘holdases’, act to delay protein aggregation. The small heat shock proteins (sHSP) systems possess such activity, binding to non-native conformations of clients. sHsps such as Hsp27 and αB crystallin exist as distributions of large oligomeric species that respond dynamically to pH and temperature stresses. Recent studies have demonstrated oligomeric rearrangements occur for sHsps to protect client proteins. A major outstanding question is how do these oligomeric assemblies’ complex structures sense cell stress or protein unfolding or aggregation. In addition to sensing cell stress, sHsps and holdase chaperones are also capable of bridging with the activities of other classes of chaperones, including the Hsp70 chaperone system. Hsp70 functions in concert with a network of co-chaperone proteins which diversify its protein folding capabilities. BAG3 is a nucleotide exchange factor (NEF) that facilitates the exchange of ADP and ATP in Hsp70. 
In addition, interactions of BAG3 with sHsp family chaperones have emerged, making it a promising target in elucidating the link between these two functionally distinct chaperone systems. The overall theme of my thesis work has been to characterize protein homeostasis achieved through pro-folding and pro-degradation pathways. A major focus of my thesis concerns the ability of Hsp70 to work in concert with the CHIP E3 ubiquitin ligase to target tau for polyubiquitination in a chaperone dependent manner, thus facilitating protein turnover. Another focus has been on a pro-folding function of chaperones, the so-called holdase function, where I have explored the structural rearrangements of the sHsp αB crystallin as well as another multifunctional chaperone, peroxiredoxin, and how these conformational changes and oligomeric rearrangements are triggered by external stress and correlate with activation of chaperone activity. This thesis also explores the cooperation between sHsps and Hsp70 to facilitate protein refolding, where I characterize rearrangements that occur in the Hsp27 oligomer distribution modulated by BAG3, and its implications for Hsp70 binding. One of the major techniques utilized in my thesis work is electron microscopy, obtaining structural information of protein complexes, from obtaining low resolution size distributions of sHsp oligomers to pushing resolution of Hsp70 in complex with CHIP beyond quaternary structural information to sub-nanometer resolution of the peroxiredoxin in its active chaperone form in complex with substrate. These studies serve as a foundation for future work on obtaining the structural basis of the decision process where chaperone proteins decide the fate of their client substrates.
PhD, Biological Chemistry, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144085/1/orvvdom_1.pd

    Investigation of the structure of family C GPCRs and their interaction with allosteric modulators using sequence-, structure- and ligand-based methods

    This study focuses on structural features of a particular GPCR type, the family C GPCRs. Structure- and ligand-based approaches were adopted for prediction of novel mGluR5-binding ligands and their binding modes. The objectives of this study were: 1. An analysis of the functional and structural implications of amino acids in the TM region of family C GPCRs. 2. The prediction of the TM domain structure of mGluR5. 3. The discovery of novel selective allosteric modulators of mGluR5 by virtual screening. 4. The prediction of a ligand binding mode for the allosteric binding site in mGluR5. GPCRs are a super-family of structurally related proteins although their primary amino acid sequence can be diverse. Using sequence information, a conservation analysis of family C GPCRs should be applied to reveal characteristic differences and similarities with respect to function, folding and ligand binding. Using experimental data and conservation analysis, the allosteric binding site of mGluR5 should be characterized regarding NAM and PAM and selective ligand binding. For further evaluation, experimental knowledge about family A GPCRs as well as conservation between vertebrate rhodopsins was planned to be compared to results obtained for family C GPCRs (Section 4.1 Conservation analysis of family C GPCRs). Since no receptor structure is available for any family C GPCR, discussion of conserved sequence positions between family A and C GPCRs requires the prediction of a receptor structure for mGluR5 using a family A receptor as template. In order to predict the mGluR5 structure, a sequence alignment to a GPCR template protein will have to be proposed and GPCR specific features considered in structure calculation (Section 4.1.4 Structure prediction of mGluR5). The obtained structure was intended to be used in ligand binding mode prediction of newly discovered active molecules.
For discovery of novel selective mGluR modulators several ligand-based virtual screening protocols were adapted and evaluated. Prediction models were derived for selection of possibly active molecules using a diverse collection of known mGluR binding ligands. For that purpose a data collection of known mGluR binding ligands should be established and this reference collection analyzed with respect to different ligand activity classes, NAM or PAM and selective modulators. The prediction of novel NAMs and PAMs using several combinations of 2D-, 3D-, pharmacophore or molecule shape encoding methods with machine learning techniques and similarity determining methods should be tested in a prospective manner (Section 4.2 Virtual screening for novel mGluR modulators). In collaboration with Merz Pharmaceuticals (Merz GmbH & Co. KGaA, Frankfurt am Main, Germany) the modulating effect of a few hundred molecules should be tested in a functional cell-based assay. With the objective to predict a binding mode of the discovered active molecules, molecule docking should be applied using the allosteric binding site of the modeled mGluR5 structure (Section 4.2.4 Modeling of binding modes). Predicted ligand binding modes are to be correlated to conservation profiles that had resulted from the sequence-based entropy analysis and information from mutation experiments, and shall be compared to known ligand binding poses from crystal structures of family A GPCRs.
In this work, concepts for elucidating structural and functional properties of family C G-protein-coupled receptors (GPCRs) were developed and applied. Questions concerning the functional mechanism of GPCRs were investigated using various bioinformatics and cheminformatics methods guided by experimental results.
In the course of the work, conserved regions of receptor families A and C were compared on the basis of available experimental data from mutation and ligand-binding studies. The conservation analysis relied on calculating the Shannon entropy and was performed on a multiple sequence alignment of the transmembrane domains of 96 different family C GPCRs. Conserved regions were interpreted with the help of experimental data and used in particular to define regions of the allosteric binding pocket with respect to selectivity. With the aim of finding new selective allosteric modulators of the metabotropic glutamate receptor type 5 (mGluR5), several ligand-based approaches for the virtual prediction of molecular activity were developed and tested. The strategy was based on knowledge of already known ligands, whose structures and activity values could be used to build prediction models. The prospective prediction relied on different similarity calculation methods and types of molecular encoding. The molecules were tested for their modulatory effect on mGluR5. The assay measured changes in the Ca2+ level in the cell. To determine selectivity, mGluR5-binding modulators were additionally tested on mGluR1. In total, 8 of 228 tested molecules showed activity below 10 μM, among them one positive allosteric modulator. Of the remaining seven negative modulators (NAMs), five were selective for mGluR5. All identified NAMs were examined by molecular docking for possible interactions with the transmembrane domain of mGluR5. The binding hypothesis corresponded to a superposition of the molecules found and their possible interaction points.
Using mGluR5 as an example, the suitability of a modeled GPCR structure for generating hypotheses about ligand binding and structural relationships could thus be examined.
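
The Shannon entropy used for the conservation analysis can be computed per alignment column as H = -Σ p(a) log2 p(a) over the residue frequencies p(a) in that column. The sketch below is a minimal version under stated assumptions: gaps are treated as an ordinary symbol, no sequence weighting is applied, and the four-sequence alignment is hypothetical rather than taken from the 96-sequence alignment of the thesis.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Shannon entropy in bits for each column of an aligned sequence set.

    0.0 means a fully conserved column; higher values mean more variation.
    Assumption: gap characters are counted like any other symbol.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        entropies.append(
            sum(-(c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


# Hypothetical toy alignment with conserved, split, and variable columns.
toy = ["ACD", "ACE", "AGF", "AGH"]
print(column_entropies(toy))
```

Low-entropy columns mark conserved positions, which is the kind of profile the abstract correlates with mutation data and predicted binding modes.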