406 research outputs found

    Machine Learning based Protein Sequence to (un)Structure Mapping and Interaction Prediction

    Proteins are the fundamental macromolecules within a cell that carry out most biological functions. The computational study of protein structure and function, using machine learning and data analytics, is essential to advancing life-science research, given the fast-growing volume of biological data and the complexity of analyzing it for meaningful insights. Mapping a protein's primary sequence is not limited to its structure: we extend the mapping to its disordered component, known as Intrinsically Disordered Proteins or Regions (IDPs/IDRs), and hence to the dynamics involved, which help explain complex interactions within a cell that are otherwise obscured. The objective of this dissertation is to develop effective machine learning tools to predict disordered proteins, their properties and dynamics, and their interaction paradigm by systematically mining and analyzing large-scale biological data. We propose a robust framework to predict disordered proteins from sequence information alone, using an optimized SVM with an RBF kernel. Through appropriate reasoning, we highlight the structure-like behavior of IDPs in disease-associated complexes. Further, we develop a fast and effective predictor of the Accessible Surface Area (ASA) of protein residues, a useful structural property that defines a protein's exposure to partners, using regularized regression with a 3rd-degree polynomial kernel function and a genetic algorithm. As a key outcome of this research, we then introduce a novel method to extract the position specific energy (PSEE) of protein residues by modeling the pairwise thermodynamic interactions and the hydrophobic effect. PSEE is found to be an effective feature for identifying the enthalpy gain of the folded state of a protein and, otherwise, the neutral state of unstructured proteins.
    Moreover, we study the transient peptide-protein interactions that involve the induced folding of short peptides, which undergo disorder-to-order conformational changes to bind to an appropriate partner. Using the stacked generalization ensemble technique, we develop a suite of predictors that identify, from protein sequence, the residue patterns of Peptide-Recognition Domains that can recognize and bind to peptide motifs and to phospho-peptides with post-translational modifications (PTMs) of amino acids, responsible for critical human diseases. The biologically relevant case studies demonstrate the possibility of discovering new knowledge using the developed tools.

    Predicting MoRFs in protein sequences using HMM profiles

    Background: Intrinsically Disordered Proteins (IDPs) lack an ordered three-dimensional structure and are enriched in various biological processes. Molecular Recognition Features (MoRFs) are functional regions within IDPs that undergo a disorder-to-order transition on binding to a partner protein. Identifying MoRFs in IDPs using computational methods is a challenging task. Methods: In this study, we introduce hidden Markov model (HMM) profiles to accurately identify the location of MoRFs in disordered protein sequences. Using a windowing technique, HMM profiles are used to extract features from protein sequences, and support vector machines (SVMs) are used to calculate a propensity score for each residue. Two different SVM kernels with high noise tolerance are evaluated with varying window sizes, and the scores of the SVM models are combined to generate the final propensity score used to predict MoRF residues. The SVM models are designed to extract maximal information between MoRF residues, their neighboring regions (Flanks) and the remainder of the sequence (Others). Results: To evaluate the proposed method, its performance was compared with that of two other MoRF predictors, MoRFpred and ANCHOR. The results show that the proposed method outperforms these two predictors. Conclusions: Using HMM profiles as a source for feature extraction, the proposed method improves the prediction of MoRFs in disordered protein sequences.
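
    The windowing step in the Methods above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `window_features`, the zero-padding at the sequence ends, and the toy two-column profile are all assumptions made for illustration. Each residue is represented by the flattened profile rows of a centred window, which is the kind of vector a per-residue classifier such as an SVM could then score.

```python
from typing import List, Sequence


def window_features(profile: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Build one feature vector per residue from a centred sliding window.

    `profile` holds one row of profile scores per residue (hypothetical input).
    Positions that fall outside the sequence are zero-padded (an assumption).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    width = len(profile[0]) if profile else 0
    pad = [0.0] * width
    features = []
    for i in range(len(profile)):
        row: List[float] = []
        # Concatenate the profile rows of residues i-half .. i+half.
        for j in range(i - half, i + half + 1):
            row.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(row)
    return features


# Toy profile: 3 residues, 2 score columns each.
toy_profile = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_features = window_features(toy_profile, window=3)
```

    With a window of 3, every residue yields a vector of 3 x 2 = 6 values; varying `window` changes how much sequence context each vector carries.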

    Prediction of DNA-Binding Proteins and their Binding Sites

    DNA-binding proteins play an important role in essential biological processes such as DNA replication, recombination, repair, gene transcription, and expression. Identifying DNA-binding proteins and the residues involved in the contacts is important for understanding the DNA-binding mechanism in proteins. Moreover, it has been reported in the literature that mutations of some DNA-binding residues are associated with disease. Identifying these proteins and their binding mechanisms generally requires experimental techniques, which makes large-scale study extremely difficult. Thus, predicting DNA-binding proteins and their binding sites from sequence alone is one of the most challenging problems in the field of genome annotation. Since the start of the Human Genome Project, many attempts have been made to solve the problem with different approaches, but the accuracy of these methods is still not sufficient for large-scale annotation of proteins. Rather than relying solely on existing machine learning techniques, I sought to combine them using a novel "stacking technique" and used problem-specific architectures to solve the problem with better accuracy than existing methods. This thesis presents a possible solution to the DNA-binding protein prediction problem that performs better than the state-of-the-art approaches.
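
    The abstract does not say which base learners or meta-learner the thesis stacks, so the following is only a generic sketch of stacked generalization under stated assumptions. A simple nearest-centroid classifier (a stand-in chosen here, not taken from the thesis) serves as both base and meta learner; `fit_centroid`, `fit_stack`, and the toy data are hypothetical names and values. The key step is that the meta-learner is trained on out-of-fold predictions of the base learners rather than on predictions for data they have already seen.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def fit_centroid(X, y, cols) -> Model:
    """Nearest-centroid classifier on the given feature columns (stand-in learner).

    Assumes both classes (0 and 1) are present in y.
    """
    def centroid(label):
        rows = [[x[c] for c in cols] for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos, neg = centroid(1), centroid(0)

    def predict(x):
        d_pos = sum((x[c] - p) ** 2 for c, p in zip(cols, pos))
        d_neg = sum((x[c] - n) ** 2 for c, n in zip(cols, neg))
        return 1.0 if d_pos < d_neg else 0.0

    return predict


def fit_stack(X, y, col_sets: List[List[int]], k: int = 2) -> Model:
    """Train one base learner per column set, then a meta-learner on their
    out-of-fold predictions (folds are assigned by index modulo k)."""
    n = len(X)
    meta_X = [[0.0] * len(col_sets) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        for m, cols in enumerate(col_sets):
            model = fit_centroid([X[i] for i in train], [y[i] for i in train], cols)
            for i in held:
                meta_X[i][m] = model(X[i])  # prediction for an unseen sample
    # Refit base learners on all data for use at prediction time.
    base = [fit_centroid(X, y, cols) for cols in col_sets]
    meta = fit_centroid(meta_X, y, list(range(len(col_sets))))
    return lambda x: meta([b(x) for b in base])


# Toy data: label 0 clusters near (0, 0), label 1 near (1, 1).
X = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0],
     [0.2, 0.1], [0.1, 0.2], [0.8, 0.9], [1.0, 1.0]]
y = [0, 0, 1, 1, 0, 0, 1, 1]
stacked = fit_stack(X, y, col_sets=[[0], [1]])
```

    Calling `stacked([0.95, 0.9])` returns the meta-learner's label for a new sample. In practice the base learners would be different model families rather than the same classifier on different columns.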

    Molecular Phylogeny of OVOL Genes Illustrates a Conserved C2H2 Zinc Finger Domain Coupled by Hypervariable Unstructured Regions

    OVO-like proteins (OVOL) are members of the zinc finger protein family and serve as transcription factors that regulate gene expression in various differentiation processes. Recent studies have shown that OVOL genes are involved in epithelial development and differentiation in a wide variety of organisms, yet comprehensive studies that describe OVOL proteins from an evolutionary perspective are lacking. Using comparative genomic analysis, we traced three different OVOL genes (OVOL1-3) in vertebrates. One gene, OVOL3, was duplicated during a whole-genome-duplication event in fish, but only one copy (OVOL3b) was retained. From early-branching metazoa to humans, we found that a core domain, comprising a tetrad of C2H2 zinc fingers, is conserved. By domain comparison of the OVOL proteins, we found that they evolved in different metazoan lineages by attaching intrinsically disordered (ID) segments, as N/C-terminal extensions of 100 to 1000 amino acids, to this conserved core. These ID regions originated independently across different animal lineages, giving rise to different types of OVOL genes over the course of metazoan evolution. We illustrated the molecular evolution of metazoan OVOL genes over a period of 700 million years (MY). This study both extends our current understanding of the structure/function relationship of metazoan OVOL genes and provides a good platform for further characterization of OVOL genes from diverged organisms.

    Computational Investigations of Backbone Dynamics in Intrinsically Disordered Proteins

    Intrinsically disordered proteins (IDPs), due to their dynamic nature, play important roles in molecular recognition, signalling, regulation, and the binding of nucleic acids. IDPs have been extensively studied computationally in terms of binary disorder/order classification. This approach has proven fruitful and has enabled researchers to estimate the amount of disorder in prokaryotic and eukaryotic genomes. Other computational methods, such as molecular dynamics or other simulation techniques, require a starting structure. However, there are no approaches that give insight into the behaviour of disordered ensembles from sequence alone. Such a method would facilitate the study of proteins of unknown structure, help to obtain a better classification of disordered regions, and support the design of disorder-to-order transitions. In this work, I develop FRAGFOLD-IDP, a method to address this issue. Using FRAGFOLD, a fragment-based structure prediction approach, I generate ensembles of IDPs and show that the features extracted from them correspond well with the backbone dynamics of NMR ensembles deposited in the PDB. FRAGFOLD-IDP predictions significantly improve over a naïve approach and give better insight into the dynamics of disordered ensembles. The results also show that it is not necessary to predict the correct fold of the protein to reliably assign per-residue fluctuations to the sequence in question. This suggests that disorder is a local property that does not depend on the protein fold. Next, I validate FRAGFOLD-IDP on the disorder classification task and show that the method performs comparably to machine learning-based approaches designed specifically for this task. I also find that FRAGFOLD-IDP produces results on par with DynaMine, a machine learning approach for predicting NMR order parameters, and that the results of the two methods are not correlated.
    Thus, I constructed a consensus neural network predictor, which takes the results of FRAGFOLD-IDP and DynaMine together with physicochemical features to predict per-residue fluctuations, improving upon both input methods.