1,198 research outputs found

Large-scale prediction of long disordered regions in proteins using random forests

Authors: Feng Zhi-Ping; Han Pengfei; Norton Raymond S; Zhang Xiuzhen
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Many proteins contain disordered regions that lack fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Prediction of disordered regions in protein sequences is important for understanding protein function and in high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used in such predictions. Predictors designed for long disordered regions are usually less successful in predicting short disordered regions. Combining prediction of short and long disordered regions will dramatically increase the complexity of the prediction algorithm and make the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies. Results: A new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model is proposed in this paper. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross-validation tests, IUPforest-L achieves an area of 89.5% under the receiver operating characteristic (ROC) curve. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes. Conclusion: The random forest model based on the auto-correlation functions of the AAIs within a protein fragment and other physicochemical features could effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed to batch predict long disordered regions in proteins, and the server can be accessed from http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php
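
As a rough illustration of the feature type named in the abstract above, the following Python sketch computes a normalised Moreau-Broto autocorrelation of one amino acid index along a sequence. The Kyte-Doolittle hydropathy scale, the function name `moreau_broto` and the choice of lags are assumptions made for this example only; the abstract does not state which indices, lags or window sizes IUPforest-L uses.

```python
# Kyte-Doolittle hydropathy values, used here only as an example amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(sequence, index, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one amino acid index.

    AC(d) is the mean of P(i) * P(i + d) over all residue pairs that are
    d positions apart, where P(i) is the index value of residue i.
    """
    values = [index[residue] for residue in sequence]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))  # normalise by the number of pairs
    return features


if __name__ == "__main__":
    # Hypothetical fragment; a real predictor would slide a window over a protein.
    print(moreau_broto("MKTAYIAKQR", KYTE_DOOLITTLE, max_lag=3))
```

Each fragment then yields a fixed-length vector (one value per lag and per index) that a classifier such as a random forest could consume.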

Crossref

Springer - Publisher Connector

PubMed Central

RMIT Research Repository

University of Melbourne Institutional Repository

Identification of RNA Binding Proteins and RNA Binding Residues Using Effective Machine Learning Techniques

Author: Khanal Reecha
Publication venue: ScholarWorks@UNO
Publication date: 01/04/2019

Identification and annotation of RNA Binding Proteins (RBPs) and RNA-binding residues from sequence information alone is one of the most challenging problems in computational biology. RBPs play crucial roles in several fundamental biological functions, including transcriptional regulation of RNAs, RNA metabolism and splicing. Existing experimental techniques are time-consuming and costly. Thus, efficient computational identification of RBPs directly from the sequence can be useful to annotate RBPs and assist experimental design. Here, we introduce AIRBP, a computational sequence-based method, which utilizes features extracted from evolutionary information, physicochemical properties, and disordered properties to train a machine learning method designed using stacking, an advanced machine learning technique, for effective prediction of RBPs. Furthermore, it makes use of efficient machine learning algorithms like Support Vector Machine, Logistic Regression, K-Nearest Neighbor and XGBoost (Extreme Gradient Boosting Algorithm). In this research work, we also propose another predictor for efficient annotation of RNA-binding residues. This residue predictor also uses stacking and evolutionary algorithms for efficient annotation of RBPs and RNA-binding residues. The RNA-binding residue predictor also utilizes various evolutionary, physicochemical and disordered properties to train a robust model. This thesis presents a possible solution to the RBP and RNA-binding residue prediction problem through two independent predictors, both of which outperform existing state-of-the-art approaches

University of New Orleans

The accurate prediction of disordered regions in protein sequences using machine learning approaches

Author: Han P
Publication venue: RMIT University
Publication date: 01/01/2011

A major challenge in the post-genome era is to determine the function of proteins. The traditional structure-function paradigm assumes that the function of a protein is contingent on it folding into a stable three-dimensional structure. However, many proteins contain intrinsically unstructured or Disordered Regions (DRs) under physiological conditions, and yet they still carry important functions. Determination of the disordered regions in proteins is therefore an important step towards the determination of their functions. Traditional experimental approaches are generally time-consuming and expensive. The efficient and cost-effective computer-aided automatic prediction of DRs is thus an attractive alternative. To this end, we propose the novel application of machine learning models and physicochemical features extracted from protein sequences for predicting long, short and global disorder in proteins. To improve the understandability of disorder prediction, rule-based predictors are proposed, which are not only able to predict DRs, but can also quantify previously unknown associations between order/disorder status and sequences. The prediction process is transparent and simple to explain. As DRs of different lengths possess different properties, to achieve a high accuracy of prediction, we propose predictors specific to long, short and global disorder prediction. These predictors are distinct from each other in terms of their features, the machine learning models used, and the methods of prediction. We thoroughly investigate the database of physicochemical properties of amino acid indices and select the indices most correlated with disorder. Based on these properties, novel feature transforms including autocorrelation and wavelet transforms (WTs) are applied to DR prediction.
According to the results of cross-validation tests, our long DR predictor based on autocorrelation achieves the highest accuracy of prediction among long DR predictors at an AUC (Area Under ROC Curve) value of 89.5%. A short DR predictor based on WTs achieves an AUC value of 88.7%, which is comparable to the most accurate short DR predictors. The global DR predictor achieves an AUC value of 96.1%, close to the optimal value. A major bottleneck of large scale DR prediction is the time efficiency constraint that is attributed to slow feature generation stages and complicated prediction methods. Both our long and short DR predictors are built from simple methods of prediction and feature space. Our web service for long DR prediction can process an uploaded file of multiple sequences

RMIT Research Repository

Machine Learning based Protein Sequence to (un)Structure Mapping and Interaction Prediction

Author: Iqbal Sumaiya
Publication venue: ScholarWorks@UNO
Publication date: 09/08/2017

Proteins are the fundamental macromolecules within a cell that carry out most of the biological functions. The computational study of protein structure and its functions, using machine learning and data analytics, is elemental in advancing life-science research, given the fast-growing volume of biological data and the extensive complexity involved in analysing it to discover meaningful insights. Mapping of a protein’s primary sequence is not limited to its structure; we extend it to its disordered component, known as Intrinsically Disordered Proteins or Regions (IDPs/IDRs), and hence to the dynamics involved, which help explain complex interactions within a cell that are otherwise obscured. The objective of this dissertation is to develop effective machine learning based tools to predict disordered proteins, their properties and dynamics, and their interaction paradigm by systematically mining and analyzing large-scale biological data. In this dissertation, we propose a robust framework to predict disordered proteins given only sequence information, using an optimized SVM with RBF kernel. Through appropriate reasoning, we highlight the structure-like behavior of IDPs in disease-associated complexes. Further, we develop a fast and effective predictor of Accessible Surface Area (ASA) of protein residues, a useful structural property that defines a protein’s exposure to partners, using regularized regression with a 3rd-degree polynomial kernel function and a genetic algorithm. As a key outcome of this research, we then introduce a novel method to extract position specific energy (PSEE) of protein residues by modeling the pairwise thermodynamic interactions and hydrophobic effect. PSEE is found to be an effective feature in identifying the enthalpy-gain of the folded state of a protein and otherwise the neutral state of the unstructured proteins.
Moreover, we study the peptide-protein transient interactions that involve the induced folding of short peptides through disorder-to-order conformational changes to bind to an appropriate partner. A suite of predictors is developed, using the stacked generalization ensemble technique, to identify from protein sequence the residue patterns of Peptide-Recognition Domains that can recognize and bind to peptide motifs and to phospho-peptides with post-translational modifications (PTMs) of amino acids, which are responsible for critical human diseases. The biologically relevant case studies involved demonstrate possibilities of discovering new knowledge using the developed tools

University of New Orleans

A novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting relative thermostability of protein mutants

Authors: Fang Jianwen; Li Yunqi; Middaugh C Russell
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The ability to design thermostable proteins is theoretically important and practically useful. Robust and accurate algorithms, however, remain elusive. One critical problem is the lack of reliable methods to estimate the relative thermostability of possible mutants. Results: We report a novel scoring function for discriminating hyperthermophilic and mesophilic proteins with application to predicting the relative thermostability of protein mutants. The scoring function was developed based on an elaborate analysis of a set of features calculated or predicted from 540 pairs of hyperthermophilic and mesophilic protein ortholog sequences. It was constructed by a linear combination of ten important features identified by a feature ranking procedure based on the random forest classification algorithm. The weights of these features in the scoring function were fitted by a hill-climbing algorithm. This scoring function has shown an excellent ability to discriminate hyperthermophilic from mesophilic sequences. The prediction accuracies reached 98.9% and 97.3% in discriminating orthologous pairs in training and the holdout testing datasets, respectively. Moreover, the scoring function can distinguish non-homologous sequences with an accuracy of 88.4%. Additional blind tests using two datasets of experimentally investigated mutations demonstrated that the scoring function can be used to predict the relative thermostability of proteins and their mutants at very high accuracies (92.9% and 94.4%). We also developed an amino acid substitution preference matrix between mesophilic and hyperthermophilic proteins, which may be useful in designing more thermostable proteins. Conclusions: We have presented a novel scoring function which can distinguish not only HP/MP ortholog pairs, but also non-homologous pairs at high accuracies. Most importantly, it can be used to accurately predict the relative stability of proteins and their mutants, as demonstrated in two blind tests. In addition, the residue substitution preference matrix assembled in this study may reflect the thermal adaptation induced substitution biases. A web server implementing the scoring function and the dataset used in this study are freely available at http://www.abl.ku.edu/thermorank/.
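
The abstract above describes a linear combination of features whose weights are fitted by hill climbing. The Python sketch below shows one simple way such a fit could look. It is not the authors' procedure: the coordinate-wise search, the step size, the function names and the three toy feature pairs are all assumptions invented for illustration.

```python
def score(weights, features):
    """Linear scoring function: weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, features))


def pair_accuracy(weights, pairs):
    """Fraction of (hyperthermophilic, mesophilic) pairs ranked correctly."""
    correct = sum(
        1 for hyper, meso in pairs if score(weights, hyper) > score(weights, meso)
    )
    return correct / len(pairs)


def hill_climb(pairs, n_features, step=0.1, max_rounds=100):
    """Greedy coordinate-wise hill climbing on pair accuracy (an assumed variant)."""
    weights = [0.0] * n_features
    best = pair_accuracy(weights, pairs)
    for _ in range(max_rounds):
        improved = False
        for i in range(n_features):
            for delta in (step, -step):
                candidate = list(weights)
                candidate[i] += delta
                accuracy = pair_accuracy(candidate, pairs)
                if accuracy > best:  # keep only strict improvements
                    weights, best, improved = candidate, accuracy, True
        if not improved:
            break  # local optimum reached
    return weights, best


# Made-up feature vectors (two features per sequence), for demonstration only.
TOY_PAIRS = [
    ((0.9, 0.2), (0.4, 0.3)),
    ((0.8, 0.5), (0.5, 0.1)),
    ((0.7, 0.1), (0.6, 0.4)),
]

if __name__ == "__main__":
    fitted_weights, accuracy = hill_climb(TOY_PAIRS, n_features=2)
    print(fitted_weights, accuracy)
```

Hill climbing of this kind only finds a local optimum, so a practical fit would likely use random restarts or a held-out set, as the training/holdout split in the abstract suggests.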

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

KU ScholarWorks

PubMed Central

Predicting Function and Structure using Bioinformatics Protocols: Study of the Intracellular Regions of the Jagged and Delta Protein Families

Author: Ivanova Neli
Publication date: 06/07/2007

The type I membrane-spanning proteins Jagged (Jagged-1 and -2) and Delta (Delta-1, -3 and -4) are the human ligands of Notch receptors, which mediate key signaling events in cell differentiation and morphogenesis. The Jagged and Delta proteins are composed of a relatively large extracellular region and of a 100-150 residue, yet uncharacterized cytoplasmic tail, which has been recently found to be important in Notch bi-directional signaling. We applied bioinformatics methods to analyze the intracellular region of human Notch ligands, and to predict their structural and functional properties. We searched databases for orthologues, and found that while the intracellular region is evolutionarily well conserved within the same ligand type, a wide variability is observed in different ligands. No significant similarity was found between the intracellular region of Jagged and Delta and proteins of known 3D structure. Globularity and disorder predictions indeed suggest that these regions are largely unstructured. However, secondary structure predictions show that these regions have some propensity to form local secondary structure elements. Functional predictions based on pattern recognition imply that the specificity in the Notch machinery response might be related to specific post-translational modifications and binding motifs in the ligand cytoplasmic tail, rather than to specific interactions between the receptors and the extracellular region of the ligands. We also speculate that, given the unusual amino acid composition, the cytoplasmic tail of Jagged and Delta might be involved in zinc binding

Open Research Online (The Open University)

Globular and disordered - the non-identical twins in protein-protein interactions

Authors: Kragelund Birthe Brandt; Olsen Johan Gotthardt; Teilum Kaare
Publication venue: Frontiers Media SA
Publication date: 01/01/2015

In biology, proteins from different structural classes interact across and within classes in ways that are optimized to achieve balanced functional outputs. The interactions between intrinsically disordered proteins (IDPs) and other proteins rely on changes in flexibility, and this is seen as a strong determinant for their function. This has fostered the notion that IDPs bind with low affinity but high specificity. Here we have analyzed available detailed thermodynamic data for protein-protein interactions to test whether the thermodynamic profiles of IDP interactions differ from those of other protein-protein interactions. We find that ordered and disordered proteins act as non-identical twins operating by similar principles, but that complexes of disordered proteins are on average less stable by 2.5 kcal mol⁻¹

Directory of Open Access Journals

Copenhagen University Research Information System

Frontiers - Publisher Connector

PubMed Central

How codon choice determines evolvability and evolutionary robustness in short linear motifs

Author: Gunnarsson Peter Alexander
Publication venue: University of Cambridge
Publication date: 07/10/2019

Short linear motifs, made up of 2-10 amino acids in linear sequence space, are a central component of cellular decision making through proteins. They form a modular system in cells where combinations of domains and motifs are used as basic functional building blocks through interactions. Functions mediated through these motifs include cellular localisation, post-translational modifications, degradation and general protein-protein interactions. Since motifs are made up of a small number of amino acids they have unusual evolutionary properties, for instance they can evolve de novo, or be lost, through a small number of substitutions. This is of particular importance in pathogens such as viruses. Many viruses evolve new host-like motifs to interact with the host and change the regulation and signalling landscape within host cells to mediate infection. In this body of work, I have used influenza as a model to elucidate aspects of the evolutionary properties of motifs. I have been able to leverage recent progress made in determining nucleotide mutation rates and have developed a model for motif evolution that is defined from the nucleotide and codon levels. Simulations using this methodology suggested that different codons have varying propensities to evolve into amino acids within a linear motif. In other words, some sequences have higher motif evolvability. The simulations also indicated a fitness benefit to use some codons over others to encode linear motifs, due to the varying propensity to evolve. These findings suggest that motifs that are encoded by specific codons have higher motif evolutionary robustness, i.e. they can tolerate more mutations without affecting function. I went on to investigate if these predicted properties have played a role in motif evolution in influenza. I found that conserved motifs in influenza use the codons inferred to have higher evolutionary robustness. 
This would lead to increased fitness, as motifs are less often lost through mutations. I also found that this mutational robustness acts on stop codon usage in influenza, suggesting an explanation for an old observation of predominant use of TAA in many organisms. Interestingly, it also appears that evolutionary robustness of a motif can be varied to tune the rate of motif change, which influenza utilises in glycosylation motifs that interface with the host immune system. Finally, I investigated whether the codon choice and evolvability at early stages of viral host shifts could be used to predict the emergence of functional motifs. I have found that motif evolvability can aid the prediction of motif emergence. For influenza strains H1N1 and H3N2, which were introduced into the human population from birds during the 1900s, the sequence of the early strains could be used to predict the majority of the glycosylation sites that would emerge in the following decades. The predictability of motif emergence could have important implications for vaccination efforts. The methodologies developed here, and the observations made about how motif evolution is shaped by codon choices in a predictable way, will be important for a better understanding of the evolution of complexity and regulation involving motifs. This may have implications for complex diseases such as cancers, and for our understanding of the evolution of pathogen innovations and functionality
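
To make the idea of codon-level mutational robustness in the abstract above concrete, the Python sketch below counts, for a given codon, how many of its nine single-nucleotide substitutions still encode the same product under the standard genetic code. This is a deliberately simplified, unweighted count: the thesis describes a model built on nucleotide mutation rates, which this sketch does not attempt to reproduce, and the function name `same_product_neighbours` is hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def same_product_neighbours(codon):
    """Count single-nucleotide substitutions that leave the encoded product unchanged."""
    target = CODON_TABLE[codon]
    count = 0
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue  # not a substitution
            mutant = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[mutant] == target:
                count += 1
    return count


if __name__ == "__main__":
    for stop_codon in ("TAA", "TAG", "TGA"):
        print(stop_codon, same_product_neighbours(stop_codon))
```

Under this unweighted count, TAA remains a stop codon after two of its nine possible substitutions (to TAG or TGA), whereas TAG and TGA each remain a stop codon after only one (to TAA), which is in line with the robustness argument for TAA sketched in the abstract.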

Apollo (Cambridge)