
    BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features

    Background: Understanding how biomolecules interact is a major task of systems biology. To model protein-nucleic acid interactions, it is important to identify the DNA or RNA-binding residues in proteins. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA or RNA-binding site prediction. However, PSSM is designed primarily for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA or RNA-binding sites in protein sequences. Results: In the present study, several new descriptors of evolutionary information have been developed and evaluated for sequence-based prediction of DNA and RNA-binding residues using support vector machines (SVMs). The new descriptors were shown to improve classifier performance. Interestingly, the best classifiers were obtained by combining the new descriptors and PSSM, suggesting that they captured different aspects of evolutionary information for DNA and RNA-binding site prediction. The SVM classifiers achieved 77.3% sensitivity and 79.3% specificity for prediction of DNA-binding residues, and 71.6% sensitivity and 78.7% specificity for RNA-binding site prediction. Conclusions: Predictions at this level of accuracy may provide useful information for modelling protein-nucleic acid interactions in systems biology studies. We have thus developed a web-based tool called BindN+ (http://bioinfo.ggc.org/bindn+/) to make the SVM classifiers accessible to the research community.
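As a rough illustration of the two metrics reported in this abstract, the sketch below computes sensitivity and specificity from per-residue binary labels (1 = binding, 0 = non-binding). The function name and the toy labels are assumptions made for illustration; they are not taken from BindN+.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary per-residue labels.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Hypothetical helper, not part of any published tool.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy example: 4 binding and 4 non-binding residues (made-up labels).
true_labels = [1, 1, 1, 0, 0, 0, 0, 1]
pred_labels = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(true_labels, pred_labels)
```

Reporting both values matters for binding-site data because non-binding residues usually far outnumber binding ones, so a single accuracy figure can be misleading.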

    ESM-NBR: fast and accurate nucleic acid-binding residue prediction via protein language model feature representation and multi-task learning

    Protein-nucleic acid interactions play a very important role in a variety of biological activities. Accurate identification of nucleic acid-binding residues is a critical step in understanding the interaction mechanisms. Although many computational methods have been developed to predict nucleic acid-binding residues, challenges remain. In this study, a fast and accurate sequence-based method, called ESM-NBR, is proposed. In ESM-NBR, we first use the large protein language model ESM2 to extract a discriminative feature representation of biological properties from protein primary sequences; then, a multi-task deep learning model composed of stacked bidirectional long short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks is employed to explore common and private information of DNA- and RNA-binding residues with ESM2 features as input. Experimental results on benchmark data sets demonstrate that the prediction performance of the ESM2 feature representation comprehensively outperforms that of evolutionary information-based hidden Markov model (HMM) features. Meanwhile, ESM-NBR obtains MCC values for DNA-binding residue prediction of 0.427 and 0.391 on two independent test sets, which are 18.61% and 10.45% higher than those of the second-best methods, respectively. Moreover, by completely discarding the time-consuming multiple sequence alignment process, the prediction speed of ESM-NBR far exceeds that of existing methods (5.52 s for a protein sequence of length 500, which is about 16 times faster than the second-fastest method). A user-friendly standalone package and the data of ESM-NBR are freely available for academic use at: https://github.com/wwzll123/ESM-NBR
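The MCC figures quoted in this abstract refer to the Matthews correlation coefficient. A minimal sketch of the standard definition follows, assuming binary per-residue predictions summarised as confusion-matrix counts; the function name and the example counts are hypothetical and do not come from the ESM-NBR package.

```python
import math


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Standard MCC from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up counts for illustration only.
mcc = matthews_corrcoef_from_counts(tp=3, tn=3, fp=1, fn=1)
```

MCC ranges from -1 to 1 and stays informative on imbalanced data, which is one reason it is often reported for binding-residue prediction.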

    Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network

    BACKGROUND: Protein-carbohydrate interactions are crucial in many biological processes, with implications for drug targeting and gene expression. The nature of protein-carbohydrate interactions may be studied at the individual residue level by analyzing local sequence and structure environments in binding regions in comparison to non-binding regions, which provide an inherent control for such analyses. With the ultimate aim of predicting binding sites from sequence and structure, overall statistics of binding regions need to be compiled. Sequence-based predictions of binding sites have been successfully applied to DNA-binding proteins in our earlier works. We aim to apply a similar analysis to carbohydrate-binding proteins. However, because a much smaller region of these proteins takes part in such interactions, the methodology and results are significantly different. A comparison of protein-carbohydrate complexes has also been made with other protein-ligand complexes. RESULTS: We have compiled statistics of amino acid compositions in binding versus non-binding regions, in general as well as in each secondary structure conformation. Binding propensities of each of the 20 residue types and their structure features such as solvent accessibility, packing density and secondary structure have been calculated to assess their predisposition to carbohydrate interactions. Finally, evolutionary profiles of amino acid sequences have been used to predict binding sites using a neural network. Another set of neural networks was trained using information from single sequences, and the prediction performance from the evolutionary profiles and single sequences was compared. The best neural network-based prediction achieved 87% sensitivity at 23% specificity for all carbohydrate-binding sites, using evolutionary information. Single sequences gave 68% sensitivity and 55% specificity for the same data set. Sensitivity and specificity for a limited galactose-binding data set were 63% and 79%, respectively, for evolutionary information and 62% and 68% for single sequences. Propensity and other sequence and structural features of carbohydrate-binding sites have also been compared with our similar extensive studies on DNA-binding proteins and also with protein-ligand complexes. CONCLUSION: Carbohydrates typically show a preference to bind aromatic residues, most prominently tryptophan. Higher exposed surface area of binding sites indicates a role of hydrophobic interactions. Neural networks give moderate prediction success, which is expected to improve when structures of more protein-carbohydrate complexes become available in future.
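To make the idea of a residue binding propensity concrete, the sketch below uses one common definition: the fraction of binding-site residues that are of a given type, divided by that type's fraction among all residues. This is an assumption for illustration and may differ from the exact formula used in the study; the sequence and binding flags are made up.

```python
from collections import Counter


def binding_propensities(sequence, is_binding):
    """Propensity per residue type = (share among binding residues) / (share overall).

    `sequence` is a string of one-letter amino acid codes and `is_binding`
    a same-length list of 0/1 flags. Hypothetical helper for illustration.
    """
    if len(sequence) != len(is_binding):
        raise ValueError("sequence and is_binding must have equal length")
    overall = Counter(sequence)
    binding = Counter(aa for aa, flag in zip(sequence, is_binding) if flag)
    n_total = len(sequence)
    n_binding = sum(binding.values())
    if n_binding == 0:
        return {aa: 0.0 for aa in overall}
    return {
        aa: (binding[aa] / n_binding) / (overall[aa] / n_total)
        for aa in overall
    }


# Toy data: tryptophan (W) is over-represented among the binding residues.
props = binding_propensities("WAWAAAAA", [1, 1, 0, 0, 0, 0, 0, 0])
```

Under this definition a value above 1 means the residue type is enriched in binding regions relative to its background frequency, and a value below 1 means it is depleted.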

    Kernel-based machine learning protocol for predicting DNA-binding proteins

    DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs in which the classifier is a support vector machine (SVM). Information used for classification is derived from characteristics that include surface and overall composition, overall charge and positive potential patches on the protein surface. In total, 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In self-consistency testing, an accuracy of 100% has been achieved. For cross-validation (CV) optimization over the entire dataset, we report an accuracy of 90%. Using leave 1-pair holdout evaluation, an accuracy of 86.3% has been achieved. When we restrict the dataset to less than 20% sequence identity amongst the proteins, the holdout accuracy is 85.8%. Furthermore, seven DNA-BPs with unbounded structures are all correctly predicted. The current performances are better than results published previously. The higher accuracy achieved here originates from two factors: the ability of the SVM to handle features that demonstrate a wide range of discriminatory power and a different definition of the positive patch. Since our protocol does not rely on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to the known ones.

    Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins

    A method to detect DNA-binding sites on the surface of a protein structure is important for functional annotation. This work describes the analysis of residue patches on the surface of DNA-binding proteins and the development of a method of predicting DNA-binding sites using a single feature of these surface patches. Surface patches and the DNA-binding sites were initially analysed for accessibility, electrostatic potential, residue propensity, hydrophobicity and residue conservation. From this, it was observed that the DNA-binding sites were, in general, amongst the top 10% of patches with the largest positive electrostatic scores. This knowledge led to the development of a prediction method in which patches of surface residues were selected such that they excluded residues with negative electrostatic scores. This method was used to make predictions for a data set of 56 non-homologous DNA-binding proteins. Correct predictions were made for 68% of the data set.
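The observation that DNA-binding sites tend to fall among the top 10% of patches by positive electrostatic score can be sketched as a simple ranking step. The code below is only an illustration of that ranking idea, not the authors' prediction method; the function name, the patch names and the scores are all hypothetical.

```python
def top_positive_patches(patch_scores, fraction=0.10):
    """Return patch names ranked in the top `fraction` by electrostatic score.

    Only patches with a strictly positive score are returned. At least one
    patch is considered, even for very small inputs.
    """
    ranked = sorted(patch_scores, key=patch_scores.get, reverse=True)
    n_keep = max(1, round(fraction * len(ranked)))
    return [name for name in ranked[:n_keep] if patch_scores[name] > 0]


# Made-up electrostatic scores for ten surface patches.
scores = {
    "p0": -3.0, "p1": 1.2, "p2": 4.5, "p3": 0.3, "p4": -0.7,
    "p5": 2.2, "p6": 5.1, "p7": -1.0, "p8": 0.0, "p9": 3.3,
}
candidates = top_positive_patches(scores)
```

In practice the scores would come from an electrostatics calculation on the protein structure; here they are plain numbers so the ranking logic can be read in isolation.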

    Modelling the structure of full-length Epstein-Barr virus nuclear antigen 1

    Epstein-Barr virus (EBV) is a clinically important human virus associated with several cancers and is the etiologic agent of infectious mononucleosis. The viral nuclear antigen-1 (EBNA1) is central to the replication and propagation of the viral genome and likely contributes to tumourigenesis. We have compared EBNA1 homologues from other primate lymphocryptoviruses (LCV) and found that the central glycine/alanine repeat (GAr) domain, as well as predicted cellular protein (USP7 and CK2) binding sites, are present in homologues in the Old World primates, but not the marmoset, suggesting that these motifs may have co-evolved. Using the resolved structure of the C-terminal one third of EBNA1 (homodimerisation and DNA binding domain), we have gone on to develop monomeric and dimeric models in silico of the full length protein. The C-terminal domain is predicted to be structurally highly similar between homologues, indicating conserved function. Zinc could be stably incorporated into the model, bonding with two N-terminal cysteines predicted to facilitate multimerisation. The GAr contains secondary structural elements in the models, while the protein binding regions are unstructured, irrespective of the prediction approach used and sequence origin. These intrinsically disordered regions may facilitate the diversity observed in partner interactions. We hypothesise that the structured GAr could mask the disordered regions, thereby protecting the protein from default degradation. In the dimer conformation, the C-terminal tails of each monomer wrap around a proline-rich protruding loop of the partner monomer, providing dimer stability, a feature which could be exploited in therapeutic design.

    Structure and function prediction of human homologue hABH5 of _E. coli_ ALKB5 using in silico approach

    Newly discovered human homologues of ALKB protein have shown the activity of DNA damaging drugs, used for cancer therapy. Little is known about the structure and function of hABH5, one of the members of this superfamily. Therefore, in the present study we aim to predict its structure and function using various bioinformatics tools. Modeling was done with MODELLER 9v7 to predict the 3D structure of the hABH5 protein. A 3D model of hABH5, ALKBH5.B99990005.pdb, was predicted and evaluated. Validation results showed 96.8% of residues in the favoured and additionally allowed regions of the Ramachandran plot. Ligand binding residue prediction showed four ligand clusters, with 25 ligands in cluster 1. Importantly, a conserved pattern of Pro158-X-Asp160-Xn-His266 in the functional domain was detected. DNA and RNA binding sites were also predicted in the model. The predicted and validated model of human homologue hABH5 resulting from this study may unveil the mechanism of DNA damage repair in humans and accelerate research on designing appropriate inhibitors, aiding in chemotherapy and cancer related diseases.