9 research outputs found

    ATPbind: Accurate Protein–ATP Binding Site Prediction by Combining Sequence-Profiling and Structure-Based Comparisons

    Protein–ATP interactions are ubiquitous in a wide variety of biological processes. Correctly locating ATP binding sites from protein information is an important but challenging task for protein function annotation and drug discovery. However, no single method optimally identifies ATP binding sites across different proteins. In this study, we report a new composite predictor, ATPbind, for ATP binding sites that integrates the outputs of two template-based predictors (i.e., S-SITE and TM-SITE) and three discriminative sequence-driven features of proteins: position-specific scoring matrix, predicted secondary structure, and predicted solvent accessibility. In ATPbind, we assembled multiple support vector machines (SVMs) based on a random undersampling technique to cope with the severe imbalance between the numbers of ATP binding sites and non-ATP binding sites. We also constructed a new gold-standard benchmark data set consisting of 429 ATP binding proteins from the PDB database to evaluate and compare the proposed ATPbind with other existing predictors. Starting from a query sequence and predicted I-TASSER models, ATPbind can achieve an average accuracy of 72%, covering 62% of all ATP binding sites while achieving a Matthews correlation coefficient value that is significantly higher than that of other state-of-the-art predictors.
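The random undersampling idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the ATPbind implementation: the function names `undersampled_subsets` and `average_scores`, the 1:1 class ratio, and the simple mean used to combine scores are all assumptions made for illustration.

```python
import random

def undersampled_subsets(labels, n_subsets, seed=0):
    """Build index subsets for an ensemble (hypothetical helper).

    Each subset keeps every positive sample (label 1) and adds an
    equally sized random draw of negative samples (label 0), so that
    each base classifier trains on balanced data.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(neg) < len(pos):
        raise ValueError("expected negatives to be the majority class")
    subsets = []
    for _ in range(n_subsets):
        sampled_neg = rng.sample(neg, len(pos))  # draw without replacement
        subsets.append(sorted(pos + sampled_neg))
    return subsets

def average_scores(score_lists):
    """Combine per-residue scores from several classifiers by a plain mean
    (an assumed combination rule, chosen only for simplicity)."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]
```

Each subset would be used to train one base classifier (an SVM in the abstract's description), and the per-residue scores of all classifiers would then be combined into a final prediction.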

    A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction

    Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, in which binding residues are far fewer than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
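The abstract does not describe how the supervised over-sampling algorithm generates its samples, so the sketch below shows a different, well-known technique instead: SMOTE-style interpolation between a minority sample and its nearest minority neighbour. The function name `smote_like` is hypothetical, and using only the single nearest neighbour is a simplification of standard SMOTE, which picks among k nearest neighbours.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Synthesize n_new minority samples by linear interpolation.

    For each new sample, pick a random minority point, find its nearest
    other minority point (squared Euclidean distance), and return a
    random point on the segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        neighbor = min(
            (m for j, m in enumerate(minority) if j != i),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(x, neighbor)])
    return synthetic
```

The synthetic points would be appended to the minority class before training, reducing the ratio between non-binding and binding residues seen by the classifier.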

    Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm

    Identifying DNA-binding proteins (DBPs) helps uncover information embedded in DNA–protein interactions, which is significant for understanding the mechanisms of DNA replication, transcription, and repair. Although existing computational methods for predicting DBPs from protein sequences have achieved great success, there is still room for improvement because these methods do not fully mine the sequence-order information. In this study, a new three-part sequence-order feature extraction (called TPSO) strategy is developed to extract more discriminative information from protein sequences for predicting DBPs. For each query protein, TPSO first divides its primary sequence features into N- and C-terminal fragments and then extracts the numerical pseudo features of three parts: the full sequence and these two fragments. Based on TPSO, a novel deep learning-based method, called TPSO-DBP, is proposed, which employs sequence-based single-view features together with bidirectional long short-term memory (BiLSTM) and fully connected (FC) neural networks to learn the DBP prediction model. Empirical outcomes reveal that TPSO-DBP can achieve an accuracy of 87.01%, covering 85.30% of all DBPs, while achieving a Matthews correlation coefficient value (0.741) that is significantly higher than that of most existing state-of-the-art DBP prediction methods. Detailed data analyses indicate that the advantages of TPSO-DBP lie in the use of TPSO, which helps extract more concealed prominent patterns, and in the deep neural network framework composed of BiLSTM and FC, which learns the nonlinear relationships between input features and DBPs. The standalone package and web server of TPSO-DBP are freely available at https://jun-csbio.github.io/TPSO-DBP/
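The three-part idea (full sequence plus N- and C-terminal fragments) can be sketched as follows. This is an illustration under stated assumptions, not the TPSO method itself: the split point (the sequence midpoint) and the per-part feature (plain amino acid composition rather than the paper's pseudo features) are both stand-ins, and the names `composition` and `three_part_features` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(seq):
    """Fraction of each standard amino acid in seq (zeros if seq is empty)."""
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def three_part_features(seq):
    """Concatenate features of the full sequence, the N-terminal fragment,
    and the C-terminal fragment (assumed split at the midpoint)."""
    mid = len(seq) // 2
    n_term, c_term = seq[:mid], seq[mid:]
    return composition(seq) + composition(n_term) + composition(c_term)
```

For a sequence of any length this yields a fixed-size vector of 60 values (3 parts × 20 residues), which is the kind of input a downstream classifier could consume.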

    Performance comparisons of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation under Balanced Evaluation.

    Performance comparisons between SOS and ROS, SMOTE, and ADASYN for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation.

    Compositions of the two benchmark datasets.

    * In each 2-tuple (numP, numN), numP and numN are the numbers of positive (binding residue) and negative (non-binding residue) samples, respectively; △ Ratio = numN/numP.

    Performance comparisons between the proposed TargetSOS and other popular predictors for the independent validation dataset of NUC5.

    * Data excerpted from [14] (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0107676#pone.0107676-Chen1).

    TransEFVP: A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion

    Studying the effect of single amino acid variations (SAVs) on protein structure and function is integral to advancing our understanding of molecular processes, evolutionary biology, and disease mechanisms. Screening for deleterious variants is one of the crucial issues in precision medicine. Here, we propose a novel computational approach, TransEFVP, based on large-scale protein language model embeddings and a transformer-based neural network to predict disease-associated SAVs. The model adopts a two-stage architecture: in the first stage, a transformer encoder fuses different feature embeddings; in the second stage, after dimensionality reduction, a support vector machine model quantifies the pathogenicity of SAVs. On blind test data, TransEFVP achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic curve of 0.871, higher than those of existing state-of-the-art methods. The benchmark results demonstrate that TransEFVP can be explored as an accurate and effective SAV pathogenicity prediction method. The data and codes for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master for academic use.
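Several of the abstracts listed here report a Matthews correlation coefficient (MCC) and an F1-score. As a reminder of how these are computed from a binary confusion matrix, here is a minimal sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is a common convention assumed here, not something taken from the papers.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if there are no positives
    among either the predictions or the labels."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced problems such as binding-residue prediction.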