ATPbind: Accurate Protein–ATP Binding Site Prediction by Combining Sequence-Profiling and Structure-Based Comparisons
Protein–ATP interactions are ubiquitous in a wide variety
of biological processes. Correctly locating ATP binding sites from
protein information is an important but challenging task for protein
function annotation and drug discovery. However, there is no method
that can optimally identify ATP binding sites for different proteins.
In this study, we report a new composite predictor, ATPbind, for ATP
binding sites by integrating the outputs of two template-based predictors
(i.e., S-SITE and TM-SITE) and three discriminative sequence-driven
features of proteins: position specific scoring matrix, predicted
secondary structure, and predicted solvent accessibility. In ATPbind,
we assembled multiple support vector machines (SVMs) based on a random
undersampling technique to cope with the severe imbalance between
the numbers of ATP binding sites and non-ATP binding sites.
We also constructed a new gold-standard benchmark data set consisting
of 429 ATP binding proteins from the PDB database to evaluate and
compare the proposed ATPbind with other existing predictors. Starting
from a query sequence and predicted I-TASSER models, ATPbind can achieve
an average accuracy of 72%, covering 62% of all ATP binding sites
while achieving a Matthews correlation coefficient value that is significantly
higher than that of other state-of-the-art predictors.
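The random-undersampling ensemble described above can be sketched as follows. This is a minimal illustration, not the ATPbind implementation: the function names, the toy data, and the number of subsets are assumptions, and a simple nearest-centroid scorer stands in for the SVM base learners so that the example needs only the standard library.

```python
import random

def undersampled_subsets(positives, negatives, n_subsets, seed=0):
    """Build n_subsets balanced training sets by random undersampling.

    Each subset keeps every positive (minority) sample and draws an
    equally sized random sample, without replacement, from the negatives.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return [(list(positives), rng.sample(negatives, k)) for _ in range(n_subsets)]

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_stand_in(pos, neg):
    """Stand-in base learner (NOT an SVM): a nearest-centroid scorer."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        # A positive score means x is closer to the positive centroid.
        return d_neg - d_pos

    return score

def ensemble_score(models, x):
    """Average the scores of all base learners."""
    return sum(m(x) for m in models) / len(models)

# Toy data (hypothetical): 3 binding-site vectors versus 12 non-binding vectors.
positives = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
negatives = [[0.1 * (i % 4), 0.05 * i] for i in range(12)]

subsets = undersampled_subsets(positives, negatives, n_subsets=5)
models = [train_stand_in(pos, neg) for pos, neg in subsets]
print(ensemble_score(models, [0.95, 0.85]) > 0)  # query close to the positives
```

Because every base learner sees a balanced subset, none of them is dominated by the majority class, while averaging over several subsets recovers some of the information discarded by each individual undersampling step.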
A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
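The abstract does not say how the supervised over-sampling algorithm synthesizes its samples, so the sketch below does not reproduce it. It instead shows generic SMOTE-style interpolation, a common over-sampling baseline, to illustrate what "synthesizing additional minority class samples" can look like. The function name, the parameter `k`, and the toy data are assumptions made for illustration.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Synthesize n_new minority samples by SMOTE-style interpolation.

    For each new sample, pick a random minority point, pick one of its k
    nearest minority neighbours, and interpolate between the two.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Toy minority class (hypothetical feature vectors of binding residues).
binding = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9], [0.85, 0.75]]
new_samples = smote_like_oversample(binding, n_new=6)
print(len(binding) + len(new_samples))  # size of the enlarged minority class
```

Every synthetic point lies on a segment between two real minority samples, so the enlarged class stays inside the region already occupied by the minority data.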
Improving DNA-Binding Protein Prediction Using Three-Part Sequence-Order Feature Extraction and a Deep Neural Network Algorithm
Identification of DNA-binding proteins (DBPs) helps uncover information
embedded in the DNA–protein interaction, which is significant
to understanding the mechanisms of DNA replication, transcription,
and repair. Although existing computational methods for predicting
the DBPs based on protein sequences have obtained great success, there
is still room for improvement since the sequence-order information
is not fully mined in these methods. In this study, a new three-part
sequence-order feature extraction (called TPSO) strategy is developed
to extract more discriminative information from protein sequences
for predicting the DBPs. For each query protein, TPSO first divides
its primary sequence features into N- and C-terminal fragments and
then extracts the numerical pseudo features of three parts including
the full sequence and these two fragments, respectively. Based on
TPSO, a novel deep learning-based method, called TPSO-DBP, is proposed,
which employs the sequence-based single-view features, the bidirectional
long short-term memory (BiLSTM) and fully connected (FC) neural networks
to learn the DBP prediction model. Empirical outcomes reveal that
TPSO-DBP can achieve an accuracy of 87.01%, covering 85.30% of all
DBPs, while achieving a Matthews correlation coefficient value
(0.741) that is significantly higher than that of most existing state-of-the-art
DBP prediction methods. Detailed data analyses have indicated that
the advantages of TPSO-DBP lie in the utilization of TPSO, which helps
extract more concealed prominent patterns, and the deep neural network
framework composed of BiLSTM and FC that learns the nonlinear relationships
between input features and DBPs. The standalone package and web server
of TPSO-DBP are freely available at https://jun-csbio.github.io/TPSO-DBP/.
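The three-part idea behind TPSO (features from the full sequence plus its N- and C-terminal fragments) can be sketched as below. The abstract does not give the split point or define the "numerical pseudo features", so this sketch assumes a split at the midpoint and substitutes plain amino-acid composition for the actual features; `three_part_features` and its `split` parameter are hypothetical names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino-acid composition of seq."""
    n = len(seq) or 1  # avoid division by zero on empty fragments
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

def three_part_features(seq, split=0.5):
    """Concatenate features of the full sequence and its two fragments.

    `split` is a hypothetical parameter: the fraction of residues
    assigned to the N-terminal fragment.
    """
    cut = int(len(seq) * split)
    n_term, c_term = seq[:cut], seq[cut:]
    return composition(seq) + composition(n_term) + composition(c_term)

# Toy query sequence (hypothetical).
features = three_part_features("MKRAAAKKRLEEGGWWYY")
print(len(features))  # 3 parts x 20 amino acids
```

Keeping the fragment features alongside the full-sequence features lets a downstream model see where along the chain a residue type is enriched, which a single whole-sequence composition vector cannot express.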
Performance comparisons of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation under Balanced Evaluation.
Performance comparisons between SOS and ROS, SMOTE, and ADASYN for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation.
Compositions of the two benchmark datasets.
* The figures numP and numN in the 2-tuple (numP, numN) are the numbers of positive (binding residue) and negative (non-binding residue) samples, respectively; △ Ratio = numN/numP.
ROC curves of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation.
(a) ROC curves for ATP168; (b) ROC curves for ATP227.
Performance comparisons between the proposed TargetSOS and other popular predictors for the independent validation dataset of NUC5.
* Data excerpted from [14] (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0107676#pone.0107676-Chen1).
TransEFVP: A Two-Stage Approach for the Prediction of Human Pathogenic Variants Based on Protein Sequence Embedding Fusion
Studying the effect of single amino
acid variations (SAVs) on protein structure and function is integral
to advancing our understanding of molecular processes, evolutionary
biology, and disease mechanisms. Screening for deleterious variants
is one of the crucial issues in precision medicine. Here, we propose
a novel computational approach, TransEFVP, based on large-scale protein
language model embeddings and a transformer-based neural network to
predict disease-associated SAVs. The model adopts a two-stage architecture:
the first stage is designed to fuse different feature embeddings through
a transformer encoder. In the second stage, a support vector machine
model is employed to quantify the pathogenicity of SAVs after dimensionality
reduction. The prediction performance of TransEFVP on blind test data
achieves a Matthews correlation coefficient of 0.751, an F1-score of 0.846, and an area under the receiver operating characteristic
curve of 0.871, higher than those of existing state-of-the-art methods.
The benchmark results demonstrate that TransEFVP can be explored as
an accurate and effective SAV pathogenicity prediction method. The
data and code for TransEFVP are available at https://github.com/yzh9607/TransEFVP/tree/master
for academic use.
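For readers unfamiliar with the three metrics quoted above, the following sketch computes the Matthews correlation coefficient, the F1-score, and the area under the ROC curve from scratch. The labels, scores, and the 0.5 decision threshold are toy values chosen for illustration and are unrelated to the TransEFVP results.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the sample count,
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(round(mcc(tp, fp, tn, fn), 3), round(f1_score(tp, fp, fn), 3), round(roc_auc(y_true, scores), 3))
```

MCC and F1 depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the three are usually reported together.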