27 research outputs found

    Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection

    <div><p>Protein remote homology detection is one of the most important problems in bioinformatics. Discriminative methods such as support vector machines (SVM) have shown superior performance. However, the performance of SVM-based methods depends on the vector representations of the protein sequences. Prior works have demonstrated that sequence-order effects are relevant for discrimination, but little work has explored how to incorporate the sequence-order information along with the amino acid physicochemical properties into the prediction. In order to incorporate the sequence-order effects into protein remote homology detection, the physicochemical distance transformation (PDT) method is proposed. Each protein sequence is converted into a series of numbers by using the physicochemical property scores in the amino acid index (AAIndex), and then the sequence is converted into a fixed-length vector by PDT. The sequence-order information can be efficiently included in the feature vector with little computational cost by this approach. Finally, the feature vectors are input into a support vector machine classifier to detect the protein remote homologies. Our experiments on a well-known benchmark show that the proposed method, SVM-PDT, achieves performance superior or comparable to current state-of-the-art methods, and its computational cost is considerably lower than that of other methods. When the evolutionary information extracted from the frequency profiles is combined with the PDT method, the profile-based PDT approach can improve the performance by 3.4% and 11.4% in terms of ROC score and ROC50 score, respectively. The local sequence-order information of the protein can be efficiently captured by the proposed PDT, and the physicochemical properties extracted from the amino acid index are incorporated into the prediction. The physicochemical distance transformation provides a general framework, which would be a valuable tool for protein-level study.</p></div>
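The abstract describes PDT only in words, so the following is a minimal sketch rather than the authors' implementation. It assumes that the transformation is the mean squared difference between the property scores of residues separated by a given distance, computed for every property scale and every distance up to a chosen maximum. The names `pdt_features`, `max_distance` and `TOY_SCALE` are hypothetical, and the scale values are placeholders, not real AAIndex entries.

```python
from typing import Dict, List, Sequence


def pdt_features(
    sequence: str,
    scales: Sequence[Dict[str, float]],
    max_distance: int,
) -> List[float]:
    """Map a sequence of any length to len(scales) * max_distance numbers.

    Assumption: each feature is the mean squared difference between the
    property scores of residues that are `distance` positions apart.
    """
    if max_distance < 1 or max_distance >= len(sequence):
        raise ValueError("max_distance must be between 1 and len(sequence) - 1")

    features: List[float] = []
    for scale in scales:
        # Step 1: convert the residues into a series of property scores.
        values = [scale[residue] for residue in sequence]
        # Step 2: one feature per distance, averaged over all residue pairs.
        for distance in range(1, max_distance + 1):
            pair_count = len(values) - distance
            squared = sum(
                (values[j] - values[j + distance]) ** 2 for j in range(pair_count)
            )
            features.append(squared / pair_count)
    return features


# Placeholder scale with made-up values, used only to show the mechanics.
TOY_SCALE = {"A": 1.0, "C": 2.0, "D": 4.0}

vector = pdt_features("ACDA", [TOY_SCALE], max_distance=2)
print(vector)
```

Because the output length depends only on the number of scales and on `max_distance`, sequences of different lengths yield vectors of the same size, which is what a classifier such as an SVM needs as input.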

    The influence of the number of randomly chopped amino acids within the beginning 20 amino acids from the N-terminus of the target proteins on the performance.


    The average ROC scores of the profile-based PDT approach with different values of β.


    The flowchart of generating the profile-based protein sequences.

    <p>The multiple sequence alignment is obtained by PSI-BLAST. The frequency profile is calculated from the multiple sequence alignment. For each column in the frequency profile, the amino acids are sorted in descending order according to their frequencies, and then the profile-based sequences are obtained by combining the n-th most frequent amino acids.</p>
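The sorting and combining steps in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the frequency profile is taken as already computed (the PSI-BLAST step is not reproduced), each column is represented as a dictionary from amino acid to frequency, and ties are broken alphabetically. The function name `profile_based_sequences` and the example profile are hypothetical.

```python
from typing import Dict, List


def profile_based_sequences(
    profile: List[Dict[str, float]], top_n: int
) -> List[str]:
    """Build top_n sequences; the n-th one joins the n-th most frequent
    amino acid of every profile column."""
    ranked_columns = []
    for column in profile:
        if len(column) < top_n:
            raise ValueError("every column needs at least top_n amino acids")
        # Sort by descending frequency; alphabetical tie-break is an assumption.
        ranked = sorted(column, key=lambda aa: (-column[aa], aa))
        ranked_columns.append(ranked)

    return [
        "".join(ranked[rank] for ranked in ranked_columns)
        for rank in range(top_n)
    ]


# Made-up three-column profile, used only to show the mechanics.
example_profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"K": 0.7, "R": 0.2, "E": 0.1},
]

sequences = profile_based_sequences(example_profile, top_n=3)
print(sequences)
```

The first sequence returned collects the most frequent amino acid of each column, the second collects the second most frequent, and so on.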

    The average ROC scores of the sequence-based PDT approach with different <i>β</i> values on SCOP 1.53 dataset.


    Ordered list of discriminative features of SVM-PDT.

    <p>List of 10 most discriminative features of four selected families for SVM-PDT. The features are sorted in descending order according to their absolute discriminative weight. For the detailed information of each index shown in this table, please refer to <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046633#pone.0046633.s001" target="_blank">Text S1</a>.</p>

    Comparison against the profile-based methods.

    *<p>The results of HHsearch are obtained by in-house implementation of the hhsuite package.</p>

    iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition

    <div><p>Playing crucial roles in various cellular processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression, DNA-binding proteins are essential ingredients for both eukaryotic and prokaryotic proteomes. With the avalanche of protein sequences generated in the postgenomic age, it is a critical challenge to develop automated methods for accurately and rapidly identifying DNA-binding proteins based on their sequence information alone. Here, a novel predictor, called “iDNA-Prot|dis”, was established by incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) vector. The former can capture the characteristics of DNA-binding proteins so as to enhance its prediction quality, while the latter can reduce the dimension of the PseAAC vector so as to speed up its prediction process. It was observed by the rigorous jackknife and independent dataset tests that the new predictor outperformed the existing predictors for the same purpose. As a user-friendly web-server, iDNA-Prot|dis is accessible to the public at <a href="http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/" target="_blank">http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/</a>. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step protocol guide is provided on how to use the web-server to get their desired results without the need to follow the complicated mathematical equations that are presented in this paper just for the integrity of its developing process. It is anticipated that the iDNA-Prot|dis predictor may become a useful high-throughput tool for large-scale analysis of DNA-binding proteins, or at the very least, play a complementary role to the existing predictors in this regard.</p></div>

    An illustration for discriminant visualization and interpretation.

    <p>(A) The discriminative power of the 400 amino acid pairs. Each element in this figure represents the sum score of the features with positive discriminant weights for a specific distance amino acid pair with <i>cp(20)</i>. The amino acids are identified by their one-letter code. The amino acids labelled on the horizontal axis and vertical axis indicate the first amino acid and the second amino acid in the pairs, respectively. The adjacent colour bar shows the mapping of sum score values. (B) The different discriminant weights of distance amino acid pairs R-R. There are three kinds of features with positive discriminative power for amino acid pair R-R, including RR, R*R, and R**R with distance 1, 2, 3, respectively. (C) The occurrence distribution of RR, R*R, and R**R in the sequence of protein 1HLVA. The three features occur ten times in total, shown as red dots. The two DNA-binding regions (sequence positions 28–48 and 97–129) are shown in yellow. (D) The distribution of RR in the three-dimensional structure of 1HLVA. Only one RR occurs outside of the two DNA-binding regions, which is shown as a red square. (E) The distribution of R*R and R**R in the three-dimensional structure of 1HLVA.</p>
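The RR, R*R and R**R features in panels (B) and (C) are pairs of residues separated by a distance of 1, 2 and 3, with `*` standing for any residue. A minimal sketch of how such occurrences could be located in a sequence is given below. The function name `distance_pair_positions` and the toy sequence are hypothetical; the sequence is not that of 1HLVA, and no weighting or normalisation from the paper is reproduced.

```python
from typing import List


def distance_pair_positions(
    sequence: str, first: str, second: str, distance: int
) -> List[int]:
    """Return the 1-based positions of `first` wherever `second` follows
    exactly `distance` residues later (distance 1 means adjacent)."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return [
        index + 1
        for index in range(len(sequence) - distance)
        if sequence[index] == first and sequence[index + distance] == second
    ]


# Toy sequence, used only to show the mechanics.
toy_sequence = "RRARR"

rr = distance_pair_positions(toy_sequence, "R", "R", 1)      # RR
r_x_r = distance_pair_positions(toy_sequence, "R", "R", 2)   # R*R
r_xx_r = distance_pair_positions(toy_sequence, "R", "R", 3)  # R**R

total_occurrences = len(rr) + len(r_x_r) + len(r_xx_r)
print(rr, r_x_r, r_xx_r, total_occurrences)
```

Comparing the returned positions with the boundaries of known binding regions gives the kind of occurrence distribution that panel (C) plots.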