18 research outputs found

    Efficacy of different protein descriptors in predicting protein functional families

    DOI: 10.1186/1471-2105-8-300, BMC Bioinformatics 8

    Enzyme classification with peptide programs: a comparative study

    Abstract
    Background: Efficient and accurate prediction of protein function from sequence is one of the standing problems in biology. The generalised use of sequence alignments for inferring function promotes the propagation of errors, and there are limits to its applicability. Several machine learning methods have been applied to predict protein function, but they lose much of the information encoded by protein sequences because they need to transform them to obtain data of fixed length.
    Results: We have developed a machine learning methodology, called peptide programs (PPs), to deal directly with protein sequences and compared its performance with that of Support Vector Machines (SVMs) and BLAST in detailed enzyme classification tasks. Overall, the PPs and SVMs performed similarly in terms of Matthews Correlation Coefficient, but the PPs generally had higher precision. BLAST performed better overall than both methodologies, but the PPs had better results than BLAST and SVMs for the smaller datasets.
    Conclusion: The higher precision of the PPs in comparison to the SVMs suggests that dealing with sequences is advantageous for detailed protein classification, as precision is essential to avoid annotation errors. The fact that the PPs performed better than BLAST for the smaller datasets demonstrates the potential of the methodology, but the drop in performance observed for the larger datasets indicates that further development is required. Possible strategies to address this issue include partitioning the datasets into smaller subsets and training individual PPs for each subset, or training several PPs for each dataset and combining them using a bagging strategy.
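The abstract above compares classifiers by Matthews Correlation Coefficient and precision. As a minimal sketch of how these two metrics are computed from binary confusion-matrix counts (the function names are illustrative and not taken from the paper):

```python
import math


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are truly positive."""
    predicted_positive = tp + fp
    # Convention assumed here: return 0.0 when nothing was predicted positive.
    return tp / predicted_positive if predicted_positive else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    0 corresponds to chance-level prediction.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is defined as 0 when any marginal is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Precision ignores false negatives entirely, whereas MCC uses all four counts, which is why two classifiers can have a similar MCC and still differ in precision.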

    Predicting Bevirimat resistance of HIV-1 from genotype

    Abstract
    Background: Maturation inhibitors are a new class of antiretroviral drugs. Bevirimat (BVM) was the first substance in this class of inhibitors entering clinical trials. While the inhibitory function of BVM is well established, the molecular mechanisms of action and resistance are not well understood. It is known that mutations in the regions CS p24/p2 and p2 can cause phenotypic resistance to BVM. We have investigated a set of p24/p2 sequences of HIV-1 of known phenotypic resistance to BVM to test whether BVM resistance can be predicted from sequence, and to identify possible molecular mechanisms of BVM resistance in HIV-1.
    Results: We used artificial neural networks and random forests with different descriptors for the prediction of BVM resistance. Random forests with hydrophobicity as descriptor performed best and classified the sequences with an area under the Receiver Operating Characteristics (ROC) curve of 0.93 ± 0.001. For the collected data we find that p2 sequence positions 369 to 376 have the highest impact on resistance, with positions 370 and 372 being particularly important. These findings are in partial agreement with other recent studies. Apart from the complex machine learning models we derived a number of simple rules that predict BVM resistance from sequence with surprising accuracy. According to computational predictions based on the data set used, cleavage sites are usually not shifted by resistance mutations. However, we found that resistance mutations could shorten and weaken the α-helix in p2, which hints at a possible resistance mechanism.
    Conclusions: We found that BVM resistance of HIV-1 can be predicted well from the sequence of the p2 peptide, which may prove useful for personalized therapy if maturation inhibitors reach clinical practice. Results of secondary structure analysis are compatible with a possible route to BVM resistance in which mutations weaken a six-helix bundle discovered in recent experiments, and thus ease Gag cleavage by the retroviral protease.

    Machine learning on normalized protein sequences

    Abstract
    Background: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.
    Findings: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.
    Conclusions: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
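To illustrate the idea of normalizing sequences to uniform length, here is a minimal sketch of linear interpolation applied to a numeric per-residue profile. The abstract does not state how residues are encoded; using the Kyte-Doolittle hydropathy scale is an assumption made for this example, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (assumed encoding, not confirmed by the abstract).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(seq: str) -> list[float]:
    """Map each residue to its hydropathy value."""
    try:
        return [KYTE_DOOLITTLE[ch] for ch in seq.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown residue: {exc.args[0]!r}") from None


def interpolate_linear(values: list[float], target_len: int) -> list[float]:
    """Resample `values` to `target_len` points by linear interpolation."""
    n = len(values)
    if n == 0 or target_len < 1:
        raise ValueError("need a non-empty profile and target_len >= 1")
    if n == 1 or target_len == 1:
        # Degenerate case: repeat the first value.
        return [float(values[0])] * target_len
    step = (n - 1) / (target_len - 1)  # spacing of sample points on the source axis
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), n - 2)      # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo                # relative distance from the left neighbour
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def normalize_sequence(seq: str, target_len: int) -> list[float]:
    """Encode a sequence and stretch or compress it to a fixed length."""
    return interpolate_linear(hydrophobicity_profile(seq), target_len)
```

Every sequence, whatever its original length, then yields a feature vector of `target_len` values that a fixed-input classifier such as a random forest can consume.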

    Machine learning based prediction of esterases' promiscuity

    Enzymes are of great interest to most industries; however, characterizing them in the laboratory is costly and very laborious, which has driven the development of technologies for predicting enzyme activities. Even so, industrial enzymes must have very specific properties, such as high specificity, high activity under non-biological conditions and high promiscuity, and these characteristics are not well covered by current prediction tools. For this reason, this project attempts to mitigate the problem by creating binary classifiers that can predict the promiscuity of esterases.
    Enzymes are of great interest for a wide variety of industries; however, their experimental characterization is very time-consuming and expensive. Moreover, industrial enzymes need to adapt to non-biological conditions while maintaining high activity, promiscuity and stereo-selectivity, properties that are not currently well covered by prediction technologies, which means that their characterization still relies solely on experimentation. This project aims to mitigate the problem by developing binary classifiers and multi-classifiers that can predict the promiscuity of esterases, one of the many industrially relevant enzyme families.

    Prediction of lung tumor types based on protein attributes by machine learning algorithms


    The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

    One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in the biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added-value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted to sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. A total of 365 FAE-related sequences of fungal, bacterial and plant origin were collected and clustered into distinct groups using Self Organizing Maps followed by k-means clustering, based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequences. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs.
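The clustering described above relies on amino acid composition descriptors derived from each sequence. A minimal sketch of such a descriptor, assuming the standard 20-letter alphabet and a fixed alphabetical ordering (both are choices made for this example, not details from the abstract):

```python
from collections import Counter

# The 20 standard amino acids in one-letter code; the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq: str) -> list[float]:
    """Return the fraction of each standard amino acid in `seq`.

    Non-standard symbols (e.g. X or gap characters) are ignored, so the
    20 fractions always sum to 1 over the recognised residues.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Each sequence becomes a 20-dimensional vector regardless of its length, which is the kind of fixed-size input that clustering methods and Support Vector Machines expect.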

    Application of Hybrid Functional Groups to Predict ATP Binding Proteins
