13 research outputs found

    EFICAz²: enzyme function inference by a combined approach enhanced by machine learning

    ©2009 Arakaki et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/10/107 (doi:10.1186/1471-2105-10-107).
    Background: We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) or heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment.
    Results: We have developed two new EFICAz components, analogous to the two FDR-based components, where the discrimination between homo- and heterofunctional members is based on the evaluation, via Support Vector Machine (SVM) models, of all the aligned positions between the query sequence and the multiple sequence alignments associated with the enzyme families. Benchmark results indicate that: i) the new SVM-based components outperform their FDR-based counterparts, and ii) both SVM-based and FDR-based components generate unique predictions. We developed classification tree models to optimally combine the results from the six EFICAz components into a final EC number prediction. The new implementation of our approach, EFICAz², exhibits a highly improved prediction precision at MTTSI < 30% compared to the original EFICAz, with only a slight decrease in prediction recall. A comparative analysis of enzyme function annotation of the human proteome by EFICAz² and KEGG shows that: i) when both sources make EC number assignments for the same protein sequence, the assignments tend to be consistent and ii) EFICAz² generates considerably more unique assignments than KEGG.
    Conclusion: Performance benchmarks and the comparison with KEGG demonstrate that EFICAz² is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction. The EFICAz² web service is available at: http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.htm
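The abstract above reports results in terms of prediction precision and recall. As a rough illustration only, the following Python sketch computes both quantities for a set of EC number assignments using the standard definitions. The abstract does not spell out the paper's exact benchmark protocol, so the function name `precision_recall`, the protein identifiers, and the toy assignments are all hypothetical.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for EC number assignments.

    Both arguments are sets of (protein_id, ec_number) pairs.
    Standard definitions are assumed; the paper's benchmark may differ.
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data: identifiers and assignments are made up for illustration.
predicted = {("P1", "3.6.1.13"), ("P2", "4.2.2.3"), ("P3", "3.1.4.12")}
reference = {("P1", "3.6.1.13"), ("P2", "4.2.2.3"), ("P4", "3.6.1.10")}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three predictions are correct and two of the three reference assignments are recovered, so both values come out at about 0.67.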

    High precision multi-genome scale reannotation of enzyme function by EFICAz

    ©2006 Arakaki et al; licensee BioMed Central Ltd. The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2164/7/315 (doi:10.1186/1471-2164-7-315).
    Background: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.
    Results: Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes, whose biochemical functions have been recently characterized, and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins, PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3), and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected an annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as a candidate virulence factor based on its EFICAz-predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12).
    Conclusion: Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.
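The abstract above distinguishes correctness at the three-field and four-field EC number level. A minimal Python sketch of that comparison follows, assuming that an n-field match simply means the first n dot-separated fields agree. The helper names and the (predicted, characterized) pairs are hypothetical and are not results from the paper.

```python
def ec_fields_match(predicted, actual, n_fields):
    """Return True if the first n_fields of two EC numbers are identical."""
    pred_fields = predicted.split(".")[:n_fields]
    true_fields = actual.split(".")[:n_fields]
    # Require that the prediction actually specifies n_fields fields.
    return len(pred_fields) == n_fields and pred_fields == true_fields


def fraction_correct(pairs, n_fields):
    """Fraction of (predicted, actual) pairs that agree on the first n_fields."""
    hits = sum(ec_fields_match(pred, actual, n_fields) for pred, actual in pairs)
    return hits / len(pairs)


# Hypothetical (predicted, characterized) pairs, for illustration only.
pairs = [
    ("4.2.2.3", "4.2.2.3"),    # agrees at four fields
    ("3.6.1.13", "3.6.1.10"),  # agrees at three fields only
    ("3.1.4.12", "3.1.3.12"),  # disagrees at the third field
]

three_field = fraction_correct(pairs, 3)
four_field = fraction_correct(pairs, 4)
print(three_field, four_field)
```

A three-field match is the weaker criterion, so the three-field fraction can never be lower than the four-field fraction, which is consistent with the ordering of the two percentages quoted in the abstract.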

    Identification of metabolites with anticancer properties by computational metabolomics

    ©2008 Arakaki et al; licensee BioMed Central Ltd. The electronic version of this article is the complete one and can be found online at: http://www.molecular-cancer.com/content/7/1/57 (doi:10.1186/1476-4598-7-57).
    Background: Certain endogenous metabolites can influence the rate of cancer cell growth. For example, diacylglycerol, ceramides and sphingosine, NAD+ and arginine exert this effect by acting as signaling molecules, while carrying out other important cellular functions. Metabolites can also be involved in the control of cell proliferation by directly regulating gene expression in ways that are signaling pathway-independent, e.g. by direct activation of transcription factors or by inducing epigenetic processes. The fact that metabolites can affect the cancer process on so many levels suggests that the change in concentration of some metabolites that occurs in cancer cells could have an active role in the progress of the disease.
    Results: CoMet, a fully automated Computational Metabolomics method to predict changes in metabolite levels in cancer cells compared to normal references, has been developed and applied to Jurkat T leukemia cells with the goal of testing the following hypothesis: up- or downregulation in cancer cells of the expression of genes encoding metabolic enzymes leads to changes in intracellular metabolite concentrations that contribute to disease progression. All nine metabolites predicted to be lowered in Jurkat cells with respect to lymphoblasts that were examined (riboflavin, tryptamine, 3-sulfino-L-alanine, menaquinone, dehydroepiandrosterone, α-hydroxystearic acid, hydroxyacetone, seleno-L-methionine and 5,6-dimethylbenzimidazole) exhibited antiproliferative activity that has not been reported before, while only two (bilirubin and androsterone) of the eleven tested metabolites predicted to be increased or unchanged in Jurkat cells displayed significant antiproliferative activity.
    Conclusion: These results: a) demonstrate that CoMet is a valuable method to identify potential compounds for experimental validation, b) indicate that cancer cell metabolism may be regulated to reduce the intracellular concentration of certain antiproliferative metabolites, leading to uninhibited cellular growth, and c) suggest that many other endogenous metabolites with important roles in carcinogenesis are awaiting discovery.

    EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference

    EFICAz (Enzyme Function Inference by Combined Approach) is an automatic engine for large-scale enzyme function inference that combines predictions from four different methods developed and optimized to achieve high prediction accuracy: (i) recognition of functionally discriminating residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (ii) pairwise sequence comparison using a family-specific Sequence Identity Threshold, (iii) recognition of FDRs in Multiple Pfam enzyme families, and (iv) recognition of multiple Prosite patterns of high specificity. For the identification of FDRs (i.e. conserved positions in an enzyme family that discriminate between true and false members of the family), we have developed an Evolutionary Footprinting method that uses evolutionary information from homofunctional and heterofunctional multiple sequence alignments associated with an enzyme family. The FDRs show a significant correlation with annotated active site residues. In a jackknife test, EFICAz shows high accuracy (92%) and sensitivity (82%) for predicting four EC digits in testing sequences that are <40% identical to any member of the corresponding training set. Applied to the Escherichia coli genome, EFICAz assigns more detailed enzymatic function than KEGG, and generates numerous novel predictions.

    Multimeric Threading-Based Prediction of Protein–Protein Interactions on a Genomic Scale: Application to the Saccharomyces cerevisiae Proteome

    MULTIPROSPECTOR, a multimeric threading algorithm for the prediction of protein–protein interactions, is applied to the genome of Saccharomyces cerevisiae. Each possible pairwise interaction among more than 6000 encoded proteins is evaluated against a dimer database of 768 complex structures by using a confidence estimate of the fold assignment and the magnitude of the statistical interfacial potentials. In total, 7321 interactions between pairs of different proteins are predicted, based on 304 complex structures. Quality estimation based on the coincidence of subcellular localizations and biological functions of the predicted interactors shows that our approach ranks third when compared with all other large-scale methods. Unlike other in silico methods, MULTIPROSPECTOR is able to identify the residues that participate directly in the interaction. Three hundred seventy-four of our predictions can be found by at least one of the other studies, which is compatible with the overlap between two different other methods. From the analysis of the mRNA abundance data, our method does not bias towards proteins with high abundance. Finally, several relevant predictions involved in various functions are presented. In summary, we provide a novel approach to predict protein–protein interactions on a genomic scale that is a useful complement to experimental methods

    The continuity of protein structure space is an intrinsic property of proteins

    The classical view of the space of protein structures is that it is populated by a discrete set of protein folds. For proteins up to 200 residues long, by using structural alignments and building upon ideas of the completeness and continuity of structure space, we show that nearly any structure is significantly related to any other using a transitive set of no more than 7 intermediate structurally related proteins. This result holds for all structures in the Protein Data Bank, even when structural relationships between evolutionary related proteins (as detected by threading or functional analyses) are excluded. A similar picture holds for an artificial library of compact, hydrogen-bonded, homopolypeptide structures. The 3 sets share the global connectivity features of random graphs, in which the local connectivity of each node (i.e., the number of neighboring structures per protein) is preserved. This high connectivity supports the continuous view of single-domain protein structure space. More importantly, these results do not depend on evolution, rather just on the physics of protein structures. The fact that evolutionary divergence need not be invoked to explain the continuous nature of protein structure space has implications for how the universe of protein structures might have originated, and how function should be transferred between proteins of similar structure
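The claim above, that nearly any structure can be related to any other through a short transitive chain of intermediates, can be pictured as a shortest-path question on a graph whose nodes are structures and whose edges are significant structural relationships. The Python sketch below uses breadth-first search on a tiny hypothetical graph. How edges are actually defined (the structural alignment method and significance cutoff) is not given in the abstract, so the graph and the function name `intermediates_needed` are assumptions for illustration.

```python
from collections import deque


def intermediates_needed(graph, start, goal):
    """Number of intermediate structures on a shortest path from start to goal.

    graph maps each structure to the structures it is significantly related to.
    Returns None when no chain of related structures connects the two.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])  # (structure, edges walked from start)
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                # dist edges led to node, so dist intermediates lie between
                # start and goal (0 when they are directly related).
                return dist
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical toy graph: A-B, B-C, C-D are related pairs; E is isolated.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": [],
}

print(intermediates_needed(toy_graph, "A", "D"))
print(intermediates_needed(toy_graph, "A", "E"))
```

In the toy graph, A reaches D through the two intermediates B and C, while E cannot be reached at all. In the abstract's terms, the reported finding is that this count stays at 7 or below for nearly all pairs of structures.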

    Identification of metabolites with anticancer properties by computational metabolomics-0

    [...] cells when enzymes that produce X are upregulated and/or enzymes that consume X are downregulated in cancer cells. (B) The intracellular level of a metabolite X is predicted to be decreased in cancer cells when enzymes that produce X are downregulated and/or enzymes that consume X are upregulated in cancer cells. See Material and Methods for a complete description of the rules.
    Copyright information: Taken from "Identification of metabolites with anticancer properties by computational metabolomics", http://www.molecular-cancer.com/content/7/1/57. Molecular Cancer 2008;7:57. Published online 17 Jun 2008. PMCID: PMC2453147.
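The two rules in the legend above can be sketched as a small decision function. This is only an illustration of the rules as worded in the legend. The complete rule set is in the paper's Material and Methods and is not reproduced here, so the treatment of conflicting or absent evidence (returning "unchanged") is an assumption, as are the function name and the string labels.

```python
def predict_level_change(producers, consumers):
    """Predict the change in the level of a metabolite X in cancer cells.

    producers / consumers: expression states ("up", "down" or "unchanged")
    of the enzymes that produce or consume X, relative to the normal reference.
    """
    # Evidence for an increase: producers upregulated and/or consumers downregulated.
    raises = "up" in producers or "down" in consumers
    # Evidence for a decrease: producers downregulated and/or consumers upregulated.
    lowers = "down" in producers or "up" in consumers

    if raises and not lowers:
        return "increased"
    if lowers and not raises:
        return "decreased"
    # Assumption: conflicting or absent evidence is reported as "unchanged".
    return "unchanged"


print(predict_level_change(["up"], ["unchanged"]))
print(predict_level_change(["unchanged"], ["up"]))
print(predict_level_change(["up"], ["up"]))
```

The three calls cover one case for each rule plus a conflicting case, which falls through to the assumed default.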