
Mining metabolites: extracting the yeast metabolome from the literature

By Chikashi Nobata, Paul D. Dobson, Syed A. Iqbal, Pedro Mendes, Jun’ichi Tsujii, Douglas B. Kell and Sophia Ananiadou


Text mining methods have added considerably to our capacity to extract biological knowledge from the literature. Recently the field of systems biology has begun to model and simulate metabolic networks, requiring knowledge of the set of molecules involved. While genomics and proteomics technologies are able to supply the macromolecular parts list, the metabolites are less easily assembled. Most metabolites are known and reported through the scientific literature, rather than through large-scale experimental surveys. Thus it is important to recover them from the literature. Here we present a novel tool to automatically identify metabolite names in the literature, and associate structures where possible, to define the reported yeast metabolome. With ten-fold cross-validation on a manually annotated corpus, our recognition tool achieves an F-score of 78.49 (precision of 83.02) and is better suited to identifying metabolite names than existing recognition tools for general chemical molecules. The metabolite recognition tool has been applied to the literature covering an important model organism, the yeast Saccharomyces cerevisiae, to define its reported metabolome. By coupling to ChemSpider, a major chemical database, we have identified structures for much of the reported metabolome and, where structure identification fails, have been able to suggest extensions to ChemSpider. Our manually annotated gold-standard data on 296 abstracts are available as supplementary materials. Metabolite names and, where appropriate, structures are also available as supplementary materials.
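The abstract reports an F-score and a precision but no recall. As a rough illustration only, the sketch below back-calculates the recall implied by those two figures. It assumes the reported score is the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function names are hypothetical and not part of the authors' tool.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f: float, precision: float) -> float:
    """Solve F = 2PR / (P + R) for R, given F and P.

    Assumption: F is the balanced F-measure. Rearranging gives
    R = F * P / (2P - F), which requires 2P > F.
    """
    denom = 2 * precision - f
    if denom <= 0:
        raise ValueError("no valid recall: need 2 * precision > f")
    return f * precision / denom


# Figures quoted in the abstract (percentages).
reported_f = 78.49
reported_p = 83.02

recall = implied_recall(reported_f, reported_p)
print(f"implied recall: {recall:.2f}")
```

Under that assumption the implied recall is about 74.4, so the tool would be somewhat more precise than it is complete. This is a derived estimate, not a figure taken from the abstract.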

Topics: Original Article
Publisher: Springer US
Provided by: PubMed Central

