Detection of IUPAC and IUPAC-like chemical names

By Roman Klinger, Corinna Kolářik, Juliane Fluck, Martin Hofmann-Apitius and Christoph M. Friedrich


Motivation: Chemical compounds, such as small signal molecules and other biologically active substances, are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, including SMILES, InChI, IUPAC and trivial names. Only SMILES and InChI allow a direct structure search, but trivial names and IUPAC-like names are used more frequently in biomedical texts. While trivial names can be found with a dictionary-based approach and thereby mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, together with its evaluation and the conversion rate achieved with available name-to-structure tools.

Topics: ISMB 2008 Conference Proceedings, 19–23 July 2008, Toronto
Publisher: Oxford University Press
Provided by: PubMed Central

