Search CORE

11 research outputs found

Automated extraction of chemical structure information from digital raster images

Author: Lyu Naesung
Nguyen Mandee
Park Jungkap
Rosania Gustavo R.
Saitou Kazuhiro
Shedden Kerby A.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

Background: To search for chemical structures in research articles, diagrams or text representing molecules need to be translated to a standard chemical file format compatible with cheminformatic search engines. However, chemical information in research articles is often presented as analog diagrams of chemical structures embedded in digital raster images. Several software systems have been developed to automate the analog-to-digital conversion of chemical structure diagrams in scientific research articles, but their algorithmic performance and utility in cheminformatic research have not been investigated. Results: This paper critically reviews these systems and reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams from research articles and converting them into standard, searchable chemical file formats. The basic algorithms for recognizing the lines and letters that represent bonds and atoms in chemical structure diagrams can be run independently and in sequence from a graphical user interface, and the algorithm parameters can be readily changed, which facilitates additional development tailored to a specific chemical database annotation scheme. Compared with existing software programs such as OSRA, Kekule, and CLiDE, our results indicate that ChemReader outperforms other software systems on several sets of sample images from diverse sources in terms of the rate of correct outputs and the accuracy of extracting molecular substructure patterns. Conclusion: The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating the entries with published research articles. Based on its stable performance and high accuracy, ChemReader may be sufficiently accurate for annotating a chemical database with links to scientific research articles.
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/90875/1/Saitou8.pd

Directory of Open Access Journals

PubMed Central

Deep Blue Documents at the University of Michigan

Large scale chemical patent mining with UIMA and UNICORE

Author: Bergmann Sandra
Klenner Alexander
Romberg Mathilde
Zimmermann Marc
Publication venue: BioMed Central
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Juelich Shared Electronic Resources

Computergestützte Informationsbeschaffung und -verwaltung aus wissenschaftlichen Dokumenten = Computer-aided information retrieval and management system from scientific documents

Author: Nguyen Thanh Cam An
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2020

KITopen

Detection of IUPAC and IUPAC-like chemical names

Author: C. Kolarik
C. M. Friedrich
Eller
Guzikowski
J. Fluck
Kolarik
M. Hofmann-Apitius
R. Klinger
Steinbeck
Wishart
Publication venue: Oxford University Press
Publication date: 01/01/2008

Motivation: Chemical compounds such as small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals exist, such as SMILES, InChI, IUPAC or trivial names. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and mapped in this way to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, as well as its evaluation and the conversion rate achieved with available name-to-structure tools.

Crossref

Fraunhofer-ePrints

PubMed Central

Publications at Bielefeld University

How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry

Author: A Oikawa
AJ Williams
BJ Strasser
C Steinbeck
CA Smith
CF Taylor
D Flaxbart
DB Baker
DL Wheeler
DS Wishart
DW Hill
EL Schymanski
EL Willighagen
F Mu
FH Allen
IV Filippov
J Downing
J Park
J Rhodes
JR McDaniel
LB De Silva
LW Sumner
M Arita
MA Ott
Martin Scholz
Michael Polymenis
O Casher
O Fiehn
Oliver Fiehn
P Corbett
P Ibison
P Jaiswal
P Murray-Rust
Q Cui
R Apodaca
R Austin
R Caspi
R Guha
R Kidd
S Kuhn
SE Stein
SM Paley
SR Heller
T Kind
Tobias Kind
Y Zhou
Publication venue: Public Library of Science

Calculating the metabolome size of species by genome-guided reconstruction of metabolic pathways misses all products from orphan genes and from enzymes lacking annotated genes. Hence, metabolomes need to be determined experimentally. Annotations by mass spectrometry would greatly benefit if peer-reviewed public databases could be queried to compile target lists of structures that have already been reported for a given species. We detail current obstacles to compiling such a knowledge base of metabolites. As an example, results are presented for rice. Two rice (Oryza sativa) subspecies have been fully sequenced, Oryza japonica and Oryza indica. Several major small molecule databases were compared for their listings of known rice metabolites: PubChem, Chemical Abstracts, Beilstein, patent databases, Dictionary of Natural Products, SetupX/BinBase, KNApSAcK DB, and finally databases obtained by computational approaches, i.e. RiceCyc, KEGG, and Reactome. More than 5,000 small molecules were retrieved when searching these databases. Unfortunately, genuine rice metabolites were most often retrieved together with non-metabolite database entries such as pesticides. Compound lists from different databases were very difficult to compare for overlaps because structures were either not encoded in a machine-readable format or because compound identifiers were not cross-referenced between databases. We conclude that present databases are not capable of comprehensively retrieving all known metabolites. Metabolome lists are still mostly restricted to genome-reconstructed pathways. We suggest that providers of (bio)chemical databases link their database identifiers to PubChem IDs and InChIKeys to enable cross-database queries. In addition, peer-reviewed journal repositories need to mandate submission of structures and spectra in machine-readable format to allow automated semantic annotation of articles containing chemical structures. Such changes in publication standards and database architectures will enable researchers to compile current knowledge about the metabolome of species, which may extend to derived information such as spectral libraries, organ-specific metabolites, and cross-study comparisons.
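
The cross-database query that the abstract above recommends can be illustrated with a minimal Python sketch. It assumes that each database exposes a mapping from its own identifier to an InChIKey. The database contents and the identifiers `A-001`, `B-17`, and so on are hypothetical, and `cross_reference` is an illustrative helper, not part of any of the databases named above.

```python
# Hypothetical extracts from two compound databases: local identifier -> InChIKey.
db_a = {
    "A-001": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",  # water
    "A-002": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
}
db_b = {
    "B-17": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",   # ethanol
    "B-42": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",   # caffeine
}


def cross_reference(first, second):
    """Return {inchikey: (id_in_first, id_in_second)} for compounds present in both."""
    # Invert the second mapping so that lookups by InChIKey are constant time.
    second_by_key = {key: ident for ident, key in second.items()}
    return {
        key: (ident, second_by_key[key])
        for ident, key in first.items()
        if key in second_by_key
    }


shared = cross_reference(db_a, db_b)
print(shared)
```

Without a shared structure-derived key, the same comparison would have to rely on compound names or manual curation, which is the obstacle the abstract describes.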

Crossref

Directory of Open Access Journals

PubMed Central

7th German Conference on Chemoinformatics: 25 CIC-Workshop : Goslar, Germany, 6 - 8 November 2011 ; meeting abstracts / Edited by Frank Oellien, Uli Fechner and Thomas Engel

Author: Engel Thomas
Fechner Uli
Oellien Frank
Publication date: 01/05/2012

Hochschulschriftenserver - Universität Frankfurt am Main

Conceptualization of Computational Modeling Approaches and Interpretation of the Role of Neuroimaging Indices in Pathomechanisms for Pre-Clinical Detection of Alzheimer Disease

Author: Iyappan Anandhi
Publication venue: Universitäts- und Landesbibliothek Bonn

With swift advancements in next-generation sequencing technologies and the voluminous growth of biological data, a variety of data resources such as databases and web services have been created to facilitate data management, accessibility, and analysis. However, the burden of interoperability between dynamically growing data resources is an increasingly rate-limiting step in biomedicine, specifically concerning neurodegeneration. Over the years, massive investments and technological advancements in dementia research have resulted in large proportions of unmined data. Accordingly, there is an essential need for intelligent and integrative approaches to mine available data and substantiate novel research outcomes. Semantic frameworks provide a unique possibility to integrate multiple heterogeneous, high-resolution data resources with semantic integrity using standardized ontologies and vocabularies for context-specific domains. In the current work, (i) the functionality of a semantically structured terminology for mining pathway-relevant knowledge from the literature, called Pathway Terminology System, is demonstrated and (ii) a context-specific, high-granularity semantic framework for neurodegenerative diseases, known as NeuroRDF, is presented. Neurodegenerative disorders are especially complex as they are characterized by widespread manifestations and the potential for dramatic alterations in disease progression over time. Early detection and prediction strategies based on clinical pointers can provide promising solutions for effective treatment of Alzheimer's disease (AD). In the current work, we present the importance of bridging the gap between clinical and molecular biomarkers to contribute effectively to dementia research. Moreover, we address the need for a formalized framework called NIFT to automatically mine relevant clinical knowledge from the literature for substantiating high-resolution cause-and-effect models.

bonndoc – Der Publikationsserver der Universität Bonn

Development of deep learning applications for the automated extraction of chemical information from scientific literature

Author: Brinkhaus Otto
Publication date: 01/01/2023

This dissertation focuses on developing deep learning applications for extracting chemical information from scientific literature, particularly the automated recognition of molecular structures in images. DECIMER Segmentation, a novel application, employs a Mask Region-based Convolutional Neural Network (MRCNN) model, aided by a mask expansion algorithm, to segment chemical structures in documents, marking a significant advancement in processing chemical literature. The Optical Chemical Structure Recognition (OCSR) tool DECIMER Image Transformer uses an encoder-decoder architecture to convert chemical structure depictions into the machine-readable SMILES format. The model has been trained on over 450 million pairs of images and SMILES representations. Its ability to interpret various depiction styles, including hand-drawn structures, sets a new standard in OCSR. RanDepict was developed to artificially generate large and diverse OCSR training datasets using multiple cheminformatics toolkits. The diversification of training data ensures robust model generalisation across different chemical structure depictions. A unique dataset of hand-drawn molecule images was created to evaluate the model's performance in interpreting these challenging depictions. This dataset further contributes to the understanding of automated structure recognition from diverse styles. The integration of these technologies led to the creation of DECIMER.ai, an open-source web application that combines segmentation and interpretation tools, allowing users to extract and process chemical information from literature efficiently. The work concludes with a discussion of the significance of open data in advancing molecular informatics, highlighting its potential for broader chemical research domains. By adhering to FAIR data standards and open-source principles, the tools developed for this dissertation are designed for adaptability and future development within the community.

Digitale Bibliothek Thüringen


Information extraction from chemical patents

Author: Jessop David M
Publication venue: University of Cambridge
Publication date: 15/03/2011

The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature. Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use. PatentEye, an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML), is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye: 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information. Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication quality before they are presented to the community.
Unileve
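
The Hearst pattern idea mentioned in the abstract above can be sketched briefly in Python. This is an illustration only: it implements a single pattern of the form "X such as A, B and C", treats each term as one word, and makes no claim to reproduce the patterns or preprocessing used in the thesis. The names `SUCH_AS` and `hearst_such_as` are hypothetical.

```python
import re

# One illustrative Hearst pattern: "<hypernym> such as <hyponym>(, <hyponym>)* (and|or <hyponym>)".
# Terms are simplified to single words; real chemical names would need a proper tokenizer.
SUCH_AS = re.compile(
    r"(\w+) such as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)

# Separators between listed hyponyms: commas, optionally followed by "and"/"or".
SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in SEPARATOR.split(match.group(2)):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


print(hearst_such_as("Solvents such as toluene, acetone, and ethanol were used."))
```

Each extracted pair asserts that the first term is a kind of the second, which is the type of hyponymic relation the abstract reports validating manually.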

Apollo (Cambridge)

Identifying chemical entities on literature:a machine learning approach using dictionaries as domain knowledge

Author: Grego Tiago Daniel Pereira, 1983-
Publication date: 01/01/2013

Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2013.
The volume of life science publications, and therefore of the underlying biomedical knowledge, is growing at a fast pace. However, manual literature analysis is a slow and painful task. Hence, text mining systems have been developed to automatically locate the relevant information contained in the literature. An essential step in text mining is named entity recognition, but the inherent complexity of biomedical entities, such as chemical compounds, makes it difficult to obtain good performance in this task. This thesis proposes methods capable of improving the current performance of chemical entity recognition from text. A case-based method for recognizing chemical entities is proposed, and the evaluation results obtained outperform the most widely used methods, which are based on dictionaries. A lexical similarity based chemical entity resolution method was also developed, which allows an efficient mapping of the recognized entities to the ChEBI database. To improve the chemical entity identification results, we developed a validation method that exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text, in order to discriminate between correctly identified entities that can be validated and identification errors that should be discarded. A machine learning method for detecting entity recognition errors is also proposed, which can effectively find recognition errors in rule-based systems. The methods were integrated into a system capable of recognizing chemical entities in texts, mapping them to the ChEBI database, and providing evidence of validation or recognition error for the recognized entities.
The volume of scientific publications in the life sciences is increasing at a growing pace. However, manual analysis of the literature is an arduous and slow process, so text mining systems have been developed to automatically identify the relevant information contained in the literature. An essential step in text mining is the identification of named entities, but the inherent complexity of biomedical entities, such as chemical compounds, makes it difficult to achieve good performance in this task. This thesis proposes methods to improve the current performance of chemical entity recognition in text. To this end, a machine learning based method for recognizing chemical entities is proposed, which obtained better results than the dictionary-based methods currently in use. A lexical similarity based method that maps entities to the ChEBI ontology was also developed. To improve the chemical entity identification results, a validation method was developed that exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text, in order to discriminate correctly identified entities from identification errors. A machine learning based error filtering method is also proposed and was tested on a rule-based system. These methods were integrated into a system capable of recognizing chemical entities in text, mapping them to ChEBI, and providing evidence for validation or error detection of the recognized entities.
Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/36015/2007
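
As a rough illustration of lexical similarity based entity resolution, the Python sketch below maps a recognized mention to the closest name in a small dictionary using the standard library's `difflib`. It is an assumption-laden stand-in, not the method developed in the thesis: the three-entry dictionary is a toy example, the similarity measure is `difflib`'s sequence ratio, and the `resolve` helper and its `cutoff` value are hypothetical.

```python
import difflib

# Toy name -> identifier dictionary; a real system would load the full ChEBI name list.
NAME_TO_ID = {
    "water": "CHEBI:15377",
    "ethanol": "CHEBI:16236",
    "caffeine": "CHEBI:27732",
}


def resolve(mention, cutoff=0.8):
    """Return the identifier of the most lexically similar name, or None if none is close."""
    candidates = difflib.get_close_matches(
        mention.lower(), list(NAME_TO_ID), n=1, cutoff=cutoff
    )
    return NAME_TO_ID[candidates[0]] if candidates else None


print(resolve("Cafeine"))   # misspelled mention still resolves to the caffeine entry
print(resolve("benzene"))   # no sufficiently similar name in the toy dictionary
```

The cutoff trades recall for precision: a lower value resolves more misspelled or variant mentions but also produces more wrong mappings, which is where a separate validation step becomes useful.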

Universidade de Lisboa: Repositório.UL