4 research outputs found
Image-based Automated Chemical Database Annotation with Ensemble of Machine-Vision Classifiers
This paper presents an image-based annotation strategy for automated annotation of chemical databases. The proposed strategy is based on the use of a machine vision-based classifier for extracting a 2D chemical structure diagram in research articles and converting them into standard chemical file formats, a virtual Chemical Expert" system for screening the converted structures based on the level of estimated conversion accuracy, and a fragment-based measure for calculation intermolecular similarity. In particular, in order to overcome limited accuracies of individual machine-vision classifier, inspired by ensemble methods in machine learning, it is attempted to use of the ensemble of machine-vision classifiers. For annotation, calculated chemical similarity between the converted structures and entries in a virtual small molecule database is used to establish the links. Annotation test to link 121 journal articles to entries in PubChem database demonstrates that ensemble approach increases the coverage of annotation, while keeping the annotation quality (e.g., recall and precision rates) comparable to using a single machine-vision classifier.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/87266/4/Saitou55.pd
Molecular access to multi-dimensionally encoded information
Polymer scientist have only recently realized that information storage on the molecular level is not only restricted to DNA-based systems. Similar encoding and decoding of data have been demonstrated on synthetic polymers that could overcome some of the drawbacks associated with DNA, such as the ability to make use of a larger monomer alphabet. This feature article describes some of the recent data storage strategies that were investigated, ranging from writing information on linear sequence-defined macromolecules up to layer-by-layer casted surfaces and QR codes. In addition, some strategies to increase storage density are elaborated and some trends regarding future perspectives on molecular data storage from the literature are critically evaluated. This work ends with highlighting the demand for new strategies setting up reliable solutions for future data management technologies
Recommended from our members
Information extraction from chemical patents
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.
Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.
PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye – 4444 reactions are extracted with a precision of 78% and recall of 64% with regards to determining the identity and amount of reactants employed and an accuracy of 92% with regards to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.
Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.Unileve