Information extraction from chemical patents
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.
Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.
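As an illustration of the general technique only, the following is a minimal sketch of a single Hearst pattern ("X such as A, B and C") implemented as a regular expression. It is not the system described in this abstract: the function name `extract_hyponyms`, the single-token definition of a term, and the example sentence are all assumptions made for the sketch. Real chemical names (for example, those containing commas, digits or brackets) would need a proper chemical named-entity recogniser rather than this crude token pattern.

```python
import re

# Assumption: a "term" is a single alphanumeric/hyphenated token.
# Real chemical names are far more varied than this.
TERM = r"[A-Za-z][A-Za-z0-9-]*"

# One classic Hearst pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
SUCH_AS = re.compile(
    rf"(?P<hyper>{TERM}),? such as "
    rf"(?P<hypos>{TERM}(?:, (?!and |or ){TERM})*(?:,? (?:and|or) {TERM})?)"
)


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    relations = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        # Split the enumerated list on ", ", " and ", " or " and ", and ".
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        relations.extend((hypo, hypernym) for hypo in hyponyms if hypo)
    return relations


if __name__ == "__main__":
    text = "Suitable solvents such as methanol, ethanol and toluene may be used."
    for hyponym, hypernym in extract_hyponyms(text):
        print(f"{hyponym} is-a {hypernym}")
```

Each extracted pair is only a candidate relation; as the abstract notes, such candidates still need validation before being asserted in an ontology.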
PatentEye, an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML), is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye: 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.
Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication quality before they are presented to the community.
The computer storage, retrieval and searching of generic structures in chemical patents: the machine-readable representation of generic structures
The nature of the generic chemical structures found in patents is
described, with a discussion of the types of statement commonly
found in them. The available representations for such structures
are reviewed, with particular note being given to the suitability
of the representation for searching files of such structures.
Requirements for the unambiguous representation of generic
structures in an "ideal" storage and retrieval system are
discussed.
The basic principles of the theory of formal languages are
reviewed, with particular consideration being given to parsing
methods for context-free languages. The grammar and parsing of
computer programming languages, as an example of artificial
formal languages, are discussed. Applications of formal language
theory to chemistry and information work are briefly reviewed.
GENSAL, a formal language for the unambiguous description of
generic structures from patents, is presented. It is designed to
be intelligible to a chemist or patent agent, yet sufficiently
formalised to be amenable to computer analysis. Detailed
description is given of the facilities it provides for generic
structure representation, and there is discussion of its
limitations and the principles behind its design.
A connection-table-based internal representation for generic
structures, called an ECTR (Extended Connection Table
Representation) is presented. It is designed to represent generic
structures unambiguously, and to be generated automatically from
structures encoded in GENSAL. It is compared to other proposed
representations, and its implementation using data types of the
programming language Pascal described.
An interpreter program which generates an ECTR from structures
encoded in a subset of the GENSAL language is presented. The
principles of its operation are described.
Possible applications of GENSAL outside the area of patent
documentation are discussed, and suggestions made for further
work on the development of a generic structure storage and
retrieval system based on GENSAL and ECTRs.