158,029 research outputs found
Extraction of chemical structures and reactions from the literature
The ever increasing quantity of chemical literature necessitates
the creation of automated techniques for extracting relevant information.
This work focuses on two aspects: the conversion of chemical names to
computer readable structure representations and the extraction of chemical
reactions from text.
Chemical names are a common way of communicating chemical structure
information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an
open source, freely available algorithm for converting chemical names to
structures was developed. OPSIN employs a regular grammar to direct
tokenisation and parsing leading to the generation of an XML parse tree.
Nomenclature operations are applied successively to the tree with many
requiring the manipulation of an in-memory connection table representation
of the structure under construction. Areas of nomenclature supported are
described with attention being drawn to difficulties that may be
encountered in name to structure conversion. Results on sets of generated
names and names extracted from patents are presented. On generated names,
recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9%
on precision with all results either being comparable or superior to the
tested commercial solutions. On the patent names OPSIN's recall was 2-10%
higher than the tested solutions when the patent names were processed as
found in the patents. The uses of OPSIN as a web service and as a tool for
identifying chemical names in text are shown to demonstrate the direct
utility of this algorithm.
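The idea of grammar-directed tokenisation described above can be illustrated with a deliberately tiny sketch. The token classes, word lists and the `tokenise` function below are assumptions made up for this illustration; they are not OPSIN's grammar or API, and they cover only a handful of simple alkane-style names.

```python
import re

# Toy token classes for a tiny subset of alkane-style names.
# These word lists are illustrative assumptions, not OPSIN's actual grammar.
TOKEN_SPEC = [
    ("locant", r"\d+(?:,\d+)*"),
    ("hyphen", r"-"),
    ("multiplier", r"di|tri|tetra"),
    ("substituent", r"methyl|ethyl|chloro|bromo"),
    ("stem", r"meth|eth|prop|but|pent|hex"),
    ("suffix", r"ane|ene|yne"),
]

# One alternation of named groups; earlier classes win when several could match,
# so "methyl" is tried before the shorter stem "meth".
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenise(name):
    """Split a chemical name into (token_class, text) pairs, left to right."""
    tokens = []
    pos = 0
    while pos < len(name):
        match = MASTER.match(name, pos)
        if match is None:
            raise ValueError(f"cannot tokenise {name!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenise("2-methylpropane"))
```

A real name-to-structure converter would go on to build a parse tree from these tokens and apply nomenclature operations to it; this sketch stops at tokenisation and rejects anything outside its small vocabulary.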
A software system for extracting chemical reactions from the text of
chemical patents was developed. The system relies on the output of
ChemicalTagger, a tool for tagging words and identifying phrases of
importance in experimental chemistry text. Improvements to this tool
required to facilitate this task are documented. The structures of chemical
entities are, where possible, determined using OPSIN in conjunction with a
dictionary of name to structure relationships. Extracted reactions are
atom mapped to confirm that they are chemically consistent. 424,621 atom
mapped reactions were extracted from 65,034 organic chemistry USPTO
patents. On a sample of 100 of these extracted reactions chemical entities
were identified with 96.4% recall and 88.9% precision. Quantities could be
associated with reagents in 98.8% of cases and 64.9% of cases for products
whilst the correct role was assigned to chemical entities in 91.8% of
cases. Qualitatively the system captured the essence of the reaction in
95% of cases. This system is expected to be useful in the creation of
searchable databases of reactions from chemical patents and in
facilitating analysis of the properties of large populations of reactions.
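The recall and precision figures quoted in this abstract follow the standard definitions, which can be sketched as below. The function name and the example entity sets are hypothetical and are not taken from the evaluation described above.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data, for illustration only.
extracted_entities = {"ethanol", "sodium hydride", "THF", "water"}
gold_entities = {"ethanol", "sodium hydride", "THF", "toluene"}
precision, recall = precision_recall(extracted_entities, gold_entities)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```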
Information extraction from chemical patents
The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.
Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.
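As a rough illustration of how a Hearst pattern yields hyponymic relations, the sketch below handles only the classic "X such as A, B and C" form. The regular expressions and the function name are assumptions for this example, not the system described above. The sketch also assumes that the list of hyponyms runs to the end of the clause and that the hypernym is a single word.

```python
import re

# "<hypernym> such as <list>" where the list runs up to the next '.' or ';'.
PATTERN = re.compile(r"\b(?P<hypernym>[A-Za-z-]+),? such as (?P<hyponyms>[^.;]+)")

# Split the list on ", " / ", and " / " and " / " or ". A comma must be followed
# by whitespace, so locants such as "1,2-" inside a chemical name are not split.
SPLIT = re.compile(r",\s+(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    relations = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group("hypernym").lower()
        for item in SPLIT.split(match.group("hyponyms")):
            item = item.strip()
            if item:
                relations.append((item, hypernym))
    return relations


sentence = "The reaction is run in solvents such as toluene, 1,2-dichloroethane and THF."
for hyponym, hypernym in extract_hyponyms(sentence):
    print(f"{hyponym} is-a {hypernym}")
```

Relations acquired this way would still need validation, as the abstract notes, since surface patterns also match non-taxonomic uses of "such as".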
PatentEye, an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML), is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye: 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall. The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.
Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.
A text-mining system for extracting metabolic reactions from full-text articles
Background: Increasingly, biological text-mining research is focusing on the extraction of complex relationships
relevant to the construction and curation of biological networks and pathways. However, one important category of
pathway—metabolic pathways—has been largely neglected.
Here we present a relatively simple method for extracting metabolic reaction information from free text that scores
different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence
and location of stemmed keywords. This method extends an approach that has proved effective in the context of the
extraction of protein–protein interactions.
Results: When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our
method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the
well-known protein-protein interaction extraction task.
Conclusions: We conclude that automated metabolic pathway construction is more tractable than has often been
assumed, and that (as in the case of protein–protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed.
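The permutation-scoring idea in this abstract can be sketched in a much-reduced form. The abstract does not give the actual keywords, weights or rules, so the stems, scoring rules and function names below are all hypothetical. The sketch also considers only the substrate and product roles of metabolites and leaves enzymes out.

```python
from itertools import permutations

# Hypothetical keyword stems, chosen only for illustration.
REACTION_STEMS = ("convert", "catalys", "catalyz", "produc", "form")


def score_assignment(tokens, substrate_idx, product_idx):
    """Score one (substrate, product) assignment of two metabolite positions."""
    score = 0
    # Presence: any reaction keyword stem in the sentence supports a reaction.
    if any(token.lower().startswith(REACTION_STEMS) for token in tokens):
        score += 1
    # Location: "to"/"into" directly before the product supports this direction.
    if product_idx > 0 and tokens[product_idx - 1].lower() in ("to", "into"):
        score += 1
    # Location: "from" directly before the substrate supports this direction.
    if substrate_idx > 0 and tokens[substrate_idx - 1].lower() == "from":
        score += 1
    return score


def best_assignment(tokens, metabolite_indices):
    """Return the highest-scoring ordered (substrate_idx, product_idx) pair."""
    candidates = permutations(metabolite_indices, 2)
    return max(candidates, key=lambda pair: score_assignment(tokens, *pair))


tokens = "Hexokinase converts glucose to glucose-6-phosphate".split()
substrate_idx, product_idx = best_assignment(tokens, [2, 4])
print(tokens[substrate_idx], "->", tokens[product_idx])
```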
Early maturation processes in coal. Part 1: Pyrolysis mass balances and structural evolution of coalified wood from the Morwell Brown Coal seam
In this work, we develop a theoretical approach to evaluate the maturation
process of kerogen-like material, involving reactive molecular dynamics modeling
with a reactive force field to simulate the thermal stress. The Morwell coal
has been selected to study the thermal evolution of terrestrial organic matter.
To achieve this, a structural model is first constructed based on models from
the literature and analytical characterization of our samples by modern 1-D and
2-D NMR, FTIR, and elemental analysis. Then, artificial maturation of the
Morwell coal is performed at low conversions in order to obtain detailed
quantitative and qualitative evidence of the structural evolution of the kerogen upon
maturation. The observed chemical changes are a defunctionalization of the
carboxyl, carbonyl and methoxy functional groups coupled with an increase in
cross-linking in the residual mature kerogen. Gaseous and liquid hydrocarbons,
essentially CH4, C4H8 and C14+ liquid hydrocarbons, are generated in low
amounts, merely by cleavage of the lignin side chain.
Alkali release from aggregates in long-service concrete structures. Laboratory test evaluation and ASR prediction
This paper proposes a simple model for predicting the development of deleterious expansion from alkali-silica reaction (ASR) in long-service concrete structures. This model is based on some composition and reactivity parameters related to ASR, including the long-term alkali contribution by aggregates to concrete structures. This alkali contribution was estimated by means of a laboratory extraction test, specifically developed in this study in order to maximize the alkali extraction within relatively short testing times and with low leaching solution/aggregate ratios.
The proposed test is a modification of the Italian Standard test method UNI 11417-2 (Ente Nazionale Italiano di Normazione) and consists of subjecting an aggregate sample to leaching with saturated calcium hydroxide solution in a laboratory autoclave at 105 °C. Nine natural ASR-susceptible aggregates (seven sands and two coarse aggregates) were tested and the following optimized test conditions were found: leaching solution/aggregate weight ratio = 0.6; solid calcium hydroxide/aggregate weight ratio = 0.05; test duration = 120 h. The results of the optimized alkali extraction tests were used in the proposed model for predicting the potential development of long-term ASR expansion in concrete dams. ASR predictions congruent with both the field experience and the ASR prevention criteria recommended by European Committee for Standardization Technical Report CEN/TR 16349:2012 were found, thus indicating the suitability of the proposed model.
The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures
Materials science literature contains millions of materials synthesis
procedures described in unstructured natural language text. Large-scale
analysis of these synthesis procedures would facilitate deeper scientific
understanding of materials synthesis and enable automated synthesis planning.
Such analysis requires extracting structured representations of synthesis
procedures from the raw text as a first step. To facilitate the training and
evaluation of synthesis extraction models, we introduce a dataset of 230
synthesis procedures annotated by domain experts with labeled graphs that
express the semantics of the synthesis sentences. The nodes in this graph are
synthesis operations and their typed arguments, and labeled edges specify
relations between the nodes. We describe this new resource in detail and
highlight some specific challenges to annotating scientific text with shallow
semantic structure. We make the corpus available to the community to promote
further research and development of scientific information extraction systems.
Comment: Accepted as a long paper at the Linguistic Annotation Workshop (LAW)
at ACL 201
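The labeled-graph representation described in this abstract (operation and argument nodes joined by labeled edges) could be held in a structure along the following lines. The class names, node labels and relation names are hypothetical and do not reproduce the corpus's annotation schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: int
    text: str   # the annotated span from the sentence
    label: str  # hypothetical type, e.g. "Operation" or "Material"


@dataclass
class SynthesisGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node_id, text, label):
        self.nodes[node_id] = Node(node_id, text, label)

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source_id, relation, target_id))

    def arguments_of(self, operation_id):
        """List (relation, argument text) pairs attached to one operation node."""
        return [
            (relation, self.nodes[target].text)
            for source, relation, target in self.edges
            if source == operation_id
        ]


# Hypothetical annotation of the sentence "The powder was heated at 500 C."
graph = SynthesisGraph()
graph.add_node(0, "heated", "Operation")
graph.add_node(1, "powder", "Material")
graph.add_node(2, "500 C", "Condition")
graph.add_edge(0, "acts_on", 1)
graph.add_edge(0, "condition_of", 2)
print(graph.arguments_of(0))
```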
Cellulosic materials as biopolymers and supercritical CO2 as a green process: chemistry and applications
In this review, we describe the use of supercritical CO2 (scCO2) in several cellulose applications. The focus is on different technologies that either exist or are expected to emerge in the near future. The applications range widely, from the extraction of hazardous wastes to the cleaning and reuse of paper and the production of glucose. To put this topic in context, cellulose chemistry and its interactions with scCO2 are described. The aim of this study was to discuss the new emerging technologies and trends concerning cellulosic materials processed in scCO2, such as cellulose drying to obtain aerogels, foams and other microporous materials, impregnation of cellulose, and extraction of highly valuable compounds from plants and of metallic residues from treated wood. In the biofuel production field in particular, we address the pre-treatment of cellulose in scCO2 to improve fermentation to ethanol by cellulase enzymes. Other reactions of cellulosic materials, such as the fabrication of organic-inorganic composites and de-polymerisation, have been considered. Cellulose treatment by scCO2 has been discussed as well. Finally, other applications such as the deacidification of paper and the fabrication of cellulosic membranes in scCO2 have been reviewed. Examples of the discussed technologies are included as well.
Retrosynthetic reaction prediction using neural sequence-to-sequence models
We describe a fully data-driven model that learns to perform a retrosynthetic
reaction prediction task, which is treated as a sequence-to-sequence mapping
problem. The end-to-end trained model has an encoder-decoder architecture that
consists of two recurrent neural networks, an approach that has previously shown great
success in solving other sequence-to-sequence prediction tasks such as machine
translation. The model is trained on 50,000 experimental reaction examples from
the United States patent literature, which span 10 broad reaction types that
are commonly used by medicinal chemists. We find that our model performs
comparably with a rule-based expert system baseline model, and also overcomes
certain limitations associated with rule-based expert systems and with any
machine learning approach that contains a rule-based expert system component.
Our model provides an important first step towards solving the challenging
problem of computational retrosynthetic analysis.