Special Libraries, May-June 1978
Volume 69, Issue 5-6
The evolution of an on-line chemical search system for an industrial research unit.
The objectives of this study were to design an information
system, using modern computer technology, to meet a research
chemist's need for chemical structural information, to quantify
the effects of increasing degrees of computer technology on the
use made of the facilities, and to relate the use of the service
back to the individual chemist, his performance and background.
A computer system was developed based on Wiswesser Line Notation
and molecular formula as the chemical structure descriptors. Systems design and analysis were performed so that access to the
information could be obtained directly for individual compounds
and more generally for classes of compounds.
As the system was being developed, its use by information staff
was monitored by constant interaction with the people concerned.
Where appropriate, the system was modified to meet information
staff's requirements, but a number of precautions had to be
introduced to prevent misuse.
The research chemists' use of the information services was
studied retrospectively over a two-year period. In addition
to the use made, several other factors were observed for each
chemist. These included performance measures and background
information on the chemists' research role.
The data showed a steady increase in the demand for the services
by the research chemist as the degree of computerisation
increased. The use made of the services related closely to the
number of compounds prepared by each chemist, but there was no
significant correlation between a chemist's success in preparing
biologically active compounds and his information use.
The very individual way in which chemists conduct their research
was highlighted by the wide range of use of the information
facilities and the low correlation with background factors. This
makes the design of on-line systems for use by chemists themselves
complex and justifies the existence of the information scientist
as an interface.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European
Community’s Horizon 2020 Program (project reference:
654021 - OpenMinted). M.K. additionally acknowledges the
Encomienda MINETAD-CNIO as part of the Plan for the
Advancement of Language Technology. O.R. and J.O. thank
the Foundation for Applied Medical Research (FIMA),
University of Navarra (Pamplona, Spain). This work was
partially funded by Consellería
de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic
funding of UID/BIO/04469/2013 unit and COMPETE 2020
(POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi
for useful feedback and discussions during the preparation of
the manuscript.
Extraction of chemical structures and reactions from the literature
The ever increasing quantity of chemical literature necessitates
the creation of automated techniques for extracting relevant information.
This work focuses on two aspects: the conversion of chemical names to
computer readable structure representations and the extraction of chemical
reactions from text.
Chemical names are a common way of communicating chemical structure
information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an
open source, freely available algorithm for converting chemical names to
structures, was developed. OPSIN employs a regular grammar to direct
tokenisation and parsing leading to the generation of an XML parse tree.
Nomenclature operations are applied successively to the tree with many
requiring the manipulation of an in-memory connection table representation
of the structure under construction. Areas of nomenclature supported are
described with attention being drawn to difficulties that may be
encountered in name to structure conversion. Results on sets of generated
names and names extracted from patents are presented. On generated names,
recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9%
on precision with all results either being comparable or superior to the
tested commercial solutions. On the patent names OPSIN's recall was 2-10%
higher than the tested solutions when the patent names were processed as
found in the patents. The uses of OPSIN as a web service and as a tool for
identifying chemical names in text are shown to demonstrate the direct
utility of this algorithm.
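The idea of a regular grammar driving tokenisation, with the tokens then
emitted as XML, can be illustrated with a minimal sketch. This is not OPSIN's
grammar or code: the token classes, the tiny vocabulary, and the function
names `tokenise` and `to_xml` are all hypothetical, chosen only to show the
technique on a few simple names, and the XML produced here is flat rather
than a full parse tree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy vocabulary (an assumption for illustration only);
# a real name-to-structure grammar covers vastly more nomenclature.
# Alternatives are tried left to right, so "methyl" is tested before "meth".
TOKEN_PATTERN = re.compile(
    r"(?P<locant>\d+(?:,\d+)*)"
    r"|(?P<hyphen>-)"
    r"|(?P<substituent>chloro|bromo|methyl|ethyl)"
    r"|(?P<stem>meth|eth|prop|but)"
    r"|(?P<unsaturation>ane|an)"
    r"|(?P<suffix>ol)"
)


def tokenise(name):
    """Split a chemical name into (token_class, text) pairs, left to right."""
    tokens, pos = [], 0
    while pos < len(name):
        match = TOKEN_PATTERN.match(name, pos)
        if match is None:
            raise ValueError(f"cannot tokenise {name!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def to_xml(name):
    """Serialise the token sequence as a flat XML document."""
    root = ET.Element("name")
    for kind, text in tokenise(name):
        ET.SubElement(root, kind).text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(tokenise("2-chloroethanol"))
    print(to_xml("2-methylpropane"))
```

Names outside the toy vocabulary (for example "benzene") raise a
`ValueError`, which mirrors in miniature how a grammar-driven parser rejects
input it cannot cover.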
A software system for extracting chemical reactions from the text of
chemical patents was developed. The system relies on the output of
ChemicalTagger, a tool for tagging words and identifying phrases of
importance in experimental chemistry text. Improvements to this tool
required to facilitate this task are documented. The structures of chemical
entities are, where possible, determined using OPSIN in conjunction with a
dictionary of name to structure relationships. Extracted reactions are
atom mapped to confirm that they are chemically consistent. 424,621 atom
mapped reactions were extracted from 65,034 organic chemistry USPTO
patents. On a sample of 100 of these extracted reactions chemical entities
were identified with 96.4% recall and 88.9% precision. Quantities could be
associated with reagents in 98.8% of cases and 64.9% of cases for products
whilst the correct role was assigned to chemical entities in 91.8% of
cases. Qualitatively the system captured the essence of the reaction in
95% of cases. This system is expected to be useful in the creation of
searchable databases of reactions from chemical patents and in
facilitating analysis of the properties of large populations of reactions.
Special Libraries, July 1984
Volume 75, Issue 3
Data bases and data base systems related to NASA's aerospace program. A bibliography with indexes
This bibliography lists 1778 reports, articles, and other documents introduced into the NASA scientific and technical information system from 1975 through 1980.
Cheminformatics and Computational Approaches for Identifying and Managing Unknown Chemicals in the Environment
In most societies, using chemical products has become a part of daily life. Worldwide, over 350,000 chemicals have been registered for use in, e.g., daily household consumption, industrial processes, and agriculture. However, despite the benefits chemicals may bring to society, their usage, production, and disposal, which lead to their eventual release into the environment, have multiple implications. Anthropogenic chemicals have been detected in myriad ecosystems all over the planet, as well as in the tissues of wildlife and humans. The potential consequences of such chemical pollution are not fully understood, but links to the onset of human disease and threats to biodiversity have been attributed to the presence of chemicals in our environment.
Mitigating the potential negative effects of chemicals typically involves regulatory steps and multiple stakeholders. One key aspect thereof is environmental monitoring, which consists of environmental sampling, measurement, data analysis, and reporting. In recent years, advancements in Liquid Chromatography-High Resolution Mass Spectrometry (LC-HRMS), open chemical databases, and software have enabled researchers to identify known (e.g., pesticides) as well as unknown environmental chemicals, commonly referred to as suspect or non-target compounds. However, identifying unknown chemicals, particularly non-targets, remains extremely challenging because of the lack of a priori knowledge on the analytes - all that is available are their mass spectrometry signals. In fact, the number of unknown features in a typical mass spectrum of an environmental sample is in the range of thousands to tens of thousands, and therefore requires feature prioritisation before identification within a suitable workflow.
In this dissertation work, collaborations with two regulatory authorities responsible for environmental monitoring sought to identify relevant unknown compounds in the environment, specifically by developing computational workflows for unknown identification in LC-HRMS data. The first collaboration culminated in Publication A, which involved a joint project with the Zürcher Amt für Wasser, Energie und Luft. Environmental samples taken from wastewater treatment plant sites in Switzerland were retrospectively analysed using a pre-screening workflow that prioritised features suitable for non-target identification. For this purpose, a multi-step Quality Control algorithm that checks the quality of mass spectral data in terms of peak intensities, alignment, and signal-to-noise ratio was developed and used within pre-screening. This algorithm was incorporated into the R package Shinyscreen. Features that were prioritised by pre-screening then underwent identification using the in silico fragmentation tool MetFrag. To obtain these identifications, MetFrag was coupled to various open chemical information resources such as spectral databases like MassBank Europe and MassBank of North America, as well as suspect lists from the NORMAN Suspect List Exchange and the CompTox Chemicals Dashboard database. One confirmed and twenty-one tentative compound identifications were achieved and reported according to an established confidence level scheme. Comprehensive data interpretation and detailed communication of MetFrag’s results was performed as a means of formulating evidence-based recommendations that may inform future environmental monitoring campaigns.
Building on the pre-screening and identification workflow developed in Publication A, Publication B resulted from a collaboration with the Luxembourgish Administration de la gestion de l’eau that sought to identify, and where possible quantify unknown chemicals in Luxembourgish surface waters. More specifically, surface water samples collected as part of a two-year national monitoring campaign were measured using LC-HRMS and screened for pharmaceutical parent compounds and their transformation products. Compared to pharmaceutical compound information, which is publicly available from local authorities (and was used in the suspect list), information on transformation products is relatively scarce. Therefore, new approaches were developed in this work to mine data from the PubChem database as well as from the literature in order to formulate a suspect list containing pharmaceutical transformation products, in addition to their parent compounds. Overall, 94 pharmaceuticals and 14 transformation products were identified, of which 88 and 2 were confirmed identifications respectively. The spatio-temporal occurrence and distribution of these compounds throughout the Luxembourgish environment were analysed using advanced data visualisations that highlighted patterns in certain regions and time periods of high incidence. These findings may support future chemicals management measures, particularly in environmental monitoring.
Another challenging aspect of managing chemicals is that they mostly exist as complex mixtures within the environment as well as chemical products. Substances of Unknown or Variable composition, Complex reaction products or Biological materials (UVCBs) make up 20-40% of international chemical registries and include chlorinated paraffins, polymer mixtures, petroleum fractions, and essential oils. However, little is known about their chemical identities and/or compositions, which poses formidable obstacles to assessing their environmental fate and toxicity, let alone identification in the environment. Publication C addresses the challenges of UVCBs by taking an interdisciplinary approach in reviewing the literature that incorporates considerations of their chemical representations, toxicity, environmental fate, exposure, and regulatory approaches. Improved substance registration requirements, grouping techniques to simplify assessment, and the use of Mixture InChI to represent UVCBs in a findable, accessible, interoperable, and reusable (FAIR) way in databases are amongst the key recommendations of this work.
A specific type of UVCB, mixtures of homologous compounds, are commonly detected in environmental samples, including many High Production Volume (HPV) compounds such as surfactants. Compounds forming homologous series are related by a common core fragment and repeating chemical subunit, and can be represented using general formulae (e.g., CnF2n+1COOH) and/or Markush structures. However, a significant identification bottleneck is the inability to match their characteristic analytical signals in LC-HRMS data with chemicals in databases; while comb-like elution patterns and constant differences in mass-to-charge ratio indicate the presence of homologous series in samples, most chemical databases do not contain annotated homologous series. To address this gap, Publication D introduces a cheminformatics algorithm, OngLai, to detect homologous series within compound datasets. OngLai, openly implemented in Python using the RDKit, detects homologous series based on two inputs: a list of compounds and the chemical structure of a repeating unit. OngLai was applied to three open datasets from environmental chemistry, exposomics, and natural products, in which thousands of homologous series with a CH2 repeating unit were detected. Classification of homologous series in compound datasets is expected to advance their analytical detection in samples.
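The constant mass difference that marks a homologous series can be shown with
a short sketch. This is not OngLai, which works on chemical structures with
the RDKit; it is a simplified, mass-only illustration that assumes the general
formula CnF2n+1COOH from the text, uses neutral monoisotopic masses rather
than measured m/z values, and introduces hypothetical helper names
(`pfca_counts`, `mono_mass`, `find_series`) and an arbitrary tolerance.

```python
# Monoisotopic masses (in u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "F": 18.99840316}

# Mass of one CF2 repeating unit, the expected spacing between homologues.
CF2_DELTA = MONOISOTOPIC["C"] + 2 * MONOISOTOPIC["F"]


def pfca_counts(n):
    """Element counts for the general formula CnF2n+1COOH."""
    return {"C": n + 1, "F": 2 * n + 1, "O": 2, "H": 1}


def mono_mass(counts):
    """Monoisotopic mass of a molecule given its element counts."""
    return sum(MONOISOTOPIC[element] * k for element, k in counts.items())


def find_series(masses, delta, tol=0.002, min_len=3):
    """Greedily chain masses whose successive differences equal delta within tol."""
    remaining = sorted(masses)
    series = []
    while remaining:
        chain = [remaining.pop(0)]
        while True:
            target = chain[-1] + delta
            match = next((m for m in remaining if abs(m - target) <= tol), None)
            if match is None:
                break
            remaining.remove(match)
            chain.append(match)
        if len(chain) >= min_len:
            series.append(chain)
    return series


if __name__ == "__main__":
    # Five homologues (n = 3..7) mixed with two unrelated, made-up masses.
    homologues = [mono_mass(pfca_counts(n)) for n in range(3, 8)]
    observed = homologues + [180.0634, 255.2324]
    for chain in find_series(observed, CF2_DELTA):
        print([round(m, 4) for m in chain])
```

A mass-only search like this can flag candidate series in a feature list, but
it cannot confirm that the members share a common core; that structural check
is what a structure-based approach adds.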
Overall, the work in this dissertation contributed to the advancement of identifying and managing unknown chemicals in the environment using cheminformatics and computational approaches. All work conducted followed Open Science and FAIR data principles: all code, datasets, analyses, and results generated, including the final peer-reviewed publications, are openly available to the public. These efforts are intended to spur further developments in unknown chemical identification and management towards protecting the environment and human health.
Play Among Books
How does coding change the way we think about architecture? Miro Roman and his AI Alice_ch3n81 develop a playful scenario in which they propose coding as the new literacy of information. They convey knowledge in the form of a project model that links the fields of architecture and information through two interwoven narrative strands in an “infinite flow” of real books.