99 research outputs found
A User Interface for Extracting Chemical-Structure Information from the Systematic Name of an Organic Compound
A user interface, "Nomenclature Generator", for the automatic extraction of chemical-structure information from the systematic name of an organic compound given in IUPAC nomenclature has been developed at the All-Russian Institute for Scientific and Technical Information (VINITI) of the Russian Academy of Sciences.
Extraction of chemical structures and reactions from the literature
The ever-increasing quantity of chemical literature necessitates
the creation of automated techniques for extracting relevant information.
This work focuses on two aspects: the conversion of chemical names to
computer readable structure representations and the extraction of chemical
reactions from text.
Chemical names are a common way of communicating chemical structure
information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an
open-source, freely available algorithm for converting chemical names to
structures, was developed. OPSIN employs a regular grammar to direct
tokenisation and parsing, leading to the generation of an XML parse tree.
Nomenclature operations are applied successively to the tree with many
requiring the manipulation of an in-memory connection table representation
of the structure under construction. Areas of nomenclature supported are
described with attention being drawn to difficulties that may be
encountered in name to structure conversion. Results on sets of generated
names and names extracted from patents are presented. On generated names,
recall of between 96.2% and 99.0% was achieved, with a lower bound of 97.9%
on precision; all results were comparable or superior to the tested
commercial solutions. On the patent names, OPSIN's recall was 2-10%
higher than that of the tested solutions when the names were processed as
found in the patents. The uses of OPSIN as a web service and as a tool for
identifying chemical names in text are shown to demonstrate the direct
utility of this algorithm.
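The grammar-directed tokenisation and conversion described above can be illustrated with a deliberately tiny sketch. The token table, pattern, and function below are invented for illustration and bear no relation to OPSIN's actual grammar files; they only show the idea of a regular grammar splitting a name into a root multiplier and a suffix, then emitting a structure representation (here, SMILES).

```python
import re

# Hypothetical toy grammar: root tokens name a chain length, suffix tokens
# name a functional-group operation. Real IUPAC grammars are far larger.
MULTIPLIER = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
NAME_PATTERN = re.compile(r"^(meth|eth|prop|but|pent|hex)(ane|anol)$")

def name_to_smiles(name: str) -> str:
    """Convert a trivially simple systematic name to SMILES."""
    m = NAME_PATTERN.match(name.lower())
    if not m:
        raise ValueError(f"unparseable name: {name}")
    root, suffix = m.groups()
    chain = "C" * MULTIPLIER[root]          # build the carbon skeleton
    return chain + ("O" if suffix == "anol" else "")

print(name_to_smiles("butane"))   # CCCC
print(name_to_smiles("ethanol"))  # CCO
```

Even this toy shows why name-to-structure conversion is brittle: every unrecognised token leaves the name unparseable, which is one reason recall on real patent names is lower than on generated names.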
A software system for extracting chemical reactions from the text of
chemical patents was developed. The system relies on the output of
ChemicalTagger, a tool for tagging words and identifying phrases of
importance in experimental chemistry text. Improvements to this tool
required to facilitate this task are documented. The structures of chemical
entities are, where possible, determined using OPSIN in conjunction with a
dictionary of name-to-structure relationships. Extracted reactions are
atom mapped to confirm that they are chemically consistent. 424,621 atom
mapped reactions were extracted from 65,034 organic chemistry USPTO
patents. On a sample of 100 of these extracted reactions chemical entities
were identified with 96.4% recall and 88.9% precision. Quantities could be
associated with reagents in 98.8% of cases and 64.9% of cases for products
whilst the correct role was assigned to chemical entities in 91.8% of
cases. Qualitatively the system captured the essence of the reaction in
95% of cases. This system is expected to be useful in the creation of
searchable databases of reactions from chemical patents and in
facilitating analysis of the properties of large populations of reactions.
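Atom mapping, as used above to confirm that extracted reactions are chemically consistent, assigns a correspondence between individual atoms on the two sides of a reaction. A much weaker but related sanity check — that the element counts on each side balance — can be sketched in a few lines. The formula strings and helper names here are illustrative only.

```python
import re
from collections import Counter

def formula_counts(formula: str) -> Counter:
    """Count atoms in a simple molecular formula such as 'C2H6O' (no brackets)."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] += int(num or 1)
    return counts

def is_atom_balanced(reactants, products) -> bool:
    """Crude consistency check: every atom on the left appears on the right."""
    left = sum((formula_counts(f) for f in reactants), Counter())
    right = sum((formula_counts(f) for f in products), Counter())
    return left == right

# Esterification: acetic acid + ethanol -> ethyl acetate + water
print(is_atom_balanced(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"]))  # True
```

A reaction that fails even this coarse test cannot be atom mapped, so a check of this kind is a cheap first filter before the much more expensive mapping step.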
Chemoinformatics approaches for new drugs discovery
Chemoinformatics uses computational methods and technologies to solve chemical problems. It works on molecular structures, their representations, properties and related data. The first and most important phase in this field is the translation of interconnected atomic systems into in-silico models, ensuring complete and correct chemical-information transfer. In the last 20 years chemical databases have evolved from molecular repositories into research tools for the identification of new drugs, while modern high-throughput technologies allow for a continuous increase in chemical-library size, as highlighted by publicly available repositories such as PubChem [http://pubchem.ncbi.nlm.nih.gov/], ZINC [http://zinc.docking.org/] and ChemSpider [http://www.chemspider.com/]. The fundamental requirements for chemical libraries are molecular uniqueness, absence of ambiguity, chemical correctness (of atoms, bonds and chemical orthography), and standardized storage and registration formats. The aim of this work is the development of chemoinformatics tools and data for the drug-discovery process. The first part of the research project focused on the analysis of the accessible commercial chemical space, looking for molecular redundancy and in-silico model correctness in order to identify a unique and univocal molecular descriptor for chemical-library indexing. This allowed 0% redundancy to be achieved on a 42-million-compound library. The protocol was implemented as MMsDusty, a web-based tool for cleaning molecular databases. The major protocol developed is MMsINC, a chemoinformatics platform based on a starting set of 4 million non-redundant, high-quality, annotated and biomedically relevant chemical structures; the library is now being expanded to 460 million compounds. MMsINC can perform various types of queries, such as substructure or similarity search and descriptor filtering; it is interfaced with the PDB (Protein Data Bank) [http://www.rcsb.org/pdb/home/home.do] and related to approved drugs. The second protocol developed is pepMMsMIMIC, a peptidomimetic screening tool based on multiconformational chemical libraries; the screening process uses pharmacophoric-fingerprint similarity to identify small molecules able to geometrically and chemically mimic endogenous peptides or proteins. The last part of the project led to the implementation of an optimized and exhaustive conformational-space analysis protocol for small-molecule libraries; this is crucial for the high-quality 3D molecular-model prediction required in chemoinformatics applications. The torsional exploration was optimized within the range of the most frequent dihedral angles observed in X-ray-solved small-molecule structures in the CSD (Cambridge Structural Database); applying this protocol to an 89-million-structure library generated a library of 2.6 × 10^7 high-quality conformers. The tools, protocols and platforms developed in this work allow chemoinformatics analysis and screening of large chemical libraries, yielding high-quality, correct and unique chemical data and in-silico models.
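The redundancy-elimination step described above relies on a unique, univocal molecular descriptor for library indexing. A minimal sketch of the idea, with invented helper names: compute a canonical key for each entry and keep only the first occurrence. Real pipelines canonicalise the structure itself (tautomers, salts, charges) before hashing; this toy only normalises whitespace and case in a SMILES-like string.

```python
import hashlib

def canonical_key(smiles: str) -> str:
    """Toy canonical key: normalise the string, then hash to a fixed length.
    NOT a real canonicalisation - chemically distinct strings can collide
    under this naive uppercasing."""
    normalised = "".join(smiles.split()).upper()
    return hashlib.sha256(normalised.encode()).hexdigest()[:16]

def deduplicate(library):
    """Keep the first occurrence of each canonical key."""
    seen, unique = set(), []
    for entry in library:
        key = canonical_key(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

print(len(deduplicate(["CCO", "cco", "CC O", "CCC"])))  # 2
```

Because the key is a fixed-length hash, membership testing stays O(1) per entry, which is what makes 0%-redundancy checks feasible on multi-million-compound libraries.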
Automatic Analysis and Validation of the Chemical Literature
Thesis. Methods to automatically extract and validate data from the chemical literature in legacy formats into machine-understandable forms are examined. The work focuses on three types of data: analytical data reported in articles, computational chemistry output files, and crystallographic information files (CIFs). It is shown that machines are capable of reading and extracting analytical data from the current legacy formats with high recall and precision. Regular expressions cannot identify chemical names with high precision or recall, but non-deterministic methods perform significantly better. The lack of machine-understandable connection tables in the literature has been identified as the major issue preventing molecule-based, data-driven science being performed in the area. The extraction of data from computational chemistry output files using parser-like approaches is shown not to be generally possible, although such methods work well for input files. A hierarchical, regular-expression-based approach can parse > 99.9% of the output files correctly, although significant human input is required to prepare the templates. CIFs may be parsed with extremely high recall and precision; they contain connection tables, and the data is of high quality. The comparison of bond lengths calculated by two computational chemistry programs shows good agreement in general, but structures containing specific moieties cause discrepancies. An initial protocol for the high-throughput geometry optimisation of molecules extracted from the CIFs is presented and the refinement of this protocol is discussed. Differences of less than 0.03 Ångström between calculated bond lengths and experimentally determined values from the CIFs are shown to be expected by random error. The final protocol is used to find high-quality structures from crystallography which can be reused for further science.
Unilever Centre for Molecular Science Informatics
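Extraction of analytical data from legacy text, as examined above, often starts from regular-expression templates. The sketch below shows one hypothetical pattern for a common way NMR data is reported in experimental sections; real legacy formats vary far more than any single pattern covers, which is why template preparation needs significant human input.

```python
import re

# Hypothetical template for strings like "1H NMR (400 MHz, CDCl3) δ 7.26, ..."
NMR_RE = re.compile(
    r"(?P<nucleus>1H|13C) NMR \((?P<freq>\d+) MHz, (?P<solvent>[A-Za-z0-9-]+)\)"
    r"\s*(?:δ|delta)\s*(?P<shifts>[\d.,\s-]+)"
)

def extract_nmr(text: str):
    """Pull structured NMR records out of free experimental text."""
    results = []
    for m in NMR_RE.finditer(text):
        results.append({
            "nucleus": m.group("nucleus"),
            "frequency_mhz": int(m.group("freq")),
            "solvent": m.group("solvent"),
            "shifts": [float(s) for s in re.findall(r"\d+\.?\d*", m.group("shifts"))],
        })
    return results

sample = "1H NMR (400 MHz, CDCl3) δ 7.26, 3.71, 1.22"
print(extract_nmr(sample))
```

The limits of this approach are exactly those reported above: a regex template achieves high precision on the formats it anticipates but cannot generalise, so chemical-name identification needs non-deterministic methods instead.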
Creating a Symbol of Science: The Development of a Standard Periodic Table of the Elements
It is probably a surprise to most people that the periodic table they remember from high school chemistry is not the only periodic table – and never has been. Currently there are probably over a thousand different forms. The table in your chemistry textbook or on the wall chart in your chemistry classroom is not the periodic table. It is simply the most commonly used form. In fact, the International Union of Pure and Applied Chemistry (IUPAC), the international standards-making body for chemistry, has stated that although they encourage the use of this form, they will not endorse any one form of the periodic table as the periodic table. So where did this form come from? How did it come to be the current standard form of the periodic table? Most writing on the periodic table does not address such questions. For what is widely regarded as an icon of science, little is actually known about the origin of its form.
This dissertation aims to answer the questions of how the current standard form of the periodic table was developed and how it came to be ubiquitous in classrooms and textbooks. In it, I highlight the practical nature of chemistry, which influenced not only the development and acceptance of the periodic law but the creation of graphical representations of the periodic system that placed an emphasis on utility rather than art. I examine the role of research and pedagogy in the development of classification schemes for the elements, particularly the periodic system. I argue that the role played by pedagogy was more influential than that of research in the creation of new classification systems and the multiplicity of graphical representations of the periodic law. In the case of the periodic table, research-down theories about pedagogy, in which textbooks are seen merely as codifications of accepted scientific knowledge, do not hold true.
Data Base Mapping Model and Search Scheme to Facilitate Resource Sharing: Volume 1, Mapping of Chemical Data Bases and Mapping of Data Base Data Elements Using a Rational Data Base Structure
Coordinated Science Laboratory was formerly known as Control Systems Laboratory.
National Science Foundation / NSF SIS 74-1855
Automated analysis and validation of open chemical data
Methods to automatically extract Open Data from the chemical literature,
validate it, and use it to validate theory are examined.
Chemical identifiers which assist the automatic location of chemical structures
using commercial Web search engines are investigated. The IUPAC
International Chemical Identifier (InChI) gives almost 100% recall and precision,
though it is shown to be too long for present search engines. A combination
of InChI and InChIKey, a shorter, fixed-length hash of the InChI
string, is concluded to be the best current method of identifying structures.
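The appeal of the InChIKey described above is that it turns a variable-length, punctuation-heavy identifier into a short, fixed-length, letter-only string that web search engines index cleanly. The sketch below illustrates that idea only; it is NOT the real InChIKey algorithm, which is a SHA-2-based hash with a defined 27-character block structure.

```python
import hashlib
import string

def fixed_length_key(inchi: str, length: int = 14) -> str:
    """Toy fixed-length, uppercase-letter key derived from an InChI string.
    Illustrative only - the real InChIKey layout and hashing differ."""
    digest = hashlib.sha256(inchi.encode()).digest()
    letters = string.ascii_uppercase
    return "".join(letters[b % 26] for b in digest[:length])

# Ethanol's InChI is long and full of characters search engines mangle;
# the derived key is short, fixed-length, and letters-only.
key = fixed_length_key("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")
print(key, len(key))
```

The trade-off is the one noted above: the hash is not invertible, so the full InChI must be published alongside the key for the structure to be recoverable.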
The proportion of published, Open Crystallographic Information Files
(CIFs) that are valid with respect to the specification is shown to be improving,
and is around 99% in 2007. The error rate in the conversion of valid
CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine
generation of connection tables from CIFs requires many heuristics, and in
some cases it is impossible to deduce the exact connection table.
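One of the heuristics alluded to above infers bonds from interatomic distances: two atoms are judged bonded if they sit within the sum of their covalent radii plus a tolerance. The radii and tolerance below are illustrative round numbers; real implementations use full element tables, element-pair special cases, and symmetry-generated atoms, and still hit cases where the exact connection table cannot be deduced.

```python
import math

# Hypothetical covalent radii (Å) for a few elements.
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
TOLERANCE = 0.4  # Å of slack added to the sum of radii

def infer_bonds(atoms):
    """atoms: list of (element, x, y, z). Returns index pairs judged bonded."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            ei, xi = atoms[i][0], atoms[i][1:]
            ej, xj = atoms[j][0], atoms[j][1:]
            dist = math.dist(xi, xj)
            if dist <= COVALENT_RADIUS[ei] + COVALENT_RADIUS[ej] + TOLERANCE:
                bonds.append((i, j))
    return bonds

# A C and O 1.13 Å apart (bonded), plus a distant, unbonded O.
atoms = [("C", 0.0, 0.0, 0.0), ("O", 1.13, 0.0, 0.0), ("O", 5.0, 0.0, 0.0)]
print(infer_bonds(atoms))  # [(0, 1)]
```

Distance alone says nothing about bond order or charge, which is one reason the text above concludes that in some cases the exact connection table is irrecoverable from a CIF.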
CrystalEye, a fully-automated system for the reformulation of the fragmented
crystallographic Web into a structured XML-based repository is described.
Published, Open CIFs can be located and aggregated programmatically
with almost 100% recall. It is shown that, by converting CIF data
to CML, software can be created to use the latest Web standards and technologies
to enhance the ability of Web users to browse, find, keep updated,
download and reuse the latest published crystallography.
A workflow for the high-throughput calculation of solid-state geometry
using a semi-empirical method is described. A wide-range of organic and
inorganic systems provided by CrystalEye are used to test both the data and
the method. Several errors in the method are discovered, many of which can
be attributed to the parameterization process.
An Open NMR experiment to perform high-throughput prediction of 13C
chemical shifts using a GIAO protocol is described. The data and analysis
were provided on publicly-available webpages to enable crowdsourcing, which
assisted in discovering an error rate of 6.1% in the starting data. The protocol
was refined during the work and shown to have an average unsigned error
of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the errors
observed elsewhere for general structures using HOSE-code and neural-network
methods.
A treatment of stereochemistry in computer aided organic synthesis
This thesis describes the author’s contributions to a new stereochemical processing module constructed for the ARChem retrosynthesis program. The purpose of the module is to add the ability to perform enantioselective and diastereoselective retrosynthetic disconnections and generate appropriate precursor molecules. The module uses evidence based rules generated from a large database of literature reactions.
Chapter 1 provides an introduction and critical review of the published body of work for computer aided synthesis design. The role of computer perception of key structural features (rings, functional groups, etc.) and the construction and use of reaction transforms for generating precursors is discussed. Emphasis is also given to the application of strategies in retrosynthetic analysis. The availability of large reaction databases has enabled a new generation of retrosynthesis design programs to be developed that use automatically generated transforms assembled from published reactions. A brief description of the transform generation method employed by ARChem is given.
Chapter 2 describes the algorithms devised by the author for handling the computer recognition and representation of the stereochemical features found in molecule and reaction scheme diagrams. The approach is generalised and uses flexible recognition patterns to transform information found in chemical diagrams into concise stereo descriptors for computer processing. An algorithm for efficiently comparing and classifying pairs of stereo descriptors is described. This algorithm is central for solving the stereochemical constraints in a variety of substructure matching problems addressed in chapter 3. The concise representation of reactions and transform rules as hyperstructure graphs is described.
Chapter 3 is concerned with the efficient and reliable detection of stereochemical symmetry in molecules, reactions and rules. A novel symmetry perception algorithm, based on a constraint satisfaction problem (CSP) solver, is described. The use of a CSP solver to implement an isomorph‐free matching algorithm for stereochemical substructure matching is detailed. The prime function of this algorithm is to seek out unique retron locations in target molecules and then to generate precursor molecules without duplications due to symmetry. Novel algorithms for classifying asymmetric, pseudo‐asymmetric and symmetric stereocentres; meso, centro, and C2 symmetric molecules; and the stereotopicity of trigonal (sp2) centres are described.
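The symmetry perception described in Chapter 3 amounts to finding the automorphisms of a molecular graph: vertex permutations that map the edge set onto itself. The brute-force sketch below enumerates all permutations, which is only feasible for tiny graphs; it illustrates the problem a CSP-based solver prunes intelligently rather than the thesis's actual algorithm.

```python
from itertools import permutations

def automorphisms(n, edges):
    """Enumerate vertex permutations of an n-vertex graph that preserve
    the (undirected) edge set. Exponential - illustration only; real
    symmetry perception uses constraint propagation to prune the search."""
    edge_set = {frozenset(e) for e in edges}
    autos = []
    for perm in permutations(range(n)):
        if {frozenset((perm[a], perm[b])) for a, b in edges} == edge_set:
            autos.append(perm)
    return autos

# A 3-atom chain 0-1-2: swapping the two terminal atoms is a symmetry,
# so the automorphism group has size 2 (identity plus the swap).
print(len(automorphisms(3, [(0, 1), (1, 2)])))  # 2
```

Duplicate precursors arise exactly when a retron match can be carried onto an equivalent match by such an automorphism, which is why isomorph-free matching needs the symmetry group up front.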
Chapter 4 introduces and formalises the annotated structural language used to create both retrosynthetic rules and the patterns used for functional group recognition. A novel functional group recognition package is described along with its use to detect important electronic features such as electron‐withdrawing or donating groups and leaving groups. The functional groups and electronic features are used as constraints in retron rules to improve transform relevance.
Chapter 5 details the approach taken to design detailed stereoselective and substrate-controlled transforms from organised hierarchies of rules. The rules employ a rich set of constraint annotations that concisely describe the keying retrons. The application of the transforms for collating evidence-based scoring parameters from published reaction examples is described. A survey of available reaction databases and of techniques for mining stereoselective reactions is presented. A data mining tool was developed for finding the best reputable stereoselective reaction types for coding as transforms.
For various reasons it was not possible during the research period to fully integrate this work with the ARChem program. Instead, Chapter 6 introduces a novel one‐step retrosynthesis module to test the developed transforms. The retrosynthesis algorithms use the organisation of the transform rule hierarchy to efficiently locate the best retron matches using all applicable stereoselective transforms. This module was tested using a small set of selected target molecules and the generated routes were ranked using a series of measured parameters including: stereocentre clearance and bond cleavage; example reputation; estimated stereoselectivity with reliability; and evidence of tolerated functional groups. In addition a method for detecting regioselectivity issues is presented.
This work presents a number of algorithms using common set- and graph-theory operations and notations. Appendix A lists the set-theory symbols and their meanings. Appendix B summarises and defines the common graph-theory terminology used throughout this thesis.
Biochemistry students' difficulties with the symbolic and visual language used in molecular biology.
Thesis (Ph.D.), University of KwaZulu-Natal, Pietermaritzburg, 2007.
This study reports on recurring difficulties experienced by undergraduate students with respect to the understanding and interpretation of certain symbolism, nomenclature, terminology, shorthand notation, models and other visual representations employed in the field of Molecular Biology to communicate information. Based on teaching experience and guidelines set out by a four-level methodological framework, data on various topic-related difficulties was obtained by inductive analyses of students' written responses to specifically designed free-response and focused probes. In addition, interviews, think-aloud exercises and student-generated diagrams were also used to collect information.
Both unanticipated and recurring difficulties were compared with scientifically correct propositional knowledge, categorized and subsequently classified. Students were adept at providing the meaning of the symbol “Δ” in various scientific contexts; however, some failed to recognize its use to depict the deletion of a leucine biosynthesis gene in the
form, Δ leu. “Hazard to leucine”, “change to leucine” and “abbreviation for isoleucine” were some of the erroneous interpretations of this polysemic symbol. Investigations on
these definitions suggest a constructivist approach to knowledge construction and the inappropriate transfer of knowledge from prior mental schemata. The symbol, “::”, was
poorly differentiated by students in its use to indicate gene integration or transposition and in tandem gene fusion. Idiosyncratic perceptions emerged suggesting that it is, for
example, a proteinaceous component linking genes in a chromosome or the centromere itself associated with the mitotic spindle or “electrons” between genes in the same way
that it is symbolically shown in Lewis dot diagrams which illustrate covalent bonding between atoms. In an oligonucleotide shorthand notation, some students used valency to differentiate the phosphite trivalent form of the phosphorus atom from the pentavalent phosphodiester group, yet the concept of valency was poorly understood. By virtue of the visual form of a shorthand notation of the 3′,5′-phosphodiester link in DNA, the valency was incorrectly read. VSEPR theory and the Octet Rule were misunderstood or forgotten when trying to explain the valency of the phosphorus atom in synthetic oligonucleotide intermediates. Plasmid functional domains were generally well-understood although restriction mapping appeared to be a cognitively demanding task. Rote learning and substitution of definitions were evident in the explanation of promoter and operator
functions. The concept of gene expression posed difficulties to many students, who believed that genes contain the entity they encode. Transcription and translation of in tandem gene fusions were poorly explained by some students, as was the effect of plasmid conformation on transformation and gene expression. With regard to the selection of transformants or the hybridoma, some students could not engage in reasoning or lateral thinking as protoconcepts and domain-specific information were poorly understood. A failure to integrate and reason with factual information on phenotypic traits, media components and biochemical pathways was evident in written and oral presentations. DNA-strand nomenclature and associated function were problematic to some students as
they failed to differentiate coding strand from template strand and were prone to interchange the labelling of these. A substitution of labels with those characterizing DNA replication intermediates demonstrated erroneous information transfer. DNA replication models posed difficulties integrating molecular mechanisms and detail with line drawings, coupled with inaccurate illustrations of sequential replication features. Finally, a remediation model is presented, demonstrating a shift in assessment score dispersion from a range of 0 - 4.5 to 4 - 9 when learners are guided metacognitively to work with domain-specific or critical knowledge from an information bank. The present work shows that varied forms of symbolism can present students with complex learning difficulties as the underlying information depicted by these is understood in a superficial way. It is imperative that future studies be focused on the standardization of symbol use, perhaps governed by convention that determines the manner in which threshold information is disseminated on symbol use, coupled by innovative teaching strategies which facilitate an improved understanding of the use of symbolic representations in Molecular Biology. As Molecular Biology advances, it is likely that experts will continue to use new and diverse forms of symbolic representations to explain their findings. The explanation of futuristic Science is likely to develop a symbolic language that will impose great teaching
challenges and unimaginable learning difficulties on new-generation teachers and learners, respectively.