2,667 research outputs found

    Automated extraction of chemical structure information from digital raster images

    Background: To search for chemical structures in research articles, diagrams or text representing molecules need to be translated into a standard chemical file format compatible with cheminformatic search engines. However, chemical information in research articles is often presented as analog diagrams of chemical structures embedded in digital raster images. Several software systems have been developed to automate the analog-to-digital conversion of chemical structure diagrams in scientific research articles, but their algorithmic performance and utility in cheminformatic research have not been investigated. Results: This paper provides a critical review of these systems and reports our recent development of ChemReader, a fully automated tool for extracting chemical structure diagrams from research articles and converting them into standard, searchable chemical file formats. The basic algorithms for recognizing the lines and letters that represent bonds and atoms in chemical structure diagrams can be run independently, in sequence, from a graphical user interface, and the algorithm parameters can be readily changed, to facilitate additional development tailored to a specific chemical database annotation scheme. Compared with existing software programs such as OSRA, Kekule, and CLiDE, our results indicate that ChemReader outperforms the other systems on several sets of sample images from diverse sources, in terms of both the rate of correct outputs and the accuracy of extracted molecular substructure patterns. Conclusion: The availability of ChemReader as a cheminformatic tool for extracting chemical structure information from digital raster images allows research and development groups to enrich their chemical structure databases by annotating the entries with published research articles.
Based on its stable performance and high accuracy, ChemReader may be sufficiently accurate for annotating chemical databases with links to scientific research articles. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/90875/1/Saitou8.pd

    Inductive queries for a drug designing robot scientist

    It is increasingly clear that machine learning algorithms need to be integrated into an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and the computer provides guidance for the experiments being performed. In this chapter, we summarise several key challenges in integrating machine learning and data mining algorithms into methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss how to represent molecular data so that knowledge discovery tools can analyse it; and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments.

    OrChem - An open source chemistry search engine for Oracle®

    Background: Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. However, little detail has been published on the inner workings of search engines and their development has been mostly closed-source. We decided to develop an open source chemistry extension for Oracle, the de facto database platform in the commercial world. Results: Here we present OrChem, an extension for the Oracle 11G database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the Chemistry Development Kit. OrChem provides similarity searching with response times in the order of seconds for databases with millions of compounds, depending on a given similarity cut-off. For substructure searching, it can make use of multiple processor cores on today's powerful database servers to provide fast response times in equally large data sets. Availability: OrChem is free software and can be redistributed and/or modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation. All software is available via http://orchem.sourceforge.net.
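The similarity search with a cut-off mentioned in the OrChem abstract can be illustrated with a minimal sketch. It is not OrChem's implementation: it assumes Tanimoto similarity over bit fingerprints stored as Python integers, and the function names and toy fingerprints are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in a fingerprint stored as a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def similarity_search(query: int, database: dict, cutoff: float) -> list:
    """Return (name, score) pairs with score >= cutoff, best match first."""
    hits = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    hits = [(name, score) for name, score in hits if score >= cutoff]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy 4-bit fingerprints (hypothetical data, for illustration only).
toy_db = {"a": 0b1111, "b": 0b1100, "c": 0b0011}
print(similarity_search(0b1110, toy_db, cutoff=0.6))
```

A higher cut-off leaves fewer candidates to score and return, which is one reason response time can depend on the chosen cut-off.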

    Entwicklung einer computergestützten Methode zum reaktionsbasierten De-Novo-Design wirkstoffartiger Verbindungen

    A new method for computer-based de novo design of drug candidate structures is proposed. DOGS (Design of Genuine Structures) features a ligand-based strategy to suggest new molecular structures. The quality of designed compounds is assessed by a graph kernel method measuring the distance of designed molecules to a known reference ligand. Two graph representations of molecules (molecular graph and reduced graph) are implemented to feature different levels of abstraction from the molecular structure. A fully deterministic construction procedure explicitly designed to facilitate synthesizability of proposed structures is realized: DOGS uses readily available synthesis building blocks and established reaction schemes to assemble new molecules. This approach enables the software to propose not only the final compounds, but also to give suggestions for synthesis routes to generate them at the bench. The set of synthesis schemes comprises about 83 chemical reactions. Special focus was put on ring closure reactions forming drug-like substructures. The library of building blocks consists of about 25,000 readily available synthesis building blocks. DOGS builds up new structures in a stepwise process. Each virtual synthesis step adds a fragment to the growing molecule until a stop criterion (upper threshold for molecular mass or number of synthesis steps) is fulfilled. In a theoretical evaluation, a set of ~1,800 molecules proposed by DOGS is analyzed for critical properties of de novo designed compounds. The software is able to suggest drug-like molecules (79% violate fewer than two of Lipinski’s ‘rule of five’ criteria). In addition, a trained classifier for drug-likeness assigns a score >0.8 to 51% of the designed molecules (with 1.0 being the top score). Furthermore, most of the DOGS molecules are deemed to be synthesizable by a retro-synthesis descriptor (77% of molecules score in the top 10% of the descriptor’s value range).
Calculated logP(o/w) values of constructed molecules resemble a unimodal distribution centred close to the mean of logP(o/w) values calculated for the reference compounds. A structural analysis of selected designs reveals that DOGS is capable of constructing molecules reflecting the overall topological arrangement of pharmacophoric features found in the reference ligands. At the same time, the DOGS designs represent innovative compounds that are structurally distinct from the references. Synthesis routes for these examples are short and seem feasible in most cases. Some reaction steps might need modification by using protecting groups to avoid unwanted side reactions. Plausible bioisosteres for known privileged fragments addressing the S1 pocket of trypsin were proposed by DOGS in a case study. Three of them can be found in known trypsin inhibitors as S1-addressing side chains. The software was also tested in two prospective case studies to design bioactive compounds. DOGS was applied to design ligands for human gamma-secretase and human histamine receptor subtype 4 (hH4R). Two selected designs for gamma-secretase were readily synthesizable as suggested by the software in one-step reactions. Both compounds represent inverse modulators of the target molecule. In a second case study, a ligand candidate selected for hH4R was synthesized exactly following the three-step synthesis plan suggested by DOGS. This compound showed low activity on the target structure. The concept of DOGS is able to deliver synthesizable and bioactive compounds. Suggested synthesis plans of selected compounds could readily be followed. DOGS can therefore serve as a valuable idea generator for the design of new pharmacologically active compounds.
This work presents a new method for the computer-assisted de novo design of drug-like molecules. The aim is the automated, goal-directed design of novel molecules with biological activity. In addition to the chemical compounds, the program developed here, DOGS (Design of Genuine Structures), proposes possible strategies for their synthesis. A fully deterministic construction algorithm uses available synthesis building blocks and established chemical reactions to assemble the new molecules. The building block library comprises about 25,000 molecules with a molecular mass between 30 and 300 Da. The collection of reactions for connecting the building blocks consists of 83 chemical reactions described in the literature, most of which are synthesis steps that generate new ring systems. DOGS builds new molecules stepwise: in each virtual synthesis step a new fragment is attached to the growing molecule until one of the stop criteria (a maximum molecular mass or number of synthesis steps is exceeded) is met. A ligand-based strategy is used to assess the quality of the intermediate and final products. The emerging molecules are compared with a known reference ligand that exhibits the desired biological activity. The procedure aims to maximize the similarity of the newly constructed molecules to the reference. A graph kernel method computes the similarity to the reference ligand by comparing their two-dimensional molecular structures. In a theoretical evaluation of the program, about 1,800 generated potential trypsin inhibitors are analyzed for properties that are critical for newly designed compounds: DOGS is able to design drug-like molecules (79% violate fewer than two of Lipinski's 'rule of five' criteria for estimating oral bioavailability). In addition, the drug-likeness of the DOGS molecules was assessed by a trained classifier; 51% of the compounds received a score in the top 20% of the classifier's value range.
Furthermore, the synthetic accessibility of most of the computer-generated molecules is rated as high (77% score in the top 10% of the value range of a descriptor for estimating synthesizability). The distribution of the calculated logP(o/w) values of the constructed molecules matches that of the reference ligands. An examination of the proposed trypsin inhibitors for bioisosteres addressing the S1 binding pocket shows that DOGS generates plausible proposals. Most are potentially able to form a critical charge-mediated interaction with the protein in the S1 binding pocket. The proposals include three side chains for which interactions with the S1 binding pocket of trypsin have been confirmed experimentally. An analysis of selected examples from different ligand design runs for various biological targets shows that the program is able to retain the general topological arrangement of the reference ligands' potential interaction points in the newly generated molecules. At the same time, these molecules are structurally distinct from the reference ligands. The generated synthesis routes are short and appear plausible in most cases. For some synthesis steps, the additional use of protecting groups will be necessary in practice to avoid unwanted side reactions. In addition to the theoretical analyses, the software was evaluated in practice in prospective ligand design studies. For this purpose DOGS was used to generate ligands for the human histamine receptor 4 (hH4R) and for human gamma-secretase. For hH4R, one of the designed potential ligands was synthesized, and the proposed synthesis route could be followed exactly. The ligand shows a slight affinity for the histamine receptor.
For gamma-secretase, two of the designed molecules were selected for synthesis and testing. In both cases the synthesis strategy proposed by DOGS could again be followed. Subsequent in vitro analyses identified both compounds as inverse modulators of gamma-secretase. The construction concept of DOGS is able to propose bioactive substances. These are synthetically accessible and can be synthesized according to the proposed strategy. The program can thus serve as an idea generator for the design of new bioactive molecules.
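The stepwise, deterministic construction with stop criteria described for DOGS can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the DOGS algorithm: fragments are plain feature sets rather than molecular graphs, masses are simply summed (real reactions may lose atoms), a set-overlap score stands in for the graph kernel, and all names and data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    mass: float           # molecular mass in Da (toy values)
    features: frozenset   # toy stand-in for structural features


def similarity(a: frozenset, b: frozenset) -> float:
    """Set-overlap (Tanimoto) score; a placeholder for the graph kernel."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def grow(reference: frozenset, library: list, max_mass: float, max_steps: int):
    """Greedily add one fragment per virtual step until a stop criterion is met."""
    chosen, features, mass = [], frozenset(), 0.0
    for _ in range(max_steps):                  # stop criterion: number of steps
        # Stop criterion: the grown molecule must stay within the mass threshold.
        candidates = sorted(
            (f for f in library if mass + f.mass <= max_mass),
            key=lambda f: f.name,               # fixed order keeps ties deterministic
        )
        if not candidates:
            break
        # Pick the fragment whose addition is most similar to the reference.
        best = max(candidates, key=lambda f: similarity(features | f.features, reference))
        chosen.append(best.name)
        features = features | best.features
        mass += best.mass
    return chosen, mass, similarity(features, reference)


# Hypothetical building blocks and reference ligand features.
library = [
    Fragment("A", 120.0, frozenset({"amide", "ring"})),
    Fragment("B", 150.0, frozenset({"amine"})),
    Fragment("C", 200.0, frozenset({"halogen"})),
]
reference = frozenset({"amide", "ring", "amine"})
print(grow(reference, library, max_mass=300.0, max_steps=3))
```

Because every choice is a deterministic maximum over a fixed ordering, repeated runs with the same inputs give the same route, mirroring the "fully deterministic" property claimed in the abstract.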

    A treatment of stereochemistry in computer aided organic synthesis

    This thesis describes the author’s contributions to a new stereochemical processing module constructed for the ARChem retrosynthesis program. The purpose of the module is to add the ability to perform enantioselective and diastereoselective retrosynthetic disconnections and generate appropriate precursor molecules. The module uses evidence-based rules generated from a large database of literature reactions. Chapter 1 provides an introduction and critical review of the published body of work on computer aided synthesis design. The role of computer perception of key structural features (rings, functional groups, etc.) and the construction and use of reaction transforms for generating precursors is discussed. Emphasis is also given to the application of strategies in retrosynthetic analysis. The availability of large reaction databases has enabled a new generation of retrosynthesis design programs to be developed that use automatically generated transforms assembled from published reactions. A brief description of the transform generation method employed by ARChem is given. Chapter 2 describes the algorithms devised by the author for handling the computer recognition and representation of the stereochemical features found in molecule and reaction scheme diagrams. The approach is generalised and uses flexible recognition patterns to transform information found in chemical diagrams into concise stereo descriptors for computer processing. An algorithm for efficiently comparing and classifying pairs of stereo descriptors is described. This algorithm is central to solving the stereochemical constraints in a variety of substructure matching problems addressed in chapter 3. The concise representation of reactions and transform rules as hyperstructure graphs is described. Chapter 3 is concerned with the efficient and reliable detection of stereochemical symmetry in molecules, reactions and rules.
A novel symmetry perception algorithm, based on a constraint satisfaction problem (CSP) solver, is described. The use of a CSP solver to implement an isomorph-free matching algorithm for stereochemical substructure matching is detailed. The prime function of this algorithm is to seek out unique retron locations in target molecules and then to generate precursor molecules without duplications due to symmetry. Novel algorithms for classifying asymmetric, pseudo-asymmetric and symmetric stereocentres; meso, centro, and C2 symmetric molecules; and the stereotopicity of trigonal (sp2) centres are described. Chapter 4 introduces and formalises the annotated structural language used to create both retrosynthetic rules and the patterns used for functional group recognition. A novel functional group recognition package is described along with its use to detect important electronic features such as electron-withdrawing or electron-donating groups and leaving groups. The functional groups and electronic features are used as constraints in retron rules to improve transform relevance. Chapter 5 details the approach taken to design detailed stereoselective and substrate-controlled transforms from organised hierarchies of rules. The rules employ a rich set of constraint annotations that concisely describe the keying retrons. The application of the transforms for collating evidence-based scoring parameters from published reaction examples is described. A survey of available reaction databases and of techniques for mining stereoselective reactions is presented. A data mining tool was developed for finding the most reputable stereoselective reaction types for coding as transforms. For various reasons it was not possible during the research period to fully integrate this work with the ARChem program. Instead, Chapter 6 introduces a novel one-step retrosynthesis module to test the developed transforms.
The retrosynthesis algorithms use the organisation of the transform rule hierarchy to efficiently locate the best retron matches using all applicable stereoselective transforms. This module was tested using a small set of selected target molecules, and the generated routes were ranked using a series of measured parameters including: stereocentre clearance and bond cleavage; example reputation; estimated stereoselectivity with reliability; and evidence of tolerated functional groups. In addition, a method for detecting regioselectivity issues is presented. This work presents a number of algorithms using common set and graph theory operations and notations. Appendix A lists the set theory symbols and meanings. Appendix B summarises and defines the common graph theory terminology used throughout this thesis.

    TeachOpenCADD: a teaching platform for computer-aided drug design using open source packages and data

    Owing to the increase in freely available software and data for cheminformatics and structural bioinformatics, research for computer-aided drug design (CADD) is more and more built on modular, reproducible, and easy-to-share pipelines. While documentation for such tools is available, there are only a few freely accessible examples that teach the underlying concepts focused on CADD, especially addressing users new to the field. Here, we present TeachOpenCADD, a teaching platform developed by students for students, using open source compound and protein data as well as basic and CADD-related Python packages. We provide interactive Jupyter notebooks for central CADD topics, integrating theoretical background and practical code. TeachOpenCADD is freely available on GitHub: https://github.com/volkamerlab/TeachOpenCAD

    From Quantity to Quality: Massive Molecular Dynamics Simulation of Nanostructures under Plastic Deformation in Desktop and Service Grid Distributed Computing Infrastructure

    The distributed computing infrastructure (DCI) based on BOINC and EDGeS-bridge technologies for high-performance distributed computing is used to port a sequential molecular dynamics (MD) application to a parallel version for a DCI with Desktop Grids (DGs) and Service Grids (SGs). The actual metrics of the working DG-SG DCI were measured; a normal distribution of host performances and signs of log-normal distributions of other characteristics (CPUs, RAM, and HDD per host) were found. The practical feasibility and high efficiency of MD simulations on the DG-SG DCI were demonstrated in an experiment with massive MD simulations of a large number of aluminum nanocrystals ($\sim 10^2$-$10^3$). Statistical analysis (Kolmogorov-Smirnov test, moment analysis, and bootstrapping analysis) of the defect density distribution over the ensemble of nanocrystals showed that a change of plastic deformation mode is followed by a qualitative change in the type of defect density distribution over the ensemble. Some limitations (fluctuating performance, unpredictable availability of resources, etc.) of the typical DG-SG DCI were outlined, and some advantages (high efficiency, high speedup, and low cost) were demonstrated. Deploying on a DG DCI makes it possible to obtain new scientific quality from the simulated quantity of numerous configurations by harnessing sufficient computational power to undertake MD simulations over a wider range of physical parameters (configurations) in a much shorter timeframe. Comment: 13 pages, 11 pages (http://journals.agh.edu.pl/csci/article/view/106
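As a rough illustration of the kind of distribution comparison a Kolmogorov-Smirnov test performs, the two-sample KS statistic can be computed as below. This is a generic sketch, not the authors' analysis pipeline; the function name and the sample values are hypothetical, and it returns only the statistic, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap between two step functions peaks at one of the data points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)   # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical defect densities for two deformation modes.
mode_1 = [1.0, 2.0, 3.0, 4.0]
mode_2 = [3.0, 4.0, 5.0, 6.0]
print(ks_statistic(mode_1, mode_2))
```

A statistic near 0 means the two empirical distributions are nearly identical, while a value near 1 means they barely overlap.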

    Structure and singly occupied molecular orbital analysis of anionic tautomers of guanine

    Recently we reported the discovery of adiabatically bound anions of guanine which might be involved in the processes of DNA damage by low-energy electrons and in charge transfer through DNA. These anions correspond to some tautomers that have been ignored thus far. They were identified using a hybrid quantum mechanical-combinatorial approach in which an energy-based screening was performed on the library of 499 tautomers with their relative energies calculated with quantum chemistry methods. In the current study we analyze the adiabatically bound anions of guanine in two aspects: 1) the geometries and excess electron distributions are analyzed and compared with anions of the most stable neutrals to identify the sources of stability; 2) the chemical space of guanine tautomers is explored to verify whether these new tautomers are contained in a particular subspace of the tautomeric space. The first task involves the development of novel approaches: quantum chemical data such as the electron density, the orbital, and information on its bonding/antibonding character are coded into holograms and analyzed using chemoinformatics techniques. The second task is completed using substructure analysis and clustering techniques performed on molecules represented by 2D fingerprints. The major conclusion is that the high stability of adiabatically bound anions originates from the bonding character of the pi orbital occupied by the excess electron. This compensates for the antibonding character that usually causes significant buckling of the ring. Also, the excess electron is more homogeneously distributed over both rings than in the case of anions of the most stable neutral species. In terms of 2D substructure, the most stable anionic tautomers generally have additional hydrogen atoms at C8 and/or C2 and they do not have hydrogen atoms attached to C4, C5 and C6. They also form an “island of stability” in the tautomeric space of guanine.