222 research outputs found

    Retrosynthetic reaction prediction using neural sequence-to-sequence models

    We describe a fully data-driven model that learns to perform a retrosynthetic reaction prediction task, which is treated as a sequence-to-sequence mapping problem. The end-to-end trained model has an encoder-decoder architecture consisting of two recurrent neural networks, an approach that has previously shown great success in other sequence-to-sequence prediction tasks such as machine translation. The model is trained on 50,000 experimental reaction examples from the United States patent literature, which span 10 broad reaction types that are commonly used by medicinal chemists. We find that our model performs comparably with a rule-based expert system baseline model, and also overcomes certain limitations associated with rule-based expert systems and with any machine learning approach that contains a rule-based expert system component. Our model provides an important first step towards solving the challenging problem of computational retrosynthetic analysis.

    SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents

    The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB

    AI-Driven Synthetic Route Design Incorporated with Retrosynthesis Knowledge

    Computer-aided synthesis planning (CASP) aims to assist chemists in performing retrosynthetic analysis for which they utilize their experiments, intuition, and knowledge. Recent breakthroughs in machine learning (ML) techniques, including deep neural networks, have significantly improved data-driven synthetic route designs without human intervention. However, learning chemical knowledge by ML for practical synthesis planning has not yet been adequately achieved and remains a challenging problem. In this study, we developed a data-driven CASP application integrated with various portions of retrosynthesis knowledge called “ReTReK” that introduces the knowledge as adjustable parameters into the evaluation of promising search directions. The experimental results showed that ReTReK successfully searched synthetic routes based on the specified retrosynthesis knowledge, indicating that the synthetic routes searched with the knowledge were preferred to those without the knowledge. The concept of integrating retrosynthesis knowledge as adjustable parameters into a data-driven CASP application is expected to enhance the performance of both existing data-driven CASP applications and those under development

    A treatment of stereochemistry in computer aided organic synthesis

    This thesis describes the author’s contributions to a new stereochemical processing module constructed for the ARChem retrosynthesis program. The purpose of the module is to add the ability to perform enantioselective and diastereoselective retrosynthetic disconnections and generate appropriate precursor molecules. The module uses evidence-based rules generated from a large database of literature reactions. Chapter 1 provides an introduction and critical review of the published body of work for computer-aided synthesis design. The role of computer perception of key structural features (rings, functional groups, etc.) and the construction and use of reaction transforms for generating precursors are discussed. Emphasis is also given to the application of strategies in retrosynthetic analysis. The availability of large reaction databases has enabled a new generation of retrosynthesis design programs to be developed that use automatically generated transforms assembled from published reactions. A brief description of the transform generation method employed by ARChem is given. Chapter 2 describes the algorithms devised by the author for handling the computer recognition and representation of the stereochemical features found in molecule and reaction scheme diagrams. The approach is generalised and uses flexible recognition patterns to transform information found in chemical diagrams into concise stereo descriptors for computer processing. An algorithm for efficiently comparing and classifying pairs of stereo descriptors is described. This algorithm is central for solving the stereochemical constraints in a variety of substructure matching problems addressed in chapter 3. The concise representation of reactions and transform rules as hyperstructure graphs is described. Chapter 3 is concerned with the efficient and reliable detection of stereochemical symmetry in molecules, reactions and rules. 
A novel symmetry perception algorithm, based on a constraint satisfaction problem (CSP) solver, is described. The use of a CSP solver to implement an isomorph‐free matching algorithm for stereochemical substructure matching is detailed. The prime function of this algorithm is to seek out unique retron locations in target molecules and then to generate precursor molecules without duplications due to symmetry. Novel algorithms for classifying asymmetric, pseudo‐asymmetric and symmetric stereocentres; meso, centro, and C2 symmetric molecules; and the stereotopicity of trigonal (sp2) centres are described. Chapter 4 introduces and formalises the annotated structural language used to create both retrosynthetic rules and the patterns used for functional group recognition. A novel functional group recognition package is described along with its use to detect important electronic features such as electron‐withdrawing or donating groups and leaving groups. The functional groups and electronic features are used as constraints in retron rules to improve transform relevance. Chapter 5 details the approach taken to design detailed stereoselective and substrate-controlled transforms from organised hierarchies of rules. The rules employ a rich set of constraint annotations that concisely describe the keying retrons. The application of the transforms for collating evidence-based scoring parameters from published reaction examples is described. A survey of available reaction databases and the techniques for mining stereoselective reactions are presented. A data mining tool was developed for finding the best reputable stereoselective reaction types for coding as transforms. For various reasons it was not possible during the research period to fully integrate this work with the ARChem program. Instead, Chapter 6 introduces a novel one‐step retrosynthesis module to test the developed transforms. 
The retrosynthesis algorithms use the organisation of the transform rule hierarchy to efficiently locate the best retron matches using all applicable stereoselective transforms. This module was tested using a small set of selected target molecules and the generated routes were ranked using a series of measured parameters including: stereocentre clearance and bond cleavage; example reputation; estimated stereoselectivity with reliability; and evidence of tolerated functional groups. In addition a method for detecting regioselectivity issues is presented. This work presents a number of algorithms using common set and graph theory operations and notations. Appendix A lists the set theory symbols and meanings. Appendix B summarises and defines the common graph theory terminology used throughout this thesis

    Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

    Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experiments guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming, and requires years of experience to perform successfully. Hence, the synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields. 
Lastly, having approached all chemical reaction tasks with atom-mapping independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned while being trained self-supervised on chemical reactions. My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented with a text notation. It is as if atoms are my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions - atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.
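
To make the "atoms are my letters" analogy concrete, here is a minimal sketch of how a molecule or reaction written in a text notation could be split into atom-level tokens before being fed to a sequence model. It assumes SMILES as the text notation; the abstract does not name the notation or the tokenizer, so the pattern and the `tokenize` function below are illustrative assumptions and not the thesis's method.

```python
import re

# Assumed token pattern (illustrative, not taken from the thesis):
#   1. bracket atoms such as [C@@H] or [NH4+]
#   2. the two-letter halogens Br and Cl
#   3. two-digit ring closures such as %12
#   4. any single letter (organic-subset or aromatic atom)
#   5. any other single character (bonds, branches, ring digits, '.', '>')
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|.")


def tokenize(smiles):
    """Split a SMILES or reaction SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


if __name__ == "__main__":
    # Acetic acid: every atom, bond and branch symbol becomes one "letter".
    print(tokenize("CC(=O)O"))
    # A reaction SMILES (reactants>>product) is tokenized the same way.
    print(tokenize("CC(=O)O.OCC>>CC(=O)OCC"))
```

Joining the tokens reproduces the input string, so the split is lossless; a model vocabulary would then be built from the distinct tokens seen in the training reactions.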

    Understanding How Synthetic Organic Chemistry Graduate Students Navigate Scifinder

    Students pursuing a Ph.D. degree are expected to contribute research to their field, for which the success depends, in part, on their ability to find, interpret, and use scholarly information from the primary literature. However, studies from the information sciences show that graduate students from a variety of fields, including the sciences, frequently struggle to comprehensively search their respective dissertation topics because of insufficient prior content knowledge and lack of guidance from their disciplinary community. This body of literature is consistent with the results of my previous research of chemistry graduate students’ laboratory decision-making processes. Specifically, that study showed their search and evaluation of the scientific literature to find a research protocol was critical to the success or failure of the students’ research. For these reasons, I chose to investigate how synthetic organic chemistry graduate students perform literature searches, using SciFinder, to find protocols for preparing previously unreported compounds. For my study, I used situated cognition and communities of practice (CoP) as my theoretical frameworks in conjunction with an ethnomethodological research design. Five organic chemistry graduate students were interviewed to understand their strategies and sense-making procedures for searching the literature, specifically focusing on how they decide to: 1). input a topic or structural representation, 2). evaluate the search results, and 3). use specific procedures for deciding which of the protocols to carry out in the laboratory. The findings from my study indicated that the graduate students’ information-seeking behaviors and sense-making procedures were directly influenced by their domain-specific content knowledge and their exposure to the organic CoP. 
Specifically, the second-year and third-year graduate students heavily depended on the database because their domain-specific content knowledge was not operational at this stage in their training. Comparatively, the sixth-year graduate students could easily use their organic chemistry knowledge (i.e. named organic reactions, functional group chemistry, and the retrosynthetic approach) to propose a research protocol; therefore, they used the database to substantiate their synthesis protocols and/or to find a method to synthesize the proposed starting materials. As a result of their exposure to their Ph.D. research, the graduates had become more proficient with using the database and developed heuristics to evaluate their searches, thereby allowing them to quickly evaluate the often substantial number of hits. Finally, the findings indicated that the graduate students were utilizing their CoP in different ways. For instance, the second-year and third-year graduate students would seek their advisor’s approval, whereas the sixth-year graduate students would seek their peers’ feedback regarding their protocols. Findings from my study can broadly be integrated into the information science field to enhance and improve undergraduate and graduate students’ ISB. Furthermore, my findings can be applied to improve how we educate and train organic chemistry students (at both the undergraduate and graduate level).

    Computational Studies on Cellular Metabolism: From Biochemical Pathways to Complex Metabolic Networks

    Biotechnology promises the biologically and ecologically sustainable production of commodity chemicals, biofuels, pharmaceuticals and other high-value products using industrial platform microorganisms. Metabolic engineering plays a key role in this process, providing the tools for targeted modifications of microbial metabolism to create efficient microbial cell factories that convert low value substrates to value-added chemicals. Engineering microbes for the bioproduction of chemicals has been practiced through three different approaches: (i) optimization of native pathways of a host organism; (ii) incorporation of heterologous pathways in an amenable organism; and finally (iii) design and introduction of synthetic pathways in an organism. So far, the progress that has been made in the biosynthesis of chemicals was mostly achieved using the first two approaches. Nevertheless, many novel biosynthetic pathways for the production of native and non-native compounds that have potential to provide near-theoretical yields and high specific production rates of chemicals remain yet to be discovered. Therefore, the third approach is crucial for the advancement of bio-based production of value-added chemicals. We need to fully comprehend and analyze the existing knowledge of metabolism in order to generate new hypotheses and design de novo pathways. In this thesis, through development and application of efficient computational methods, we took the research path to expand our understanding of cell metabolism with the aim to discover novel knowledge about metabolic networks. We analyze different aspects of metabolism through five distinct studies. In the first study, we begin with a holistic view of the enzymatic reactions across all the species, and we propose a computational approach for identifying all the theoretically possible enzymatic reactions based on the known biochemistry. We organize our results in a web-based database called “Atlas of biochemistry”. 
In the second study, we focus on one of the most structurally diverse and ubiquitous constituents of metabolism, the lipid metabolism. Here we propose a computational framework for integrating lipid species with unknown metabolic/catabolic pathways into metabolic networks. In our next study, we investigate the full metabolic capacity of E. coli. We explore computationally all enzymatic potentials of this organism, and we introduce the “Super E. coli”, a new and advanced chassis for metabolic engineering studies. Our next contribution concentrates on the development of a new method for the atom-level description of metabolic networks. We demonstrate the significance of our approach through the reconstruction of an atom-level map of the E. coli central metabolism. In the last study, we turn our focus to studying the thermodynamics of metabolism and we present our original approach for estimating the thermodynamic properties of an important class of metabolites. So far, the available thermodynamic properties either from experiments or the computational methods are estimated with respect to the standard conditions, which are different from typical biological conditions. Our workflow paves the way for reliable computing of thermochemical properties of biomolecules at biological conditions of temperature and pressure. Finally, in the conclusion chapter, we discuss the outlook of this work and the potential further applications of the computational methods that were developed in this thesis.

    Cascade Reactions to Access Bioactive Scaffolds

    In the recent decade there has been a shift in drug development to favor planar, aromatic small molecules with easy synthetic access, despite centuries of research in bioactive natural products, which are often highly rigid, three-dimensional structures like spirocycles. These scaffolds remain underexplored in drug development efforts, predominantly due to the challenges associated with their synthesis, and lack of a general, convergent methodology. To address these challenges, we have designed an O–H Insertion/Conia-ene reaction cascade between homopropargylic alcohols and acceptor/acceptor diazo compounds, which uses a dual Rh/Au+ catalytic system. This cascade occurs instantly at room temperature, and has been applied towards the synthesis of substituted tetrahydrofurans when linear diazo compounds are used. Thus far, the cascade accommodates a variety of substituted diazo compounds with carboxylic acids/alcohols to provide functionalized tetrahydrofurans and γ-butyrolactones with a high degree of regio- and stereoselectivity. Next, we were able to extend the utility of our O–H Insertion/Conia-ene reaction cascade towards the synthesis of spiroheterocycles by employing cyclic diazo substrates with propargylic alcohols. This convergent approach furnishes an array of spiroheterocycles by employing the same dual Rh/Au+ catalytic system in refluxing dichloromethane. This approach has proven general, and was used to synthesize a substrate scope of twenty-four substrates based on natural product scaffolds, including spirobarbiturates, spiro-Meldrum’s acids, spirooxindoles, and the spirocyclic core of the pseurotin natural products. Lastly, we have extended our X–H Insertion/Conia-ene strategy towards uncommon nucleophiles, for the synthesis of sulfur- and all-carbon spirocycles. 
When propargylic thiols are employed as substrates with linear diazos, we have found that the S–H insertion reaction proceeds in high yield, and Conia-ene cyclization can be promoted when the reaction is conducted in a stepwise fashion. However, when the reaction is conducted in a single pot, we isolated a new thiofuranofuran compound, which we expect forms via undesired 5-endo-dig cyclization of the propargylic thiol, followed by cyclopropanation and subsequent ring opening. Additionally, by changing our retrosynthetic approach to an intramolecular disconnection, we were able to synthesize an all-carbon spirocycle through a benzylic C–H Insertion/Conia-ene cascade, by using a catalytic mixture consisting of Rh2(HFB)4, ClAuPPh3, and CuOTf in refluxing dichloromethane. In an orthogonal research effort, we have also developed a metal-free cascade for the synthesis of aromatic heterocycles. This cascade uses precursors synthesized from readily accessible 2’-hydroxy/aminochalcones, and commences with a DBU-mediated intramolecular aldol condensation, which occurs within 90 minutes at room temperature, to generate a 1,3,5-triene. This triene is heated overnight (80 – 120 °C) to promote a 6π-electrocyclization and oxidative aromatization to generate a new aromatic ring. This cascade has proven general, and has been applied towards the synthesis of benzo[c]coumarins, phenanthridinones, dibenzofurans, and carbazoles, up to a 1-gram scale. The cascade reactions developed throughout the course of this dissertation research provide new retrosynthetic strategies for the formation of natural product cores, which could be used to expand the chemical space in drug discovery.