7 research outputs found

    Leveraging Reaction-aware Substructures for Retrosynthesis Analysis

    Retrosynthesis analysis is a critical task in organic chemistry and is central to many important industries. Previously, various machine learning approaches have achieved promising results on this task by representing output molecules as strings and decoding them autoregressively, token by token, with generative models. Text generation and machine translation models from natural language processing were frequently used for this purpose. The token-by-token decoding approach is not intuitive from a chemistry perspective because some substructures are relatively stable and remain unchanged during reactions. In this paper, we propose a substructure-level decoding model, where the substructures are reaction-aware and can be automatically extracted with a fully data-driven approach. Our approach achieved improvement over previously reported models, and we find that the performance can be further boosted if the accuracy of substructure extraction is improved. The substructures extracted by our approach can provide users with better insights for decision-making compared to existing methods. We hope this work will generate interest in this fast-growing and highly interdisciplinary area of retrosynthesis prediction and other related topics. Comment: Work in progress

    Retrosynthesis prediction enhanced by in-silico reaction data augmentation

    Recent advances in machine learning (ML) have expedited retrosynthesis research by helping chemists design experiments more efficiently. However, all ML-based methods consume substantial amounts of paired training data (i.e., chemical reaction: product-reactant(s) pair), which is costly to obtain. Moreover, companies view reaction data as a valuable asset and restrict researchers' access to it. These issues prevent the creation of more powerful retrosynthesis models due to their data-driven nature. As a response, we exploit easy-to-access unpaired data (i.e., one component of product-reactant(s) pair) for generating in-silico paired data to facilitate model training. Specifically, we present RetroWISE, a self-boosting framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation using unpaired data, ultimately leading to a superior model. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy on the USPTO-50K test dataset). Moreover, it consistently improves the prediction accuracy of rare transformations. These results show that RetroWISE overcomes the training bottleneck through in-silico reactions, thereby paving the way toward more effective ML-based retrosynthesis models.
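The top-1 accuracy figure quoted above can be illustrated with a minimal sketch of a top-k accuracy computation. This is not the RetroWISE evaluation code: the function names, the toy SMILES strings, and the matching rule (exact string match after sorting the `.`-separated reactants) are all assumptions made for illustration. Benchmark evaluations typically compare canonicalized SMILES produced by a cheminformatics toolkit instead.

```python
def normalize(reactants: str) -> str:
    """Return an order-independent form of a '.'-separated reactant string.

    Assumption: plain string sorting stands in for real SMILES canonicalization.
    """
    return ".".join(sorted(reactants.split(".")))


def top_k_accuracy(predictions, targets, k=1):
    """Fraction of products whose true reactants appear in the first k predictions.

    predictions: one ranked list of candidate reactant strings per product.
    targets:     the ground-truth reactant string for each product.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = 0
    for ranked, truth in zip(predictions, targets):
        wanted = normalize(truth)
        if any(normalize(candidate) == wanted for candidate in ranked[:k]):
            hits += 1
    return hits / len(targets)


# Hypothetical toy data: two products, two ranked candidates each.
predictions = [
    ["CC(=O)O.OCC", "CC(=O)Cl.OCC"],
    ["c1ccccc1Br.OB(O)c1ccccc1", "c1ccccc1I.OB(O)c1ccccc1"],
]
targets = ["OCC.CC(=O)O", "c1ccccc1I.OB(O)c1ccccc1"]

print(top_k_accuracy(predictions, targets, k=1))
print(top_k_accuracy(predictions, targets, k=2))
```

In this toy case the first product is matched by its top-ranked candidate and the second only by its second candidate, so widening k from 1 to 2 raises the score.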

    Reconstruction of lossless molecular representations from fingerprints

    The simplified molecular-input line-entry system (SMILES) is the most prevalent molecular representation used in AI-based chemical applications. However, there are innate limitations associated with the internal structure of SMILES representations. In this context, this study exploits the resolution and robustness of unique molecular representations, i.e., SMILES and SELFIES (SELF-referencIng Embedded strings), reconstructed from a set of structural fingerprints, which are proposed and used herein as vital representational tools for chemical and natural language processing (NLP) applications. This is achieved by restoring the connectivity information lost during fingerprint transformation with high accuracy. Notably, the results reveal that reversing the seemingly irreversible molecule-to-fingerprint conversion is feasible. More specifically, four structural fingerprints (extended connectivity, topological torsion, atom pairs, and atomic environments) can be used as inputs and outputs of chemical NLP applications. Therefore, this comprehensive study addresses the major limitation of structural fingerprints that precludes their use in NLP models. Our findings will facilitate the development of text- or fingerprint-based chemoinformatic models for generative and translational tasks. This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (MSIT) (Nos. NRF-2019M3E5D4066898, NRF-2022R1C1C1005080 and NRF-2020M3A9G7103933 to I.A. and J.L.). This work was also supported by the Korea Environment Industry & Technology Institute (KEITI) through the Technology Development Project for Safety Management of Household Chemical Products, funded by the Korea Ministry of Environment (MOE) (KEITI:2020002960002 and NTIS:1485017120 to U.V.U. and J.L.).

    Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

    Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experiments guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming, and requires years of experience to perform successfully. Hence, the synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields.
    Lastly, having approached all chemical reaction tasks with atom-mapping independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned while being trained self-supervised on chemical reactions. My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented with a text notation. It is as if atoms are my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions - atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.

    Bioaccumulation potential of 'Meeker' and 'Willamette' raspberry (Rubus idaeus L.) fruits towards macro- and microelements and their nutritional evaluation

    Raspberry (Rubus idaeus L.) is the most important type of berry fruit in the Republic of Serbia. The bioaccumulation factor (BF) for the elements detected in the fruits of the raspberry cultivars 'Willamette' and 'Meeker' was calculated to determine their bioaccumulation potential. In addition, the nutritional quality of fruits in relation to nutritionally essential elements was evaluated and compared with the recommended daily intake. To determine the concentrations of 19 macro- and microelements in fruits and the soil, inductively coupled plasma optical emission spectrometry was used. Among the analyzed elements, As, Cd, Co, Cr, Li and Mo were below the limit of detection in the fruits of both raspberry cultivars, whereas Na and Ni were detected only in fruits of the 'Meeker' cultivar. All analyzed elements were detected in the soil. The results of the work indicated the high potential of the studied cultivars to accumulate nutritional elements K and Ca. In both raspberry cultivars, there were no substantial differences in the bioaccumulation of most elements. However, two elements (B and Mn) can be singled out: the BF for B in the 'Willamette' fruit was 3 times lower compared to the BF in the 'Meeker' fruit, whereas the BF value for Mn in the 'Willamette' fruit was almost 8 times higher compared to the BF value for the 'Meeker' fruit. Furthermore, the cultivars did not tend to accumulate potentially toxic elements such as Ba, Co, Cu and Ni. The nutritional evaluation revealed that the studied raspberry fruits are a good source of K, Ca, Mg, Fe, Mn and Cu. Based on the BF values, differences observed in the accumulation of B, Ba, Na, Ni and Mn may be attributed to the characteristics of the cultivars.
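The abstract does not state how the BF is computed. A common definition, assumed here, is the ratio of an element's concentration in the plant tissue to its concentration in the soil. The sketch below applies that assumed definition to hypothetical concentrations; none of the numbers come from the study, and the function and variable names are illustrative only.

```python
def bioaccumulation_factor(fruit_conc: float, soil_conc: float) -> float:
    """Assumed definition: BF = concentration in fruit / concentration in soil.

    Both concentrations must be expressed in the same unit (e.g. mg/kg).
    """
    if soil_conc <= 0:
        raise ValueError("soil concentration must be positive")
    return fruit_conc / soil_conc


# Hypothetical Mn concentrations in mg/kg, chosen only to illustrate the ratio.
soil_mn = 512.0
fruit_mn = {"Willamette": 64.0, "Meeker": 8.0}

# BF per cultivar for the same soil concentration.
bf = {cultivar: bioaccumulation_factor(conc, soil_mn)
      for cultivar, conc in fruit_mn.items()}

for cultivar, value in bf.items():
    print(f"{cultivar}: BF = {value:.4f}")

# How many times higher the BF is in one cultivar than in the other.
print(bf["Willamette"] / bf["Meeker"])
```

Comparing cultivars by the ratio of their BF values, as in the last line, mirrors the kind of "times higher" or "times lower" statement made in the abstract.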