346 research outputs found

    Evolution of Metabolic Networks: A Computational Framework

    Background: The metabolic architectures of extant organisms share many key pathways such as the citric acid cycle, glycolysis, or the biosynthesis of most amino acids. Several competing hypotheses for the evolutionary mechanisms that shape metabolic networks have been discussed in the literature, each of which finds support from comparative analysis of extant genomes. Alternatively, the principles of metabolic evolution can be studied by direct computer simulation. This requires, however, an explicit implementation of all pertinent components: a universe of chemical reactions upon which the metabolism is built, an explicit representation of the enzymes that implement the metabolism, a genetic system that encodes these enzymes, and a fitness function that can be selected for. Results: We describe here a simulation environment that implements all these components in a simplified way so that large-scale evolutionary studies are feasible. We employ an artificial chemistry that views chemical reactions as graph rewriting operations and utilizes a toy version of quantum chemistry to derive thermodynamic parameters. Minimalist organisms with simple string-encoded genomes produce model ribozymes whose catalytic activity is determined by an ad hoc mapping between their secondary structure and the transition state graphs that they stabilize. Fitness is computed utilizing the ideas of metabolic flux analysis. We present an implementation of the complete system and first simulation results. Conclusions: The simulation system presented here allows coherent investigations into the evolutionary mechanisms of the first steps of metabolic evolution using a self-consistent toy universe.
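
    The idea of treating a chemical reaction as a graph rewriting operation can be illustrated with a minimal sketch. Everything named here (`Molecule`, `bond`, `apply_rule`, the toy rule) is a hypothetical illustration, not the paper's implementation. The sketch also assumes the atom indices of the rule are given explicitly, so the subgraph-matching step that a real artificial chemistry needs is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """A molecular graph: atom labels by index, bonds as unordered index pairs."""
    atoms: tuple       # e.g. ("C", "C", "C", "O")
    bonds: frozenset   # e.g. frozenset({frozenset({0, 1})})


def bond(i, j):
    """Build an unordered bond between atom indices i and j."""
    return frozenset({i, j})


def apply_rule(mol, break_bonds, form_bonds):
    """Rewrite the graph: delete `break_bonds`, then add `form_bonds`.

    Returns None when the rule does not match, i.e. a bond to break is
    absent or a bond to form already exists.
    """
    to_break = {bond(i, j) for i, j in break_bonds}
    to_form = {bond(i, j) for i, j in form_bonds}
    if not to_break <= mol.bonds or to_form & mol.bonds:
        return None
    return Molecule(mol.atoms, (mol.bonds - to_break) | to_form)


# Toy rearrangement on a four-atom chain: atom 0 migrates from atom 1 to atom 3.
chain = Molecule(("C", "C", "C", "O"),
                 frozenset({bond(0, 1), bond(1, 2), bond(2, 3)}))
product = apply_rule(chain, break_bonds=[(0, 1)], form_bonds=[(0, 3)])
```

    Atoms are conserved by construction, since a rule only edits the bond set. Applying the same rule to `product` returns `None`, because the bond it would break no longer exists.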

    Computational Studies on the Evolution of Metabolism

    Throughout evolution, living organisms have developed desirable properties, such as the ability to maintain functionality despite changes in the environment or their inner structure, the formation of functional modules, from metabolic pathways to organs, and, most essentially, the capacity to adapt and evolve through natural selection. It can be observed in the metabolic networks of modern organisms that many key pathways such as the citric acid cycle, glycolysis, or the biosynthesis of most amino acids are common to all of them. Understanding the evolutionary mechanisms behind this development of complex biological systems is an intriguing and important task of current research in biology as well as artificial life. Several competing hypotheses for the formation of metabolic pathways and the mechanisms that shape metabolic networks have been discussed in the literature, each of which finds support from comparative analysis of extant genomes. However, while they are powerful tools for the investigation of metabolic evolution, these traditional methods cannot look back far enough in evolution to the time when metabolism had to emerge and evolve to the form we observe today. To this end, simulation studies have been introduced to discover the principles of metabolic evolution and the sources of the emergence of metabolic properties. These approaches differ considerably in the realism and explicitness of the underlying models. A difficult trade-off between realism and computational feasibility has to be made, and further modeling decisions on many scales have to be taken into account, requiring the combination of knowledge from different fields such as chemistry, physics, biology and, not least, computer science. In this thesis, a novel computational model for the in silico evolution of early metabolism is introduced.
It comprises all the components on different scales needed to resemble a situation of evolving metabolic protocells in an RNA world. The model therefore contains a minimal RNA-based genetics and an evolving metabolism of catalytic ribozymes that manipulate a rich underlying chemistry. To allow the metabolic organization to escape from the confines of the chemical space set by the initial conditions of the simulation and, more generally, to allow open-ended evolution, an evolvable sequence-to-function map is used. At the heart of the metabolic subsystem is a graph-based artificial chemistry equipped with built-in thermodynamics. The generation of the metabolic reaction network is realized as a rule-based stochastic simulation. The necessary reaction rates are calculated on the fly from the chemical graphs of the reactants. The selection procedure among the population of protocells is based on the optimal metabolic yield of the protocells, which is computed using flux balance analysis. The introduced computational model allows for profound investigations of the evolution of early metabolism and the underlying evolutionary mechanisms. One application in this thesis is the study of the formation of metabolic pathways. To this end, four established hypotheses, namely backward evolution, forward evolution, patchwork evolution and the shell hypothesis, are discussed within the scope of this in silico evolution study. The metabolic pathways of the networks evolved in various simulation runs are determined and analyzed in terms of their evolutionary direction. The simulation results suggest that the seemingly mutually exclusive hypotheses may well be compatible when considering that different processes dominate different phases in the evolution of a metabolic system. Further, it is found that forward evolution shapes the metabolic network in the very early steps of evolution.
In later and more complex stages, enzyme recruitment supersedes forward evolution, keeping a core set of pathways from the early phase. Backward evolution can only be observed under conditions of steady environmental change. Additionally, the evolutionary histories of enzymes and metabolites were studied on the network level as well as for single instances, showing a great variety of evolutionary mechanisms at work. The second major focus of the in silico evolutionary study is the emergence of complex system properties, such as robustness and modularity. To this end, several techniques to analyze the metabolic systems were used. The measures for complex properties stem from the fields of graph theory, steady state analysis and neutral network theory. Some are used in general network analysis and others were developed specifically for this work. To discover potential sources for the emergence of system properties, three different evolutionary scenarios were tested and compared. The first two scenarios are the same as for the first part of the investigation, one scenario of evolution under static conditions and one incorporating a steady change in the set of "food" molecules. A third scenario was added that also simulates a static evolution but with an increased mutation rate and regular events of horizontal gene transfer between protocells of the population. The comparison of all three scenarios with real-world metabolic networks shows a significant similarity in structure and properties. Among the three scenarios, the two static evolutions yield the most robust metabolic networks; however, the networks evolved under environmental change exhibit their own strategy for a robustness more suited to their conditions. As expected from theory, horizontal gene transfer and changes in the environment seem to produce higher degrees of modularity in metabolism.
Both scenarios develop rather different kinds of modularity: while horizontal gene transfer provides for more isolated modules, the modules of the second scenario are far more interconnected.

    Computer Aided Synthesis Prediction to Enable Augmented Chemical Discovery and Chemical Space Exploration

    The drug-like chemical space is estimated to contain 10^60 molecules, and the largest generated database (GDB) obtained by the Reymond group contains 165 billion molecules with up to 17 heavy atoms. Furthermore, deep learning techniques to explore regions of chemical space are becoming more popular. However, the key to realizing the generated structures experimentally lies in chemical synthesis, the application of which was previously limited to manual planning or slow computer-assisted synthesis planning (CASP) models. Despite the 60-year history of CASP, few synthesis planning tools have been open-sourced to the community. In this thesis I co-led the development of, and investigated, one of the only fully open-source synthesis planning tools, AiZynthFinder, trained on both public and proprietary datasets consisting of up to 17.5 million reactions. This enables synthesis-guided exploration of the chemical space in a high-throughput manner, to bridge the gap between compound generation and experimental realisation. I first investigate both public and proprietary reaction data, and their influence on route-finding capability. Furthermore, I develop metrics for assessment of retrosynthetic prediction, single-step retrosynthesis models, and automated template extraction workflows. This is supplemented by a comparison of the underlying datasets and their corresponding models. Given the prevalence of ring systems in the GDB and wider medicinal chemistry domain, I developed ‘Ring Breaker’, a data-driven approach to enable the prediction of ring-forming reactions. I demonstrate its utility on frequently found and unprecedented ring systems, in agreement with literature syntheses. Additionally, I highlight its potential for incorporation into CASP tools, and outline methodological improvements that result in better route-finding capability.
To tackle the challenge of model throughput, I report a machine learning (ML) based classifier called the retrosynthetic accessibility score (RAscore), to assess the likelihood of finding a synthetic route using AiZynthFinder. The RAscore computes at least 4,500 times faster than AiZynthFinder and thus opens the possibility of pre-screening millions of virtual molecules from enumerated databases or generative models for synthesis-informed compound prioritization. Finally, I combine chemical library visualization with synthetic route prediction to facilitate experimental engagement with synthetic chemists. I enable the navigation of chemical property space by using interactive visualization to deliver associated synthetic data as endpoints. This aids in the prioritization of compounds. The ability to view synthetic route information alongside structural descriptors facilitates a feedback mechanism for the improvement of CASP tools and enables rapid hypothesis testing. I demonstrate the workflow as applied to the GDB databases to augment compound prioritization and synthetic route design.

    Structure generation and de novo design using reaction networks

    This project is concerned with de novo molecular design whereby novel molecules are built in silico and evaluated against properties relevant to biological activity, such as physicochemical properties and structural similarity to active compounds. The aim is to encourage cost-effective compound design by reducing the number of molecules requiring synthesis and analysis. One of the main issues in de novo design is ensuring that the molecules generated are synthesisable. In this project, a method is developed that enables virtual synthesis using rules derived from reaction sequences. Individual reactions taken from reaction databases were connected to form reaction networks. Reaction sequences were then extracted by tracing paths through the network and used to create ‘reaction sequence vectors’ (RSVs) which encode the differences between the start and end points of the sequences. RSVs can be applied to molecules to generate virtual products which are based on literature precedents. The RSVs were applied to structure-activity relationship (SAR) exploration using examples taken from the literature. They were shown to be effective in expanding the chemical space that is accessible from the given starting materials. Furthermore, each virtual product is associated with a potential synthetic route. They were then applied in de novo design scenarios with the aim of generating molecules that are predicted to be active using SAR models. Using a collection of RSVs with a set of small molecules as starting materials for de novo design showed that the method was capable of producing many useful, synthesisable compounds worthy of future study. The RSV method was then compared with a previously published method that is based on individual reactions (reaction vectors or RVs). The RSV approach was shown to be considerably faster than de novo design using RVs; however, the diversity of products was more limited.
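
    The notion of a vector that encodes the difference between the start and end points of a reaction sequence can be sketched as follows. This is a toy illustration under stated assumptions: molecules are described by hypothetical bond-type counts such as `"C-OH"`, and the helper names `reaction_vector` and `apply_vector` are invented here. The descriptors actually used in the thesis are not given in the abstract.

```python
from collections import Counter


def reaction_vector(start, end):
    """Difference between end and start descriptors (features gained minus lost)."""
    vec = Counter(end)
    vec.subtract(start)
    return {feature: n for feature, n in vec.items() if n != 0}


def apply_vector(descriptor, vec):
    """Apply a vector to a start descriptor; None if a feature to remove is missing."""
    out = Counter(descriptor)
    for feature, change in vec.items():
        out[feature] += change
        if out[feature] < 0:
            return None
    return +out  # unary plus drops features whose count fell to zero


# Hypothetical descriptors for the start and end of an acid-to-ester sequence.
acid = Counter({"C-C": 1, "C=O": 1, "C-OH": 1})
ester = Counter({"C-C": 1, "C=O": 1, "C-OC": 1})
vec = reaction_vector(acid, ester)

# The same vector applied to a different, larger acid yields a virtual product.
virtual_product = apply_vector(Counter({"C-C": 3, "C=O": 1, "C-OH": 1}), vec)
```

    Because only the start and end points enter `reaction_vector`, a multi-step sequence collapses into a single vector, which is consistent with the speed advantage reported for RSVs over single-step RVs. A real implementation would also need to rebuild a molecular structure from the modified descriptor, which this sketch does not attempt.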

    Kinetic model construction using chemoinformatics

    Kinetic models of chemical processes not only provide an alternative to costly experiments; they also have the potential to accelerate the pace of innovation in developing new chemical processes or in improving existing ones. Kinetic models are most powerful when they reflect the underlying chemistry by incorporating elementary pathways between individual molecules. The downside of this high level of detail is that the complexity and size of the models also steadily increase, such that the models eventually become too difficult to be manually constructed. Instead, computers are programmed to automate the construction of these models, and make use of graph theory to translate chemical entities such as molecules and reactions into computer-understandable representations. This work studies the use of automated methods to construct kinetic models. More particularly, the need to account for the three-dimensional arrangement of atoms in molecules and reactions of kinetic models is investigated and illustrated by two case studies. First of all, the thermal rearrangement of two monoterpenoids, cis- and trans-2-pinanol, is studied. A kinetic model that accounts for the differences in reactivity and selectivity of both pinanol diastereomers is proposed. Secondly, a kinetic model for the pyrolysis of the fuel “JP-10” is constructed and highlights the use of state-of-the-art techniques for the automated estimation of thermochemistry of polycyclic molecules. A new code is developed for the automated construction of kinetic models and takes advantage of the advances made in the field of chemo-informatics to tackle fundamental issues of previous approaches. Novel algorithms are developed for three important aspects of automated construction of kinetic models: the estimation of symmetry of molecules and reactions, the incorporation of stereochemistry in kinetic models, and the estimation of thermochemical and kinetic data using scalable structure-property methods. 
Finally, the application of the code is illustrated by the automated construction of a kinetic model for alkylsulfide pyrolysis.

    Large and multi scale mechanistic modeling of Diels-Alder reactions

    The [4+2] cycloaddition reaction between conjugated dienes and substituted alkenes is known as the Diels-Alder (DA) reaction, in honor of two German chemists, Otto Diels and Kurt Alder, who first reported this marvelous chemical transformation. The DA reaction is one of the most popular reactions in organic chemistry, allowing for the regio- and stereospecific establishment of six-membered rings with up to four stereogenic centers. This pericyclic reaction has found many applications in areas as diverse as natural products chemistry, polymer chemistry, and agrochemistry. Over the past decades, the mechanism of the DA reaction has been the subject of numerous studies, dealing with questions as diverse as the mechanistic pathway, the synchronicity, the use of catalysts, and the effect of solvents and salts. Fullerenes, for example (and particularly [60]fullerene), have been found to act as good dienophiles in DA reactions, to the extent that many functionalized fullerenes with interesting applications are still synthesized by reacting C60 with dienes. However, despite the very abundant literature on the mechanism of the DA reaction, some pertinent questions are still pending, including the prediction of transition state (TS) geometries and the modeling of DA reactions involving large systems, such as those of C60 fullerene. It must be emphasized that TSs are not easy to predict, mainly because many existing algorithms require that the search be initiated from a good starting point (a guess TS), which must be very similar to the actual TS. This problem is even more difficult when many TSs are to be located, as may be the case in large-scale studies.
Moreover, due to the large size of the C60 molecule, the use of accurate high-level computational methods to investigate its reactivity towards dienes is computationally costly, implying the need to find the best trade-off between accuracy and computational cost. Therefore, the present study was carried out to contribute to solving the problems of large-scale prediction of DA transition state geometries and the multi-scale modeling of C60 fullerene DA reactions. To address the first problem (large-scale prediction of TSs), we have developed a Python program named “AMADAR”, which predicts an unlimited number of DA transition states, using only the SMILES strings of the cycloadducts. AMADAR is customizable and allows for the description of intramolecular DA reactions as well as systems resulting in competing paths. In addition, the AMADAR tool contains two separate modules that perform reaction force analyses and atomic decomposition of energy derivatives from the predicted intrinsic reaction coordinate (IRC) paths. The performance of AMADAR was assessed using 2000 DA cycloadducts and showed a success rate of ~95%. Most of the errors were due to basis set inconsistencies or convergence issues that we are still working on. Furthermore, a set of 150 IRC paths generated by the AMADAR program was analyzed to gain insight into the (a)synchronicity of DA reactions. This investigation confirmed that the reaction force constant (the second derivative of the system energy with respect to the reaction coordinate) is a good indicator of synchronicity in DA reactions. A close inspection of the profile of the reaction force constant has enabled us to propose an alternative classification of DA reactions based on their degree of synchronicity: (quasi-)synchronous, moderately asynchronous, asynchronous, and likely two-step DA reactions.
Natural population analyses seemed to indicate that the global maximum of the reaction force constant could be identified with the formation of all the bonds in the reaction site. Finally, the atomic resolution of energy derivatives suggested that the mechanism of the DA reaction involves two inner elementary processes associated with the formation of each C-C bond. A striking mechanistic difference between synchronous and asynchronous DA reactions emerging from this study is that, in asynchronous reactions, the driving and retarding forces are mainly caused by the fast- and slow-forming bonds (elementary processes), respectively, while in synchronous ones both elementary processes retard and drive the process concomitantly and equivalently. Regarding the DA reaction of C60 fullerene, which was considered to illustrate the problem of multiscale modeling, we have constructed 12 ONIOM2 and 10 ONIOM3 models combining five semi-empirical methods (AM1, PM3, PM3MM, PDDG, PM6) and the LDA(SVWN) functional in conjunction with the B3LYP/6-31G(d) level. Their accuracy and efficiency were then assessed in comparison with the pure B3LYP/6-31G(d) level, considering first the DA reaction between C60 and cyclopentadiene, for which experimental data are available. Further, different DFT functionals were employed in place of the B3LYP functional to describe the higher layer of the best ONIOM partition, and the results obtained were compared to experimental data. At this step, the ONIOM2(M06-2X/6-31G(d):SVWN/STO-3G) model, where the higher layer encompasses the diene and the pyracyclene portion of C60, was found to provide the best trade-off between accuracy and cost with respect to experimental data. This model showed errors lower than 2.6 and 2.0 kcal/mol for the estimation of the activation and reaction enthalpies, respectively.
We have also demonstrated, by comparing several ONIOM2(DFT/6-31G(d):SVWN/STO-3G) models, the importance of dispersion corrections in the accurate estimation of reaction and activation energies. Finally, we have considered a set of 21 dienes, including anthracene, 1,3-butadiene, 1,3-cyclopentadiene, furan, thiophene, selenothiophene, pyrrole and their mono-cyano and hydroxyl derivatives, to gain insight into the DA reaction of C60 using the best ONIOM2(M06-2X/6-31G(d):SVWN/STO-3G) model. For a given diene and its derivatives, the analysis of frontier molecular orbitals provides a consistent explanation for the substituent effect on the activation barrier. It revealed that electron-donating (electron-withdrawing) groups such as –OH (–CN) lower (raise) the activation barrier of the reaction by narrowing (widening) the HOMO(diene)–LUMO(C60) gap and consequently enhancing (weakening) the interaction between the two reactants. Further, the decomposition of the activation energy into the strain and interaction components suggested that, for a given diene, electron-donating groups (here –OH) diminish the height of the activation barrier not only by favoring the attractive interaction between the diene and C60, but also by reducing the strain energy of the system; the opposite effect is observed for electron-withdrawing groups (here –CN). In contrast with some previous findings on typical DA reactions, we could not infer any general rule applicable to the entire dataset for the prediction of activation energies, because the latter do not correlate well with the TS polarity, the electrophilicity of the diene, or the reaction energy. Thesis (MSc) -- Faculty of Science, Chemistry, 202
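
    For reference, the two quantities this abstract relies on can be written out explicitly. The symbols are assumptions, since the abstract gives none: E(ξ) is the energy along the reaction coordinate ξ, F the reaction force, κ the reaction force constant, and the double dagger marks values at the transition state. The forms below are the standard definitions and are offered as a reading aid, not as the thesis's own notation.

```latex
% Reaction force and reaction force constant along the reaction coordinate \xi
\[
  F(\xi) = -\frac{\mathrm{d}E(\xi)}{\mathrm{d}\xi},
  \qquad
  \kappa(\xi) = \frac{\mathrm{d}^{2}E(\xi)}{\mathrm{d}\xi^{2}}
              = -\frac{\mathrm{d}F(\xi)}{\mathrm{d}\xi}
\]

% Decomposition of the activation energy into strain and interaction terms
\[
  \Delta E^{\ddagger}
    = \Delta E^{\ddagger}_{\mathrm{strain}}
    + \Delta E^{\ddagger}_{\mathrm{int}}
\]
```

    In the second relation, the strain term is the energy needed to deform the diene and C60 into their transition-state geometries, and the interaction term is the energy of bringing the deformed fragments together. A substituent can therefore lower the barrier through either term, as reported for –OH.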

    Enhancing Reaction-based de novo Design using Machine Learning

    De novo design is a branch of chemoinformatics that is concerned with the rational design of molecular structures with desired properties, which specifically aims at achieving suitable pharmacological and safety profiles when applied to drug design. Scoring, construction, and search methods are the main components that are exploited by de novo design programs to explore the chemical space to encourage the cost-effective design of new chemical entities. In particular, construction methods are concerned with providing strategies for compound generation to address issues such as drug-likeness and synthetic accessibility. Reaction-based de novo design consists of combining building blocks according to transformation rules that are extracted from collections of known reactions, intending to restrict the enumerated chemical space into a manageable number of synthetically accessible structures. The reaction vector is an example of a representation that encodes topological changes occurring in reactions, which has been integrated within a structure generation algorithm to increase the chances of generating molecules that are synthesisable. The general aim of this study was to enhance reaction-based de novo design by developing machine learning approaches that exploit publicly available data on reactions. A series of algorithms for reaction standardisation, fingerprinting, and reaction vector database validation were introduced and applied to generate new data on which the entirety of this work relies. First, these collections were applied to the validation of a new ligand-based design tool. The tool was then used in a case study to design compounds which were eventually synthesised using very similar procedures to those suggested by the structure generator. A reaction classification model and a novel hierarchical labelling system were then developed to introduce the possibility of applying transformations by class. 
The model was augmented with an algorithm for confidence estimation, and was used to classify two datasets from industry and the literature. Results from the classification suggest that the model can be used effectively to gain insights into the nature of reaction collections. Classified reactions were further processed to build a reaction class recommendation model capable of suggesting appropriate reaction classes to apply to molecules according to their fingerprints. The model was validated, then integrated within the reaction vector-based design framework, which was assessed on its performance against the baseline algorithm. Results from the de novo design experiments indicate that the use of the recommendation model leads to higher synthetic accessibility and more efficient management of computational resources.

    Computational methods for small molecules

    Metabolism is the system of chemical reactions sustaining life in the cells of living organisms. It is responsible for cellular processes that break down nutrients for energy and produce building blocks for necessary molecules. The study of metabolism is vital to many disciplines in medicine and pharmacy. Chemical reactions operate on small molecules called metabolites, which form the core of metabolism. In this thesis we propose efficient computational methods for small molecules in metabolic applications. We discuss four distinct studies covering two major themes: the atom-level description of biochemical reactions, and the analysis of tandem mass spectrometric measurements of metabolites. In the first part we study atom-level descriptions of organic reactions. We begin by proposing an optimal algorithm for determining the atom-to-atom correspondences between the reactant and product metabolites of organic reactions. In addition, we introduce a graph edit distance based cost as the mathematical formalism to determine the optimality of atom mappings. We continue by proposing a compact single-graph representation of reactions using the atom mappings. We investigate the utility of the new representation in a reaction function classification task, where a descriptive category of the reaction's function is predicted. To facilitate the prediction, we introduce the first feasible path-based graph kernel, which describes the reactions as path sequences and achieves high classification accuracy. In the second part we turn our focus to analysing tandem mass spectrometric measurements of metabolites. In a tandem mass spectrometer, an input molecule structure is fragmented into substructures or fragments, whose masses are observed. We begin by studying the fragment identification problem. A combinatorial algorithm is presented to enumerate candidate substructures based on the given masses.
We also demonstrate the usefulness of utilising approximated bond energies as a cost function to rank the candidate structures according to their chemical feasibility. We propose fragmentation tree models to describe the dependencies between fragments for higher identification accuracy. We continue by studying a closely related problem where an unknown metabolite is elucidated based on its tandem mass spectrometric fragment signals. This metabolite identification task is an important problem in metabolomics, underpinning the subsequent modelling and analysis efforts. We propose an automatic machine learning framework to predict a set of structural properties of the unknown metabolite. The properties are turned into candidate structures by a novel statistical model. We introduce the first mass spectral kernels and explore three feature classes to facilitate the prediction. The kernels introduce support for high-accuracy mass spectrometric measurements for enhanced predictive accuracy.
This thesis presents efficient computational methods for small molecules in metabolic applications. Metabolism is the system of chemical reactions that sustains life at the cellular level. Metabolic processes break down nutrients into energy and into building blocks for producing the molecules that cells need. The small molecules transformed by chemical reactions are called metabolites. This thesis contains four independent studies, divided thematically into the atom-level description of biochemical reactions and the analysis of mass spectrometric measurements of metabolites. The first part of the thesis deals with atom-level descriptions of biochemical reactions. The thesis introduces an optimal algorithm for determining atom mappings between the reactants and products of reactions. Optimality is defined by a cost function based on graph edit distance. An optimal atom mapping allows a reaction to be described unambiguously by a single graph. The new reaction representation is used in a reaction function prediction task, which aims to determine automatically a category that describes the reaction verbally. The thesis presents a path-based graph kernel that describes reactions as path sequences of atoms, in contrast to earlier walk sequences, and achieves better prediction accuracy. The second part of the thesis analyses tandem mass spectrometric measurements of metabolites. A tandem mass spectrometer breaks the input molecule under analysis into fragments and measures their mass-to-charge ratios. The thesis presents a thorough combinatorial algorithm for identifying fragments. The method's cost function is based on comparing the bond energies of the fragments. Finally, the thesis presents fragmentation trees, which can be used to model relationships between fragments and achieve better identification accuracy. Alongside fragment identification, the metabolites under analysis can also be identified. The problem is significant and a prerequisite for analyses of metabolism. The thesis presents a machine learning method that predicts features describing the structure of an unknown metabolite and statistically forms structure predictions from them. The method introduces the first kernel functions designed specifically for mass spectrometry data and achieves good prediction accuracy
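
The graph edit distance cost for atom mappings mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's algorithm: the function names `mapping_cost` and `best_mapping`, the toy molecules, and the brute-force search are all assumptions made for illustration, and the sketch ignores bond orders and hydrogens. It scores a mapping by the number of bonds that must be broken or formed, and finds a cheapest element-preserving mapping by trying every permutation, which is feasible only for a handful of atoms.

```python
from itertools import permutations


def mapping_cost(reactant_bonds, product_bonds, mapping):
    """Count bonds broken or formed under a mapping (reactant index -> product index)."""
    mapped = {frozenset((mapping[a], mapping[b])) for a, b in reactant_bonds}
    target = {frozenset(bond) for bond in product_bonds}
    # Bonds present on only one side are edit operations (deletions or insertions).
    return len(mapped ^ target)


def best_mapping(reactant_atoms, reactant_bonds, product_atoms, product_bonds):
    """Exhaustive search over element-preserving bijections (toy inputs only)."""
    n = len(reactant_atoms)
    best = None
    for perm in permutations(range(n)):
        # Skip mappings that would turn one element into another.
        if any(reactant_atoms[i] != product_atoms[perm[i]] for i in range(n)):
            continue
        cost = mapping_cost(reactant_bonds, product_bonds, perm)
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best


# Hypothetical toy rearrangement of heavy atoms: C-C-O becomes C-O-C.
reactant_atoms = ["C", "C", "O"]
reactant_bonds = [(0, 1), (1, 2)]
product_atoms = ["C", "O", "C"]
product_bonds = [(0, 1), (1, 2)]

cost, mapping = best_mapping(reactant_atoms, reactant_bonds, product_atoms, product_bonds)
print(cost, mapping)
```

In this toy case every valid mapping breaks one bond and forms one, so the minimum cost is 2. Realistic reactions need a far more efficient search than enumeration over all permutations.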

    Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks

    Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experiments guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious, time-consuming, and requires years of experience to perform successfully. Hence, the synthesis is one of the critical limiting factors in molecular discovery. In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields. 
Lastly, having approached all chemical reaction tasks with atom-mapping independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned while being trained self-supervised on chemical reactions. My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented in a text notation. It is as if atoms are my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions - atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials