349 research outputs found
Evolution of Metabolic Networks: A Computational Framework
Background: The metabolic architectures of extant organisms share many key pathways such as the citric acid
cycle, glycolysis, or the biosynthesis of most amino acids. Several competing hypotheses for the evolutionary
mechanisms that shape metabolic networks have been discussed in the literature, each of which finds support
from comparative analysis of extant genomes. Alternatively, the principles of metabolic evolution can be studied
by direct computer simulation. This requires, however, an explicit implementation of all pertinent components: a
universe of chemical reactions upon which the metabolism is built, an explicit representation of the enzymes that
implement the metabolism, of a genetic system that encodes these enzymes, and of a fitness function that can
be selected for.
Results: We describe here a simulation environment that implements all these components in a simplified way so
that large-scale evolutionary studies are feasible. We employ an artificial chemistry that views chemical reactions as
graph rewriting operations and utilizes a toy-version of quantum chemistry to derive thermodynamic parameters.
Minimalist organisms with simple string-encoded genomes produce model ribozymes whose catalytic activity is
determined by an ad hoc mapping between their secondary structure and the transition state graphs that they
stabilize. Fitness is computed utilizing the ideas of metabolic flux analysis. We present an implementation of the
complete system and first simulation results.
Conclusions: The simulation system presented here allows coherent investigations into the evolutionary mechanisms of the first steps of metabolic evolution using a self-consistent toy universe.
Computational Studies on the Evolution of Metabolism
Living organisms throughout evolution have developed desirable properties, such as the ability
to maintain functionality despite changes in the environment or their inner structure, the
formation of functional modules, from metabolic pathways to organs, and most essentially
the capacity to adapt and evolve in a process called natural selection. It can be observed in
the metabolic networks of modern organisms that many key pathways such as the citric acid
cycle, glycolysis, or the biosynthesis of most amino acids are common to all of them.
Understanding the evolutionary mechanisms behind this development of complex biological
systems is an intriguing and important task of current research in biology as well as artificial
life. Several competing hypotheses for the formation of metabolic pathways and the mechanisms
that shape metabolic networks have been discussed in the literature, each of which finds
support from comparative analysis of extant genomes. However, while they are powerful tools
for the investigation of metabolic evolution, these traditional methods do not allow us to look
back in evolution far enough to reach the time when metabolism had to emerge and evolve to the
form we can observe today. To this end, simulation studies have been introduced to discover
the principles of metabolic evolution and the sources for the emergence of metabolic properties.
These approaches differ considerably in the realism and explicitness of the underlying
models. A difficult trade-off between realism and computational feasibility has to be made
and further modeling decisions on many scales have to be taken into account, requiring the
combination of knowledge from different fields such as chemistry, physics, biology and last
but not least also computer science.
In this thesis, a novel computational model for the in silico evolution of early metabolism
is introduced. It comprises all the components on different scales needed to resemble a situation of
evolving metabolic protocells in an RNA world. To this end, the model contains a minimal
RNA-based genetics and an evolving metabolism of catalytic ribozymes that manipulate a
rich underlying chemistry. To allow the metabolic organization to escape from the confines
of the chemical space set by the initial conditions of the simulation and, more generally, to allow
open-ended evolution, an evolvable sequence-to-function map is used. At the heart of the metabolic
subsystem is a graph-based artificial chemistry equipped with a built-in thermodynamics. The
generation of the metabolic reaction network is realized as a rule-based stochastic simulation.
The necessary reaction rates are calculated from the chemical graphs of the reactants on
the fly. The selection procedure among the population of protocells is based on the optimal metabolic yield of the protocells, which is computed using flux balance analysis.
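For orientation, flux balance analysis is usually posed as a linear program. The block below is the standard textbook form, not necessarily the exact objective used in the thesis; the symbols are assumptions introduced here: S is the stoichiometric matrix of a protocell's reaction network, v the vector of reaction fluxes, c a weight vector that picks out the yield flux, and v^min, v^max the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for every reaction } j .
\end{aligned}
```

The constraint S v = 0 expresses the steady-state assumption: every internal metabolite is produced exactly as fast as it is consumed, so the optimum is the best yield the network can sustain.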
The introduced computational model allows for profound investigations of the evolution of
early metabolism and the underlying evolutionary mechanisms. One application in this thesis
is the study of the formation of metabolic pathways. To this end, four established hypotheses,
namely backward evolution, forward evolution, patchwork evolution and the shell
hypothesis, are discussed within the scope of this in silico evolution study. The metabolic
pathways of the networks, evolved in various simulation runs, are determined and analyzed
in terms of their evolutionary direction. The simulation results suggest that the seemingly
mutually exclusive hypotheses may well be compatible when considering that different processes
dominate different phases in the evolution of a metabolic system. Further, it is found
that forward evolution shapes the metabolic network in the very early steps of evolution. In
later and more complex stages, enzyme recruitment supersedes forward evolution, keeping a
core set of pathways from the early phase. Backward evolution can only be observed under
conditions of steady environmental change. Additionally, the evolutionary histories of enzymes
and metabolites were studied on the network level as well as for single instances, showing a
great variety of evolutionary mechanisms at work.
The second major focus of the in silico evolutionary study is the emergence of complex system
properties, such as robustness and modularity. To this end several techniques to analyze the
metabolic systems were used. The measures for complex properties stem from the fields of
graph theory, steady state analysis and neutral network theory. Some are used in general
network analysis and others were developed specifically for the purpose introduced in this
work. To discover potential sources for the emergence of system properties, three different
evolutionary scenarios were tested and compared. The first two scenarios are the same as
for the first part of the investigation, one scenario of evolution under static conditions and
one incorporating a steady change in the set of "food" molecules. A third scenario was
added that also simulates a static evolution but with an increased mutation rate and regular
events of horizontal gene transfer between protocells of the population. The comparison of all
three scenarios with real world metabolic networks shows a significant similarity in structure
and properties. Among the three scenarios, the two static evolutions yield the most robust
metabolic networks; however, the networks evolved under environmental change exhibit their
own strategy for achieving a robustness more suited to their conditions. As expected from theory,
horizontal gene transfer and changes in the environment seem to produce higher degrees
of modularity in metabolism. The two scenarios develop rather different kinds of modularity:
while horizontal gene transfer provides for more isolated modules, the modules of the second
scenario are far more interconnected.
Computer Aided Synthesis Prediction to Enable Augmented Chemical Discovery and Chemical Space Exploration
The drug-like chemical space is estimated to contain 10^60 molecules, and the largest generated database (GDB) obtained by the Reymond group contains 165 billion molecules with up to 17 heavy atoms. Furthermore, deep learning techniques to explore regions of chemical space are becoming more popular. However, the key to realizing the generated structures experimentally lies in chemical synthesis, the planning of which was previously limited to manual work or slow computer-assisted synthesis planning (CASP) models. Despite the 60-year history of CASP, few synthesis planning tools have been open-sourced to the community. In this thesis I co-led the development of, and investigated, one of the only fully open-source synthesis planning tools, AiZynthFinder, trained on both public and proprietary datasets consisting of up to 17.5 million reactions. This enables synthesis-guided exploration of the chemical space in a high-throughput manner, to bridge the gap between compound generation and experimental realisation.
I first investigate both public and proprietary reaction data, and their influence on route-finding capability. Furthermore, I develop metrics for assessment of retrosynthetic prediction, single-step retrosynthesis models, and automated template extraction workflows. This is supplemented by a comparison of the underlying datasets and their corresponding models.
Given the prevalence of ring systems in the GDB and wider medicinal chemistry domain, I developed 'Ring Breaker', a data-driven approach to enable the prediction of ring-forming reactions. I demonstrate its utility on frequently found and unprecedented ring systems, in agreement with literature syntheses. Additionally, I highlight its potential for incorporation into CASP tools, and outline methodological improvements that improve route-finding capability.
To tackle the challenge of model throughput, I report a machine learning (ML) based classifier called the retrosynthetic accessibility score (RAscore), which assesses the likelihood of finding a synthetic route using AiZynthFinder. The RAscore computes at least 4,500 times faster than AiZynthFinder. This opens the possibility of pre-screening millions of virtual molecules from enumerated databases or generative models for synthesis-informed compound prioritization.
Finally, I combine chemical library visualization with synthetic route prediction to facilitate experimental engagement with synthetic chemists. I enable the navigation of chemical property space by using interactive visualization to deliver associated synthetic data as endpoints. This aids in the prioritization of compounds. The ability to view synthetic route information alongside structural descriptors facilitates a feedback mechanism for the improvement of CASP tools and enables rapid hypothesis testing. I demonstrate the workflow as applied to the GDB databases to augment compound prioritization and synthetic route design.
Structure generation and de novo design using reaction networks
This project is concerned with de novo molecular design whereby novel molecules are built in silico and evaluated against properties relevant to biological activity, such as physicochemical properties and structural similarity to active compounds. The aim is to encourage cost-effective compound design by reducing the number of molecules requiring synthesis and analysis.
One of the main issues in de novo design is ensuring that the molecules generated are synthesisable. In this project, a method is developed that enables virtual synthesis using rules derived from reaction sequences. Individual reactions taken from reaction databases were connected to form reaction networks. Reaction sequences were then extracted by tracing paths through the network and used to create 'reaction sequence vectors' (RSVs) which encode the differences between the start and end points of the sequences. RSVs can be applied to molecules to generate virtual products which are based on literature precedents.
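The idea of encoding a sequence as the difference between its start and end points can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the project's implementation: molecules are reduced to counts of made-up descriptor strings such as "C-OH", and the function names `difference_vector` and `apply_vector` are hypothetical.

```python
from collections import Counter

def difference_vector(start, end):
    """Descriptor counts gained (+) and lost (-) going from start to end."""
    diff = Counter(end)
    diff.subtract(start)  # subtract() keeps negative counts, unlike the - operator
    return {key: n for key, n in diff.items() if n != 0}

def apply_vector(molecule, vector):
    """Apply a difference vector to a molecule's descriptor counts.

    Returns None when the molecule lacks a descriptor the vector removes.
    """
    result = Counter(molecule)
    for key, delta in vector.items():
        result[key] += delta
        if result[key] < 0:
            return None
    return {key: n for key, n in result.items() if n > 0}

# Toy two-step sequence (illustrative descriptors only):
# alcohol -> carbonyl intermediate -> imine
start = {"C-C": 1, "C-OH": 1}
middle = {"C-C": 1, "C=O": 1}
end = {"C-C": 1, "C=N": 1}

step1 = difference_vector(start, middle)   # single-reaction vector
step2 = difference_vector(middle, end)     # single-reaction vector
rsv = difference_vector(start, end)        # sequence vector: start vs. end only

# Apply the sequence vector to a different starting material.
product = apply_vector({"C-C": 2, "C-OH": 1}, rsv)
not_applicable = apply_vector({"C-C": 2}, rsv)
```

In this toy encoding the intermediate's "C=O" descriptor cancels out of `rsv`, so one application of the sequence vector stands in for both steps.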
The RSVs were applied to structure-activity relationship (SAR) exploration using examples taken from the literature. They were shown to be effective in expanding the chemical space that is accessible from the given starting materials. Furthermore, each virtual product is associated with a potential synthetic route. They were then applied in de novo design scenarios with the aim of generating molecules that are predicted to be active using SAR models. Using a collection of RSVs with a set of small molecules as starting materials for de novo design showed that the method was capable of producing many useful, synthesisable compounds worthy of future study.
The RSV method was then compared with a previously published method that is based on individual reactions (reaction vectors or RVs). The RSV approach was shown to be considerably faster than de novo design using RVs; however, the diversity of products was more limited.
Kinetic model construction using chemoinformatics
Kinetic models of chemical processes not only provide an alternative to costly experiments; they also have the potential to accelerate the pace of innovation in developing new chemical processes or in improving existing ones. Kinetic models are most powerful when they reflect the underlying chemistry by incorporating elementary pathways between individual molecules. The downside of this high level of detail is that the complexity and size of the models also steadily increase, such that the models eventually become too difficult to be manually constructed. Instead, computers are programmed to automate the construction of these models, and make use of graph theory to translate chemical entities such as molecules and reactions into computer-understandable representations.
This work studies the use of automated methods to construct kinetic models. More particularly, the need to account for the three-dimensional arrangement of atoms in molecules and reactions of kinetic models is investigated and illustrated by two case studies. First of all, the thermal rearrangement of two monoterpenoids, cis- and trans-2-pinanol, is studied. A kinetic model that accounts for the differences in reactivity and selectivity of both pinanol diastereomers is proposed. Secondly, a kinetic model for the pyrolysis of the fuel “JP-10” is constructed and highlights the use of state-of-the-art techniques for the automated estimation of thermochemistry of polycyclic molecules.
A new code is developed for the automated construction of kinetic models. It takes advantage of the advances made in the field of chemoinformatics to tackle fundamental issues of previous approaches. Novel algorithms are developed for three important aspects of automated construction of kinetic models: the estimation of symmetry of molecules and reactions, the incorporation of stereochemistry in kinetic models, and the estimation of thermochemical and kinetic data using scalable structure-property methods. Finally, the application of the code is illustrated by the automated construction of a kinetic model for alkylsulfide pyrolysis.
Computational methods for small molecules
Metabolism is the system of chemical reactions sustaining life in the cells of living organisms. It is responsible for cellular processes that break down nutrients for energy and produce building blocks for necessary molecules. The study of metabolism is vital to many disciplines in medicine and pharmacy. Chemical reactions operate on small molecules called metabolites, which form the core of metabolism. In this thesis we propose efficient computational methods for small molecules in metabolic applications. We discuss four distinct studies covering two major themes: the atom-level description of biochemical reactions, and the analysis of tandem mass spectrometric measurements of metabolites.
In the first part we study atom-level descriptions of organic reactions. We begin by proposing an optimal algorithm for determining the atom-to-atom correspondences between the reactant and product metabolites of organic reactions. In addition, we introduce a graph edit distance based cost as the mathematical formalism to determine optimality of atom mappings. We continue by proposing a compact single-graph representation of reactions using the atom mappings. We investigate the utility of the new representation in a reaction function classification task, where a descriptive category of the reaction's function is predicted. To facilitate the prediction, we introduce the first feasible path-based graph kernel, which describes reactions as path sequences and achieves high classification accuracy.
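To make the notion of an edit-distance cost on atom mappings concrete, here is a naive sketch. It assumes the cost of a mapping is simply the number of bonds broken plus bonds formed, and it finds the optimum by trying every element-preserving permutation. That brute-force search is a stand-in chosen for brevity, not the algorithm proposed in the thesis; all names and the toy reaction are hypothetical.

```python
from itertools import permutations

def mapping_cost(mapping, reactant_bonds, product_bonds):
    """Count bonds broken plus bonds formed under a reactant->product atom mapping."""
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in reactant_bonds}
    target = {frozenset(bond) for bond in product_bonds}
    return len(mapped ^ target)  # symmetric difference = edited bonds

def optimal_atom_mapping(r_elems, r_bonds, p_elems, p_bonds):
    """Exhaustively search element-preserving bijections for the cheapest one.

    Factorial time: usable only on toy inputs.
    """
    n = len(r_elems)
    best = None
    for perm in permutations(range(n)):
        if any(r_elems[i] != p_elems[perm[i]] for i in range(n)):
            continue  # atoms may only map onto atoms of the same element
        cost = mapping_cost(perm, r_bonds, p_bonds)
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best

# Toy substitution: S-C-C-O plus free N  ->  S-C-C-N plus free O
r_elems = ["C", "C", "O", "S", "N"]
r_bonds = [(0, 1), (1, 2), (0, 3)]
p_elems = ["C", "C", "N", "S", "O"]
p_bonds = [(0, 1), (1, 2), (0, 3)]

best_cost, best_map = optimal_atom_mapping(r_elems, r_bonds, p_elems, p_bonds)
```

Here the cheapest mapping keeps both carbons in place and costs two edits (one C-O bond broken, one C-N bond formed), whereas swapping the two carbons would cost four.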
In the second part we turn our focus to analysing tandem mass spectrometric measurements of metabolites. In a tandem mass spectrometer, an input molecule structure is fragmented into substructures or fragments, whose masses are observed. We begin by studying the fragment identification problem. A combinatorial algorithm is presented to enumerate candidate substructures based on the given masses. We also demonstrate the usefulness of utilising approximated bond energies as a cost function to rank the candidate structures according to their chemical feasibility. We propose fragmentation tree models to describe the dependencies between fragments for higher identification accuracy.
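The fragment identification step can be illustrated with a small exhaustive search. The sketch below is an assumption-laden toy rather than the thesis algorithm: the molecule is a graph of "united" atom groups with rounded integer masses, candidates are connected atom subsets whose mass matches a peak within a tolerance, and each candidate is ranked by the summed energy of the bonds that must be cut. The bond energies are rough illustrative values, and the function names are hypothetical.

```python
from itertools import combinations

def is_connected(atoms, bonds):
    """True if the atom subset is connected using only bonds inside the subset."""
    atoms = set(atoms)
    start = next(iter(atoms))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for u, v in bonds:
            if u == current and v in atoms and v not in seen:
                seen.add(v)
                stack.append(v)
            elif v == current and u in atoms and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == atoms

def candidate_fragments(masses, bonds, peak, tol=0.01):
    """Enumerate connected atom subsets whose total mass matches `peak`.

    masses: {atom_id: mass}; bonds: {(u, v): approximate bond energy}.
    Returns (cut_cost, atoms) pairs, cheapest cut first.
    """
    found = []
    ids = sorted(masses)
    for size in range(1, len(ids) + 1):
        for subset in combinations(ids, size):
            if abs(sum(masses[a] for a in subset) - peak) > tol:
                continue
            if not is_connected(subset, bonds):
                continue
            inside = set(subset)
            # Cost: energy of every bond crossing the fragment boundary.
            cost = sum(e for (u, v), e in bonds.items()
                       if (u in inside) != (v in inside))
            found.append((cost, subset))
    return sorted(found)

# Toy ethanol skeleton CH3-CH2-OH with hydrogens folded into each group.
masses = {0: 15, 1: 14, 2: 17}
bonds = {(0, 1): 346, (1, 2): 358}  # rough C-C and C-O energies, kJ/mol

matches_31 = candidate_fragments(masses, bonds, peak=31)
matches_32 = candidate_fragments(masses, bonds, peak=32)
```

A peak at 31 is explained by the CH2-OH fragment at the cost of cutting the C-C bond, while a peak at 32 has no candidate: the only subset with that mass (CH3 plus OH) is not connected.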
We continue by studying a closely related problem where an unknown metabolite is elucidated based on its tandem mass spectrometric fragment signals. This metabolite identification task is an important problem in metabolomics, underpinning the subsequent modelling and analysis efforts. We propose an automatic machine learning framework to predict a set of structural properties of the unknown metabolite. The properties are turned into candidate structures by a novel statistical model. We introduce the first mass spectral kernels and explore three feature classes to facilitate the prediction. The kernels introduce support for high-accuracy mass spectrometric measurements for enhanced predictive accuracy.
This dissertation presents efficient computational methods for small molecules in metabolic applications. Metabolism is the system of chemical reactions that sustains life at the cellular level. Metabolic processes break down nutrients into energy and into building blocks for producing the molecules that cells need. The small molecules modified by chemical reactions are called metabolites. The dissertation contains four independent studies, divided thematically into the atom-level description of biochemical reactions and the analysis of mass spectrometric measurements of metabolites.
The first part of the dissertation deals with atom-level descriptions of biochemical reactions. It presents an optimal algorithm for determining the atom mappings between the reactants and products of reactions. Optimality is defined by a cost function based on graph edit distance. An optimal atom mapping makes it possible to describe a reaction unambiguously with a single graph. The new reaction representation is used in a reaction function prediction task, which aims to determine automatically a category that describes the reaction in words. The dissertation presents a path-based graph kernel that describes reactions as path sequences of atoms, in contrast to the earlier walk sequences, and achieves better prediction accuracy.
The second part of the dissertation analyses tandem mass spectrometric measurements of metabolites. A tandem mass spectrometer breaks the input molecule under analysis into fragments and measures their mass-to-charge ratios. The dissertation presents a thorough combinatorial algorithm for identifying the fragments. The cost function of the method is based on comparing the bond energies of the fragments. Finally, the dissertation presents fragmentation trees, which make it possible to model the relationships between fragments and achieve better identification accuracy.
Alongside fragment identification, the metabolites under analysis can also be identified. The problem is significant and a prerequisite for analyses of metabolism. The dissertation presents a machine learning method that predicts features describing the structure of an unknown metabolite and, from these, forms structure predictions statistically. The method introduces the first kernel functions suited specifically to mass spectrometry data and achieves good prediction accuracy.
Enhancing Reaction-based de novo Design using Machine Learning
De novo design is a branch of chemoinformatics that is concerned with the rational design of molecular structures with desired properties, which specifically aims at achieving suitable pharmacological and safety profiles when applied to drug design. Scoring, construction, and search methods are the main components that are exploited by de novo design programs to explore the chemical space to encourage the cost-effective design of new chemical entities. In particular, construction methods are concerned with providing strategies for compound generation to address issues such as drug-likeness and synthetic accessibility.
Reaction-based de novo design consists of combining building blocks according to transformation rules that are extracted from collections of known reactions, intending to restrict the enumerated chemical space into a manageable number of synthetically accessible structures. The reaction vector is an example of a representation that encodes topological changes occurring in reactions, which has been integrated within a structure generation algorithm to increase the chances of generating molecules that are synthesisable.
The general aim of this study was to enhance reaction-based de novo design by developing machine learning approaches that exploit publicly available data on reactions. A series of algorithms for reaction standardisation, fingerprinting, and reaction vector database validation were introduced and applied to generate new data on which the entirety of this work relies. First, these collections were applied to the validation of a new ligand-based design tool. The tool was then used in a case study to design compounds which were eventually synthesised using very similar procedures to those suggested by the structure generator.
A reaction classification model and a novel hierarchical labelling system were then developed to introduce the possibility of applying transformations by class. The model was augmented with an algorithm for confidence estimation, and was used to classify two datasets from industry and the literature. Results from the classification suggest that the model can be used effectively to gain insights on the nature of reaction collections.
Classified reactions were further processed to build a reaction class recommendation model capable of suggesting appropriate reaction classes to apply to molecules according to their fingerprints. The model was validated, then integrated within the reaction vector-based design framework, which was assessed on its performance against the baseline algorithm. Results from the de novo design experiments indicate that the use of the recommendation model leads to higher synthetic accessibility and more efficient management of computational resources.
Automatic learning for the classification of chemical reactions and in statistical thermodynamics
This thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds.
NMR-based classification of photochemical and enzymatic reactions.
Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. The development of such a representation can be applied in automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data.
A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave similar results. Supervised learning techniques improved on this: correct assignments rose to 77% when an ensemble of ten FFNNs was used and to 80% when Random Forests were used.
This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions.
Genome-scale classification of enzymatic reactions from their reaction equation.
The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP
descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken,
changed, and made during a chemical reaction.
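The difference-based definition above can be illustrated with a minimal Python sketch. It assumes that each bond has already been assigned to a winning SOM neuron and that a molecule's MOLMAP is a plain count of bonds per neuron; the function names, the 4-neuron map, and the bond assignments are all hypothetical, and the descriptor used in the study may weight or smooth the activations differently.

```python
from collections import Counter


def molmap(bond_neurons, n_neurons):
    """Count how many bonds of one molecule land on each SOM neuron.

    bond_neurons: list of neuron indices, one per bond (assumed given).
    """
    counts = Counter(bond_neurons)
    return [counts.get(i, 0) for i in range(n_neurons)]


def side_molmap(molecules, n_neurons):
    """Sum the MOLMAPs of all molecules on one side of a reaction."""
    total = [0] * n_neurons
    for bonds in molecules:
        for i, value in enumerate(molmap(bonds, n_neurons)):
            total[i] += value
    return total


def reaction_molmap(reactants, products, n_neurons):
    """Reaction descriptor = MOLMAP(products) - MOLMAP(reactants)."""
    r = side_molmap(reactants, n_neurons)
    p = side_molmap(products, n_neurons)
    return [pi - ri for pi, ri in zip(p, r)]


# Hypothetical toy reaction on a 4-neuron map.
reactants = [[0, 0, 1], [3]]   # two reactant molecules
products = [[0, 1, 1], [2]]    # two product molecules
descriptor = reaction_molmap(reactants, products, 4)
print(descriptor)  # [-1, 1, 1, -1]
```

Negative entries mark bond types that are lost in the reaction, positive entries mark bond types that are formed, and zeros mark bond types that are unchanged in number.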
The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer validation of classification systems, genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers that are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods
should be available for automatically comparing metabolic reactions and for automatically
assigning EC numbers to reactions that are not yet officially classified.
In this study, the genome-scale data set of enzymatic reactions available in the KEGG
database was encoded by the MOLMAP descriptors, and was submitted to Kohonen
SOMs to compare the resulting map with the official EC number classification, to explore
the possibility of predicting EC numbers from the reaction equation, and to assess the
internal consistency of the EC classification at the class level.
A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions.
RFs were also used to assign the four levels of the EC hierarchy from the reaction
equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases
(for independent test sets) at the class, subclass, sub-subclass and full EC number level, respectively. Experiments for the classification of reactions from the main reactants and products were also performed with RFs: EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively.
In the course of the experiments with metabolic reactions we suggested that the
MOLMAP / SOM concept could be extended to the representation of other levels of
metabolic information such as metabolic pathways. Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway - a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different
types of metabolism and pathways that do not share similarities in terms of EC numbers.
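The pathway descriptor described above can be sketched in Python as follows. The sketch assumes that each reaction of a pathway has already been mapped to a winning neuron on the reaction-level SOM. The Tanimoto coefficient used here to compare two pathways is an illustrative choice and is not stated in the abstract; the function names and neuron indices are likewise hypothetical.

```python
def pathway_descriptor(reaction_neurons, n_neurons):
    """Binary pattern of neurons activated by a pathway's reactions.

    reaction_neurons: winning neuron index of each reaction (assumed given).
    """
    active = set(reaction_neurons)
    return [1 if i in active else 0 for i in range(n_neurons)]


def tanimoto(a, b):
    """Tanimoto similarity of two binary descriptors (assumed metric)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


# Two hypothetical pathways on a 6-neuron reaction map.
pathway_a = pathway_descriptor([0, 2, 2, 5], 6)
pathway_b = pathway_descriptor([2, 3, 5], 6)
print(pathway_a)                       # [1, 0, 1, 0, 0, 1]
print(pathway_b)                       # [0, 0, 1, 1, 0, 1]
print(tanimoto(pathway_a, pathway_b))  # 0.5
```

Because the comparison uses only the activated neurons, two pathways can score as similar even when their reactions carry different EC numbers, provided the reactions fall in the same regions of the map.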
Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential
function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function.
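For reference, the reference potential in such an experiment can be sketched in Python. This assumes the standard 12-6 Lennard-Jones form, V(r) = 4ε[(σ/r)^12 - (σ/r)^6], in reduced units (ε = σ = 1); the abstract does not state which form or parameters were used. The sampling helper, its grid, and its name are hypothetical and only indicate how (distance, energy) training pairs for a network could be generated.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair potential (assumed form)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def sample_curve(r_min, r_max, n_points, epsilon=1.0, sigma=1.0):
    """Evenly spaced (r, V) pairs, e.g. as training data for a NN fit."""
    step = (r_max - r_min) / (n_points - 1)
    distances = [r_min + i * step for i in range(n_points)]
    return [(r, lennard_jones(r, epsilon, sigma)) for r in distances]


# The potential is zero at r = sigma and has its minimum, -epsilon,
# at r = 2**(1/6) * sigma.
r_minimum = 2.0 ** (1.0 / 6.0)
print(lennard_jones(1.0))
print(lennard_jones(r_minimum))

# Hypothetical training grid in reduced units.
training_data = sample_curve(0.9, 3.0, 50)
```

A network fitted to such pairs can then be checked against the analytical function at distances that were not part of the training grid.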
The results indicated that for LJ-type potentials, NNs can be trained to generate
accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better
results than single FFNNs. The NN models showed a remarkable ability to interpolate between distant curves and to accurately reproduce potentials for use in molecular simulations.
The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping
of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo,
the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes.
Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task.
The data consisted of energy values, from Density Functional Theory (DFT) calculations,
at different distances, for several molecular orientations and three electrode
adsorption sites. The results indicate that NNs require a data set large enough to cover
the diversity of possible interaction sites, distances, and orientations well. NNs trained with such data sets can perform as well as or even better than analytical functions.
Therefore, they can be used in molecular simulations, particularly for the ethanol/Au(111)
interface, which is the case studied in the present Thesis. Once properly trained,
the networks are able to produce, as output, any required number of energy points for
accurate interpolations.
Learning the Language of Chemical Reactions – Atom by Atom. Linguistics-Inspired Machine Learning Methods for Chemical Reaction Tasks
Over the last hundred years, not much has changed in how organic chemistry is conducted. In most laboratories, the current state is still trial-and-error experiments guided by human expertise acquired over decades. What if, given all the knowledge published, we could develop an artificial intelligence-based assistant to accelerate the discovery of novel molecules? Although many approaches were recently developed to generate novel molecules in silico, only a few studies complete the full design-make-test cycle, including the synthesis and the experimental assessment. One reason is that the synthesis part can be tedious and time-consuming, and requires years of experience to perform successfully. Hence, the synthesis is one of the critical limiting factors in molecular discovery.
In this thesis, I take advantage of similarities between human language and organic chemistry to apply linguistic methods to chemical reactions, and develop artificial intelligence-based tools for accelerating chemical synthesis. First, I investigate reaction prediction models focusing on small data sets of challenging stereo- and regioselective carbohydrate reactions. Second, I develop a multi-step synthesis planning tool predicting reactants and suitable reagents (e.g. catalysts and solvents). Both forward prediction and retrosynthesis approaches use black-box models. Hence, I then study methods to provide more information about the models’ predictions. I develop a reaction classification model that labels chemical reactions and facilitates the communication of reaction concepts. As a side product of the classification models, I obtain reaction fingerprints that enable efficient similarity searches in chemical reaction space. Moreover, I study approaches for predicting reaction yields. Lastly, having approached all chemical reaction tasks with atom-mapping-independent models, I demonstrate the generation of accurate atom-mapping from the patterns my models have learned while being trained self-supervised on chemical reactions.
My PhD thesis’s leitmotif is the application of the attention-based Transformer architecture to molecules and reactions represented with a text notation. It is as if atoms were my letters, molecules my words, and reactions my sentences. With this analogy, I teach my neural network models the language of chemical reactions - atom by atom. While exploring the link between organic chemistry and language, I make an essential step towards the automation of chemical synthesis, which could significantly reduce the costs and time required to discover and create new molecules and materials.
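The letters-words-sentences analogy can be made concrete with a small Python sketch that splits a reaction string into atom-level tokens. It assumes the text notation is SMILES, which the abstract does not name, and uses a simplified regular expression that covers only common organic-subset atoms, bracket atoms, bonds, ring closures, and the reaction arrow; the pattern, the function name, and the example reaction are illustrative and not the thesis's actual tokenizer.

```python
import re

# Simplified SMILES token pattern (assumption, not the thesis's tokenizer):
# bracket atoms, two-letter halogens, single-letter atoms, aromatic atoms,
# two-digit ring closures, single digits, then bond and structure symbols.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#+\-\\/.:~@*>]"
)


def tokenize(smiles):
    """Split a SMILES or reaction SMILES string into tokens."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens


# Hypothetical example: acetic acid + ethanol >> ethyl acetate.
reaction = "CC(=O)O.OCC>>CC(=O)OCC"
print(tokenize(reaction))
print(tokenize("[Na+].[Cl-]"))  # ['[Na+]', '.', '[Cl-]']
```

In this picture each token plays the role of a letter, each dot-separated molecule the role of a word, and the full reaction string the role of a sentence that a sequence model can read.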
- …