9 research outputs found
Generic Strategies for Chemical Space Exploration
Computational approaches to exploring "chemical universes", i.e., very large
sets, potentially infinite sets of compounds that can be constructed by a
prescribed collection of reaction mechanisms, in practice suffer from a
combinatorial explosion. It quickly becomes impossible to test, for all pairs
of compounds in a rapidly growing network, whether they can react with each
other. More sophisticated and efficient strategies are therefore required to
construct very large chemical reaction networks.
Undirected labeled graphs and graph rewriting are natural models of chemical
compounds and chemical reactions. Borrowing the idea of partial evaluation from
functional programming, we introduce partial applications of rewrite rules.
Binding substrate to rules increases the number of rules but drastically prunes
the substrate sets to which it might match, resulting in dramatically reduced
resource requirements. At the same time, exploration strategies can be guided,
e.g. based on restrictions on the product molecules to avoid the explicit
enumeration of very unlikely compounds. To this end we introduce here a generic
framework for the specification of exploration strategies in graph-rewriting
systems. Using key examples of complex chemical networks from sugar chemistry
and the realm of metabolic networks we demonstrate the feasibility of a
high-level strategy framework.
The ideas presented here can not only be used for a strategy-based chemical
space exploration that has close correspondence of experimental results, but
are much more general. In particular, the framework can be used to emulate
higher-level transformation models such as illustrated in a small puzzle game
Exploring the GDB-13 chemical space using deep generative models
Recent applications of recurrent neural networks (RNN) enable training models that sample the chemical space. In this study we train RNN with molecular string representations (SMILES) with a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained with 1 million structures (0.1% of the database) reproduces 68.9% of the entire database after training, when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound and thus we are able to quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space was performed, which shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample
Intrinsic and extrinsic thermodynamics for stochastic population processes with multi-level large-deviation structure
A set of core features is set forth as the essence of a thermodynamic
description, which derive from large-deviation properties in systems with
hierarchies of timescales, but which are \emph{not} dependent upon conservation
laws or microscopic reversibility in the substrate hosting the process. The
most fundamental elements are the concept of a macrostate in relation to the
large-deviation entropy, and the decomposition of contributions to
irreversibility among interacting subsystems, which is the origin of the
dependence on a concept of heat in both classical and stochastic
thermodynamics. A natural decomposition is shown to exist, into a relative
entropy and a housekeeping entropy rate, which define respectively the
\textit{intensive} thermodynamics of a system and an \textit{extensive}
thermodynamic vector embedding the system in its context. Both intensive and
extensive components are functions of Hartley information of the momentary
system stationary state, which is information \emph{about} the joint effect of
system processes on its contribution to irreversibility. Results are derived
for stochastic Chemical Reaction Networks, including a Legendre duality for the
housekeeping entropy rate to thermodynamically characterize fully-irreversible
processes on an equal footing with those at the opposite limit of
detailed-balance. The work is meant to encourage development of inherent
thermodynamic descriptions for rule-based systems and the living state, which
are not conceived as reductive explanations to heat flows
Analysis of Generative Chemistries
For the modelling of chemistry we use undirected, labelled graphs as explicit models of molecules and graph transformation rules for modelling generalised chemical reactions. This is used to define artificial chemistries on the level of individual bonds and atoms, where formal graph grammars implicitly represent large spaces of chemical compounds. We use a graph rewriting formalism, rooted in category theory, called the Double Pushout approach, which directly expresses the transition state of chemical reactions. Using concurrency theory for transformation rules, we define algorithms for the composition of rewrite rules in a chemically intuitive manner that enable automatic abstraction of the level of detail in chemical pathways. Based on this rule composition we define an algorithmic framework for generation of vast reaction networks for specific spaces of a given chemistry, while still maintaining the level of detail of the model down to the atomic level. The framework also allows for computation with graphs and graph grammars, which is utilised to model non-trivial chemical systems. The graph generation relies on graph isomorphism testing, and we review the general individualisation-refinement paradigm used in the state-of-the-art algorithms for graph canonicalisation, isomorphism testing, and automorphism discovery.
We present a model for chemical pathways based on a generalisation of network flows from ordinary directed graphs to directed hypergraphs. The model allows for reasoning about the flow of individual molecules in general pathways, and the introduction of chemically motivated routing constraints. It further provides the foundation for defining specialised pathway motifs, which is illustrated by defining necessary topological constraints for both catalytic and autocatalytic pathways. We also prove that central types of pathway questions are NP-complete, even for restricted classes of reaction networks. The complete pathway model, including constraints for catalytic and autocatalytic pathways, is implemented using integer linear programming. This implementation is used in a tree search method to enumerate both optimal and near-optimal pathway solutions.
The formal methods are applied to multiple chemical systems: the enzyme catalysed beta-lactamase reaction, variations of the glycolysis pathway, and the formose process. In each of these systems we use rule composition to abstract pathways and calculate traces for isotope labelled carbon atoms. The pathway model is used to automatically enumerate alternative non-oxidative glycolysis pathways, and enumerate thousands of candidates for autocatalytic pathways in the formose process