Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space
Challenges in natural sciences can often be phrased as optimization problems.
Machine learning techniques have recently been applied to solve such problems.
One example in chemistry is the design of tailor-made organic materials and
molecules, which requires efficient methods to explore the chemical space. We
present a genetic algorithm (GA) that is enhanced with a deep neural network
(DNN)-based discriminator model to improve the diversity of generated molecules and
at the same time steer the GA. We show that our algorithm outperforms other
generative models in optimization tasks. We furthermore present a way to
increase interpretability of genetic algorithms, which helped us to derive
design principles.
Comment: 9+3 pages, 7+4 figures, 2 tables. Comments are welcome! (code is
available at: https://github.com/aspuru-guzik-group/GA)
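The idea in this abstract, a GA whose selection is steered by a second model that discourages stagnation, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the string alphabet, the hypothetical TARGET string that stands in for a property optimum, the weight beta, and in particular discriminator_score, which is a simple duplicate-counting stand-in rather than a neural network.

```python
import random

ALPHABET = "CNOF"   # toy alphabet (assumption)
TARGET = "CCOCCN"   # hypothetical optimum standing in for a molecular property


def property_score(candidate):
    """Toy objective: number of positions that match TARGET."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def discriminator_score(candidate, population):
    """Stand-in for a DNN discriminator (assumption): the fraction of the
    population that is identical to the candidate."""
    return population.count(candidate) / len(population)


def fitness(candidate, population, beta=2.0):
    """Objective minus a penalty for over-represented candidates."""
    return property_score(candidate) - beta * discriminator_score(candidate, population)


def mutate(candidate, rng):
    """Replace one randomly chosen character."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(generations=50, size=30, seed=0):
    """Run the toy GA and return (best candidate, final population)."""
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, population), reverse=True)
        survivors = ranked[: size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in range(size - len(survivors))]
        population = survivors + children
    return max(population, key=property_score), population


if __name__ == "__main__":
    best, final_population = evolve()
    print(best, property_score(best), len(set(final_population)))
```

The penalty term is the only place where the second model enters, so a learned discriminator could in principle be swapped in by replacing discriminator_score alone.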
Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation
The discovery of novel materials and functional molecules can help to solve
some of society's most urgent challenges, ranging from efficient energy
harvesting and storage to uncovering novel pharmaceutical drug candidates.
Traditionally, matter engineering (generally denoted as inverse design) was
based largely on human intuition and high-throughput virtual screening. The
last few years have seen the emergence of significant interest in
computer-inspired designs based on evolutionary or deep learning methods. The
major challenge here is that the standard string-based molecular representation,
SMILES, shows substantial weaknesses in that task because large fractions of
strings do not correspond to valid molecules. Here, we solve this problem at a
fundamental level and introduce SELFIES (SELF-referencIng Embedded Strings), a
string-based representation of molecules that is 100% robust. Every SELFIES
string corresponds to a valid molecule, and SELFIES can represent every
molecule. SELFIES can be directly applied in arbitrary machine learning models
without the adaptation of the models; each of the generated molecule candidates
is valid. In our experiments, the model's internal memory stores two orders of
magnitude more diverse molecules than a similar test with SMILES. Furthermore,
as all molecules are valid, it allows for explanation and interpretation of the
internal working of the generative models.
Comment: 6+3 pages, 6+1 figures
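The robustness claim above (every string decodes to a valid molecule) can be made concrete with a toy decoder. This is a sketch under stated assumptions and not the SELFIES grammar itself: the four-atom alphabet, the valence table, and the restriction to unbranched chains are all simplifications chosen for illustration. The shared idea is that each token is interpreted relative to the valence still available, so a requested bond is lowered or decoding stops instead of producing an invalid structure.

```python
import random

# Toy valences and alphabet (assumptions; far smaller than any real alphabet).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_SYMBOL = {0: "", 1: "", 2: "=", 3: "#"}
ALPHABET = [(order, atom) for order in (1, 2, 3) for atom in VALENCE]


def decode(tokens):
    """Decode (requested_bond_order, atom) tokens into a linear chain.

    The requested bond order is lowered to what both the previous atom and the
    new atom can still afford. If the previous atom has no free valence left,
    decoding stops and the remaining tokens are ignored.
    """
    chain = []
    free = None  # free valence on the last atom; None before the first atom
    for order, atom in tokens:
        if free is None:
            chain.append((0, atom))
            free = VALENCE[atom]
            continue
        if free == 0:
            break
        actual = min(order, free, VALENCE[atom])
        chain.append((actual, atom))
        free = VALENCE[atom] - actual
    return chain


def to_smiles(chain):
    """Render the chain as a linear SMILES-like string."""
    return "".join(BOND_SYMBOL[order] + atom for order, atom in chain)


def is_valid(chain):
    """Check that no atom exceeds its valence (incoming plus outgoing bonds)."""
    for i, (incoming, atom) in enumerate(chain):
        outgoing = chain[i + 1][0] if i + 1 < len(chain) else 0
        if incoming + outgoing > VALENCE[atom]:
            return False
    return True


if __name__ == "__main__":
    # A triple bond to oxygen is requested but lowered to a double bond,
    # and the trailing fluorine is dropped because oxygen is saturated.
    print(to_smiles(decode([(1, "C"), (3, "O"), (1, "F")])))  # -> C=O
    rng = random.Random(0)
    tokens = [rng.choice(ALPHABET) for _ in range(8)]
    print(to_smiles(decode(tokens)), is_valid(decode(tokens)))
```

Because validity is enforced inside decode, any random or mutated token sequence yields a chain that passes is_valid, which is the property that lets generative models operate on such strings without adaptation.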
Recent advances in the Self-Referencing Embedding Strings (SELFIES) library
String-based molecular representations play a crucial role in cheminformatics
applications, and with the growing success of deep learning in chemistry, have
been readily adopted into machine learning pipelines. However, traditional
string-based representations such as SMILES are often prone to syntactic and
semantic errors when produced by generative models. To address these problems,
a novel representation, SELF-referencIng Embedded Strings (SELFIES), was
proposed that is inherently 100% robust, alongside an accompanying open-source
implementation. Since then, we have generalized SELFIES to support a wider
range of molecules and semantic constraints and streamlined its underlying
grammar. We have implemented this updated representation in subsequent versions
of the selfies library, where we have also made major advances with respect to design,
efficiency, and supported features. Hence, we present the current status of
the selfies library (version 2.1.1) in this manuscript.
Comment: 11 pages, 2 figures
Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design
The efficient exploration of chemical space to design molecules with intended
properties enables the accelerated discovery of drugs, materials, and
catalysts, and is one of the most important outstanding challenges in
chemistry. Encouraged by the recent surge in computer power and artificial
intelligence development, many algorithms have been developed to tackle this
problem. However, despite the emergence of many new approaches in recent years,
comparatively little progress has been made in developing realistic benchmarks
that reflect the complexity of molecular design for real-world applications. In
this work, we develop a set of practical benchmark tasks relying on physical
simulation of molecular systems mimicking real-life molecular design problems
for materials, drugs, and chemical reactions. Additionally, we demonstrate the
utility and ease of use of our new benchmark set by showing how to
compare the performance of several well-established families of algorithms.
Surprisingly, we find that model performance can strongly depend on the
benchmark domain. We believe that our benchmark suite will help move the field
towards more realistic molecular design benchmarks, and move the development of
inverse molecular design algorithms closer to designing molecules that solve
existing problems in both academia and industry.
Comment: 29+21 pages, 6+19 figures, 6+2 tables
On scientific understanding with artificial intelligence
Imagine an oracle that correctly predicts the outcome of every particle
physics experiment, the products of every chemical reaction, or the function of
every protein. Such an oracle would revolutionize science and technology as we
know them. However, as scientists, we would not be satisfied with the oracle
itself. We want more. We want to comprehend how the oracle conceived these
predictions. This feat, denoted as scientific understanding, has frequently
been recognized as the essential aim of science. Now, the ever-growing power of
computers and artificial intelligence poses one ultimate question: How can
advanced artificial systems contribute to scientific understanding or achieve
it autonomously?
We are convinced that this is not a mere technical question but lies at the
core of science. Therefore, here we set out to answer where we are and where we
can go from here. We first seek advice from the philosophy of science to
understand scientific understanding. Then we review the current state of the
art, both from literature and by collecting dozens of anecdotes from scientists
about how they acquired new conceptual understanding with the help of
computers. Those combined insights help us to define three dimensions of
android-assisted scientific understanding: The android as a I) computational
microscope, II) resource of inspiration and the ultimate, not yet existent III)
agent of understanding. For each dimension, we explain new avenues to push
beyond the status quo and unleash the full power of artificial intelligence's
contribution to the central aim of science. We hope our perspective inspires
and focuses research towards androids that get new scientific understanding and
ultimately bring us closer to true artificial scientists.
Comment: 13 pages, 3 figures, comments welcome
SELFIES and the future of molecular string representations
Artificial intelligence (AI) and machine learning (ML) are expanding in
popularity for broad applications to challenging tasks in chemistry and
materials science. Examples include the prediction of properties, the discovery
of new reaction pathways, or the design of new molecules. The machine needs to
read and write fluently in a chemical language for each of these tasks. Strings
are a common tool to represent molecular graphs, and the most popular molecular
string representation, SMILES, has powered cheminformatics since the late
1980s. However, in the context of AI and ML in chemistry, SMILES has several
shortcomings -- most pertinently, most combinations of symbols lead to invalid
results with no valid chemical interpretation. To overcome this issue, a new
language for molecules was introduced in 2020 that guarantees 100% robustness:
SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and
enabled numerous new applications in chemistry. In this manuscript, we look to
the future and discuss molecular string representations, along with their
respective opportunities and challenges. We propose 16 concrete Future Projects
for robust molecular representations. These involve the extension toward new
chemical domains, exciting questions at the interface of AI and robust
languages and interpretability for both humans and machines. We hope that these
proposals will inspire several follow-up works exploiting the full potential of
molecular string representations for the future of AI in chemistry and
materials science.
Comment: 34 pages, 15 figures, comments and suggestions for additional
references are welcome
SELFIES and the future of molecular string representations
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.