SELFIES and the future of molecular string representations
Artificial intelligence (AI) and machine learning (ML) are expanding in
popularity for broad applications to challenging tasks in chemistry and
materials science. Examples include the prediction of properties, the discovery
of new reaction pathways, or the design of new molecules. The machine needs to
read and write fluently in a chemical language for each of these tasks. Strings
are a common tool to represent molecular graphs, and the most popular molecular
string representation, SMILES, has powered cheminformatics since the late
1980s. However, in the context of AI and ML in chemistry, SMILES has several
shortcomings -- most pertinently, most combinations of symbols lead to invalid
results with no valid chemical interpretation. To overcome this issue, a new
language for molecules was introduced in 2020 that guarantees 100% robustness:
SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and
enabled numerous new applications in chemistry. In this manuscript, we look to
the future and discuss molecular string representations, along with their
respective opportunities and challenges. We propose 16 concrete Future Projects
for robust molecular representations. These involve the extension toward new
chemical domains, exciting questions at the interface of AI and robust
languages and interpretability for both humans and machines. We hope that these
proposals will inspire several follow-up works exploiting the full potential of
molecular string representations for the future of AI in chemistry and
materials science.
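The robustness guarantee can be illustrated with a toy decoder (a simplified sketch of the idea, not the real SELFIES derivation rules; the actual implementation is the open-source `selfies` Python package): each token requests a bond order, and the decoder clips the request to the remaining valence of the growing chain, so every token sequence decodes to something chemically valid.

```python
# Minimal sketch of the core idea behind SELFIES robustness (not the real
# grammar): every token carries a *requested* bond order, and the decoder
# clips that request to the remaining valence of the current atom, so no
# token sequence can produce an invalid molecule.

VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # maximum bonds per element

def decode(tokens):
    """Decode a token list like [("C", 1), ("O", 2)] into (atom, bond) pairs.

    Each token asks to attach `atom` to the previously placed atom with a
    bond of order `order`; the request is reduced to whatever the chain can
    still accommodate, which is what makes decoding total.
    """
    chain = []   # list of (element, bond_order_to_previous_atom)
    free = 0     # unused valence on the most recently placed atom
    for atom, order in tokens:
        if not chain:                 # first atom starts the chain
            chain.append((atom, 0))
            free = VALENCE[atom]
            continue
        if free == 0:                 # previous atom saturated: chain ends
            break
        bond = min(order, free, VALENCE[atom])  # clip the requested order
        chain.append((atom, bond))
        free = VALENCE[atom] - bond
    return chain
```

In this toy version, `[("C", 1), ("O", 3)]` decodes with the oxygen's requested triple bond clipped to a double bond, since oxygen's valence is 2; invalid outputs are impossible by construction.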
Exploring the GDB-13 chemical space using deep generative models
Recent applications of recurrent neural networks (RNNs) enable training models that sample the chemical space. In this study we train an RNN on molecular string representations (SMILES) using a subset of the enumerated database GDB-13 (975 million molecules). We show that a model trained on 1 million structures (0.1% of the database) reproduces 68.9% of the entire database when sampling 2 billion molecules. We also developed a method to assess the quality of the training process using negative log-likelihood plots. Furthermore, we use a mathematical model based on the “coupon collector problem” that compares the trained model to an upper bound, which allows us to quantify how much it has learned. We also suggest that this method can be used as a tool to benchmark the learning capabilities of any molecular generative model architecture. Additionally, an analysis of the generated chemical space shows that, mostly due to the syntax of SMILES, complex molecules with many rings and heteroatoms are more difficult to sample.
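The coupon-collector comparison the abstract mentions can be sketched with elementary probability (our simplification, not necessarily the paper's exact model): a perfect generator sampling uniformly from a space of N molecules covers an expected fraction 1 - (1 - 1/N)^k of it after k samples.

```python
# Back-of-the-envelope coupon-collector bound (our simplification): the
# expected fraction of an N-molecule space covered by k uniform samples.

import math

def expected_coverage(n_space: int, n_samples: int) -> float:
    """Expected fraction of an n_space-item set hit by n_samples uniform draws."""
    # exp/log1p keeps the computation numerically stable when n_space is huge.
    return 1.0 - math.exp(n_samples * math.log1p(-1.0 / n_space))

# GDB-13 has ~975 million molecules; the study samples 2 billion.
ideal = expected_coverage(975_000_000, 2_000_000_000)
print(f"an ideal uniform sampler would cover {ideal:.1%} of GDB-13")
```

Under this simplification an ideal uniform sampler would cover roughly 87% of GDB-13 with 2 billion samples, which puts the reported 68.9% coverage in context as a fraction of the theoretical ceiling.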
Group SELFIES: A Robust Fragment-Based Molecular String Representation
We introduce Group SELFIES, a molecular string representation that leverages
group tokens to represent functional groups or entire substructures while
maintaining chemical robustness guarantees. Molecular string representations,
such as SMILES and SELFIES, serve as the basis for molecular generation and
optimization in chemical language models, deep generative models, and
evolutionary methods. While SMILES and SELFIES leverage atomic representations,
Group SELFIES builds on top of the chemical robustness guarantees of SELFIES by
enabling group tokens, thereby adding flexibility to the
representation. Moreover, the group tokens in Group SELFIES can take advantage
of inductive biases of molecular fragments that capture meaningful chemical
motifs. The advantages of capturing chemical motifs and flexibility are
demonstrated in our experiments, which show that Group SELFIES improves
distribution learning of common molecular datasets. Further experiments also
show that random sampling of Group SELFIES strings improves the quality of
generated molecules compared to regular SELFIES strings. Our open-source
implementation of Group SELFIES is available online, which we hope will aid
future research in molecular generation and optimization.
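The group-token idea can be sketched in a few lines (a hypothetical toy alphabet, not the actual Group SELFIES grammar or token set): the sampling alphabet mixes single-atom tokens with group tokens that expand into whole pre-validated fragments, so a random sampler emits complete chemical motifs in one step.

```python
import random

# Hypothetical toy alphabet: plain atom tokens plus "group" tokens that
# expand into whole pre-validated fragments (crude stand-ins for a phenyl
# ring and a carboxyl group), mimicking how Group SELFIES lets a sampler
# emit an entire chemical motif as a single token.
ATOMS  = ["[C]", "[N]", "[O]"]
GROUPS = {
    "[:phenyl]":   ["[C]", "[=C]", "[C]", "[=C]", "[C]", "[=C]", "[Ring1]"],
    "[:carboxyl]": ["[C]", "[=O]", "[O]"],
}

def sample_tokens(length: int, rng: random.Random) -> list:
    """Sample a token string, expanding any group token into its fragment."""
    out = []
    alphabet = ATOMS + list(GROUPS)
    while len(out) < length:
        tok = rng.choice(alphabet)
        out.extend(GROUPS.get(tok, [tok]))  # group tokens expand atomically
    return out
```

Because each group expands into a fragment that is valid as a unit, random sampling over this alphabet is biased toward chemically meaningful motifs, which is the inductive-bias advantage the abstract describes.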
Matrix of orthogonalized atomic orbital coefficients representation for radicals and ions
Chemical (molecular, quantum) machine learning relies on representing molecules in unique and informative ways. Here, we present the matrix of orthogonalized atomic orbital coefficients (MAOC) as a quantum-inspired molecular and atomic representation containing both structural (composition and geometry) and electronic (charge and spin multiplicity) information. MAOC is based on a cost-effective localization scheme that represents localized orbitals via a predefined set of atomic orbitals. The latter can be constructed from small atom-centered basis sets such as pcseg-0 and STO-3G in conjunction with a guess (non-optimized) electronic configuration of the molecule. Importantly, MAOC is suitable for representing monatomic, molecular, and periodic systems and can distinguish compounds with identical compositions and geometries but distinct charges and spin multiplicities. Using principal component analysis, we constructed a more compact but equally powerful version of MAOC, PCX-MAOC. To test the performance of full and reduced MAOC and several other representations (CM, SOAP, SLATM, and SPAHM), we used a kernel ridge regression machine learning model to predict frontier molecular orbital energy levels and ground-state single-point energies for chemically diverse neutral and charged, closed- and open-shell molecules from an extended QM7b dataset, as well as two new datasets, N-HPC-1 (N-heteropolycycles) and REDOX (nitroxyl and phenoxyl radicals, carbonyl, and cyano compounds). MAOC affords accuracy that is similar or superior to that of other representations for a range of chemical properties and systems.
HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations
Publicly available collections of drug-like molecules have grown to comprise
tens of billions of possibilities in recent history due to advances in chemical
synthesis. Traditional methods for identifying "hit" molecules from a large
collection of potential drug-like candidates have relied on biophysical theory
to compute approximations to the Gibbs free energy of the binding interaction
between a drug and its protein target. A major drawback of these approaches is
that they require exceptional computing capabilities to screen even relatively
small collections of molecules.
Hyperdimensional Computing (HDC) is a recently proposed learning paradigm
that leverages low-precision binary vector arithmetic to build efficient
representations of data without the gradient-based optimization required by
many conventional machine learning and deep learning methods. This algorithmic
simplicity allows for hardware acceleration, as previously demonstrated for a
range of application areas. We consider existing HDC approaches for molecular
property classification and introduce two novel encoding algorithms that
leverage the extended connectivity fingerprint (ECFP) algorithm.
We show that HDC-based inference methods are as much as 90 times more
efficient than more complex representative machine learning methods and achieve
an acceleration of nearly 9 orders of magnitude as compared to inference with
molecular docking. We demonstrate multiple approaches for the encoding of
molecular data for HDC and examine their relative performance on a range of
challenging molecular property prediction and drug-protein binding
classification tasks. Our work thus motivates further investigation into
molecular representation learning to develop ultra-efficient pre-screening
tools.
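The binary, gradient-free flavor of HDC can be sketched as follows (our toy construction, not HD-Bind's actual encoders): assign each fingerprint bit position a random binary hypervector, bundle the hypervectors of a molecule's set bits by majority vote, and compare molecules by Hamming distance.

```python
import random

# Toy HDC encoder (not HD-Bind's algorithm): each fingerprint bit position
# gets a random D-bit hypervector; a molecule is the majority-vote bundle of
# the hypervectors of its "on" bits; similarity is Hamming distance.

D = 1024                                   # hypervector dimensionality
rng = random.Random(42)                    # fixed seed for reproducibility

def random_hv():
    return [rng.randrange(2) for _ in range(D)]

# One codebook hypervector per fingerprint position (e.g. an ECFP bit index).
CODEBOOK = [random_hv() for _ in range(64)]

def encode(on_bits):
    """Bundle the hypervectors of the set fingerprint bits by majority vote."""
    sums = [0] * D
    for b in on_bits:
        for i, v in enumerate(CODEBOOK[b]):
            sums[i] += v
    half = len(on_bits) / 2
    return [1 if s > half else 0 for s in sums]

def hamming(a, b):
    """Bit-level distance between two hypervectors."""
    return sum(x != y for x, y in zip(a, b))
```

Molecules sharing most fingerprint bits end up with nearby hypervectors, and everything here is bit arithmetic plus counting, which is what makes HDC-style inference amenable to the hardware acceleration the abstract describes.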