24 research outputs found

SELFIES and the future of molecular string representations

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
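
The abstract's point that most symbol combinations are not valid SMILES can be illustrated with a minimal sketch. The checker below is a hypothetical toy, not a real SMILES parser and not the SELFIES algorithm: it only tests that branch parentheses are balanced and that single-digit ring-closure labels are paired, and it ignores valence, aromaticity, bracket atoms, and two-digit ring labels. The function names, the alphabet, and the sampling parameters are assumptions chosen for illustration.

```python
import random


def is_plausible_smiles(s: str) -> bool:
    """Necessary (not sufficient) syntax check for a SMILES-like string."""
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits opened but not yet closed
    seen_atom = False    # branches and ring labels need a preceding atom
    for ch in s:
        if ch == "(":
            if not seen_atom:
                return False
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            if not seen_atom:
                return False
            open_rings ^= {ch}  # first use opens the ring, second closes it
        elif ch.isalpha():
            seen_atom = True
        # bond symbols such as "=" and "#" are ignored in this toy check
    return seen_atom and depth == 0 and not open_rings


def valid_fraction(n_samples: int = 1000, length: int = 8, seed: int = 0) -> float:
    """Fraction of random strings over a small alphabet that pass the check."""
    rng = random.Random(seed)
    alphabet = "CNO()=12"  # assumed toy alphabet
    hits = 0
    for _ in range(n_samples):
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        hits += is_plausible_smiles(candidate)
    return hits / n_samples


if __name__ == "__main__":
    print(is_plausible_smiles("C1CCCCC1"))  # cyclohexane: ring label paired
    print(is_plausible_smiles("C1CCC"))     # ring label never closed
    print(valid_fraction())
```

Even this lenient check rejects many random strings; a full parser with valence rules would reject more. A representation in which every symbol sequence decodes to some molecule avoids this failure mode by construction.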

KITopen

arXiv.org e-Print Archive

MPG.PuRe

SELFIES and the future of molecular string representations

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencing embedded strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.

arXiv.org e-Print Archive

VU Research Portal

Proceedings - University of Groningen

KITopen

ARTS repository - University of Groningen

PubMed Central

eScholarship - University of California

MPG.PuRe

Dissertations of the University of Groningen

DARLING: Deep leARning for chemicaL Information processinG

Author: Kohulan Rajan
Publication date: 01/01/2021

Vast quantities of scientific information are hidden in primary scientific publications and are not available as curated data in scientific databases. Making such information publicly available to support open science and open innovation is a challenge that has to be solved. In this dissertation, state-of-the-art deep learning models for optical chemical structure recognition and chemical information processing have been implemented to rediscover this information and retrieve it automatically.

Digitale Bibliothek Thüringen

Recent advancements in DECIMER.ai and automated mining of chemical literature for COCONUT

Author: Kohulan Rajan
Publication date: 05/09/2023

This presentation highlights the latest developments in DECIMER.ai and its automated mining of chemical literature for COCONUT. It was presented at two events: the ACS Spring Conference 2023 in San Francisco, USA, and the 6th Artificial Intelligence in Chemistry Symposium in Cambridge, UK.

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

RanDepict

Authors: Henning Otto Brinkhaus, Kohulan Rajan
Publication date: 22/09/2023

1.3.0 (2023-09-22). Features: automated_pypi_releases (deeaa91). If you use this software, please cite it as below.

ZENODO

STOUT: SMILES to IUPAC Names Using Neural Machine Translation

Authors: Achim Zielesny, Christoph Steinbeck, Kohulan Rajan
Publication date: 23/03/2021

Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this rule set, a correct chemical name assignment remains challenging for human beings, and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e., predicting the SMILES string from the IUPAC name. The open system demonstrates a test accuracy of about 90% correct predictions; incorrect predictions also show a remarkable similarity between true and predicted compounds.

ChemRxiv

DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

Authors: Achim Zielesny, Christoph Steinbeck, Kohulan Rajan
Publication date: 23/07/2021

The amount of data available on chemical structures and their properties has increased steadily over the past decades. In particular, articles published before the mid-1990s are available only in printed or scanned form. The extraction and storage of data from those articles in a publicly accessible database are desirable, but doing this manually is a slow and error-prone process. In order to extract chemical structure depictions and convert them into a computer-readable format, optical chemical structure recognition (OCSR) tools were developed; the best-performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods to provide an automated open-source software solution. Various current deep learning approaches were explored to seek a best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of being able to predict SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50-100 million molecules. In this article, the new DECIMER model is presented, a transformer-based network, which can predict SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and above 89% accuracy for depictions with stereochemical information.

ChemRxiv

Performance of chemical structure string representations for chemical image recognition using transformers

Authors: Achim Zielesny, Christoph Steinbeck, Kohulan Rajan
Publication date: 22/10/2021

The use of molecular string representations for deep learning in chemistry has been steadily increasing in recent years. The complexity of existing string representations, and the difficulty in creating meaningful tokens from them, led to the development of new string representations for chemical structures. In this study, the translation of chemical structure depictions in the form of bitmap images to corresponding molecular string representations was examined. An analysis of the recently developed DeepSMILES and SELFIES representations in comparison with the most commonly used SMILES representation is presented, in which the ability to translate image features into string representations with transformer models was specifically tested. The SMILES representation exhibits the best overall performance, whereas SELFIES guarantees valid chemical structures. DeepSMILES performs in between SMILES and SELFIES; InChIs are not appropriate for the learning task. All investigations were carried out with publicly available datasets, and the code used to train and evaluate the models has been made available to the public.

ChemRxiv

DECIMER - Towards Deep Learning for Chemical Image Recognition

Authors: Achim Zielesny, Christoph Steinbeck, Kohulan Rajan
Publication date: 20/08/2020

The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of DECIMER (Deep lEarning for Chemical ImagE Recognition), a deep learning method based on existing show-and-tell deep neural networks which makes very few assumptions about the structure of the underlying problem. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are clearly superior to SMILES, and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve >90% accuracy with about 60 to 100 million training structures, so that training can be completed within several months on a single GPU. This work is completely based on open-source software and open data and is available to the general public for any purpose.

ChemRxiv