Keeping it Simple: Language Models can learn Complex Molecular Distributions
Deep generative models of molecules have grown immensely in popularity;
trained on relevant datasets, these models are used to search through chemical
space. The downstream utility of generative models for the inverse design of
novel functional compounds depends on their ability to learn a training
distribution of molecules. The simplest example is a language model that
takes the form of a recurrent neural network and generates molecules using a
string representation. More sophisticated are graph generative models, which
sequentially construct molecular graphs and typically achieve state-of-the-art
results. However, recent work has shown that language models are more capable
than once thought, particularly in the low data regime. In this work, we
investigate the capacity of simple language models to learn distributions of
molecules. For this purpose, we introduce several challenging generative
modeling tasks by compiling especially complex distributions of molecules. On
each task, we evaluate the ability of language models as compared with two
widely used graph generative models. The results demonstrate that language
models are powerful generative models, capable of adeptly learning complex
molecular distributions, and they yield better performance than the graph
models. Language models can accurately generate distributions of the
highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular
distributions, as well as the largest molecules in PubChem.
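
To make the kind of model discussed above concrete, the following Python/PyTorch
sketch shows a toy character-level recurrent language model trained by
next-character prediction on SMILES strings and then sampled autoregressively.
The molecule list, hyperparameters, and training loop are illustrative
placeholders, not the datasets or models evaluated in the paper.

    # Toy character-level SMILES language model (illustrative placeholder data).
    import torch
    import torch.nn as nn

    smiles = ["CCO", "c1ccccc1", "CC(=O)O"]                      # placeholder training set
    chars = sorted({c for s in smiles for c in s}) + ["^", "$"]  # "^" start, "$" end token
    stoi = {c: i for i, c in enumerate(chars)}
    itos = {i: c for c, i in stoi.items()}

    class SmilesRNN(nn.Module):
        def __init__(self, vocab, emb=32, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.rnn = nn.GRU(emb, hidden, batch_first=True)
            self.head = nn.Linear(hidden, vocab)

        def forward(self, x, h=None):
            z, h = self.rnn(self.embed(x), h)
            return self.head(z), h

    model = SmilesRNN(len(chars))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):                                      # toy training loop
        for s in smiles:
            seq = torch.tensor([[stoi[c] for c in "^" + s + "$"]])
            logits, _ = model(seq[:, :-1])                       # predict the next character
            loss = loss_fn(logits.squeeze(0), seq[0, 1:])
            opt.zero_grad(); loss.backward(); opt.step()

    x, h, out = torch.tensor([[stoi["^"]]]), None, []            # sample a new string
    for _ in range(80):
        logits, h = model(x, h)
        x = torch.multinomial(torch.softmax(logits[0, -1], -1), 1).view(1, 1)
        if itos[x.item()] == "$":
            break
        out.append(itos[x.item()])
    print("".join(out))

A graph generative model, by contrast, would construct the molecular graph node
by node and edge by edge rather than emitting a flat character sequence.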
Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files
Language models are powerful tools for molecular design. Currently, the
dominant paradigm is to parse molecular graphs into linear string
representations on which models can easily be trained. This approach has been
very successful; however, it is limited to chemical structures that can be
completely represented by a graph, such as organic molecules, while materials
and biomolecular structures like protein binding sites require a more complete
representation that includes the relative positioning of their atoms in space.
In this work, we show that language models, without any architecture
modifications and trained using next-token prediction, can generate novel and
valid structures in three dimensions from substantially different
distributions of chemical structures. In particular, we demonstrate that
language models trained on sequences derived directly from chemical
file formats like XYZ files, Crystallographic Information files (CIFs), or
Protein Data Bank files (PDBs) can directly generate molecules, crystals, and
protein binding sites in three dimensions. Furthermore, despite being trained
on chemical file sequences, language models still achieve performance
comparable to state-of-the-art models that use graph and graph-derived string
representations, as well as other domain-specific 3D generative models. In
doing so, we demonstrate that it is not necessary to use simplified molecular
representations to train chemical language models: they are powerful
generative models capable of directly exploring chemical space in three
dimensions across very different structures.
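
As a rough illustration of the idea of training directly on chemical file
formats, the sketch below shows one plausible way to turn an XYZ-format
structure into a flat token sequence for ordinary next-token prediction. The
tokenizer, special tokens, and example molecule are assumptions made for
illustration and are not the tokenization scheme used in the paper.

    # Serialize an XYZ file into tokens a causal language model could be trained on.
    xyz = """3
    water
    O 0.000 0.000 0.117
    H 0.000 0.757 -0.471
    H 0.000 -0.757 -0.471"""

    def tokenize_xyz(text):
        tokens = []
        for line in text.splitlines():
            tokens.extend(line.split())      # element symbols and coordinate strings
            tokens.append("<eol>")           # keep the file's line structure explicit
        tokens.append("<eos>")
        return tokens

    tokens = tokenize_xyz(xyz)
    vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]
    print(tokens[:8])
    print(ids[:8])
    # A standard causal language model (e.g. a Transformer decoder) can then be
    # trained on such id sequences with next-token prediction and sampled to
    # produce new XYZ-style structures.

Training on CIF or PDB content would follow the same pattern, only with a
different vocabulary of tokens.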
Learning Interpretable Representations of Entanglement in Quantum Optics Experiments using Deep Generative Models
Quantum physics experiments produce interesting phenomena such as
interference or entanglement, which are core properties of numerous future
quantum technologies. The complex relationship between the setup structure of a
quantum experiment and its entanglement properties is essential to fundamental
research in quantum optics but is difficult to intuitively understand. We
present a deep generative model of quantum optics experiments where a
variational autoencoder is trained on a dataset of quantum optics experimental
setups. In a series of computational experiments, we investigate the learned
representation of our Quantum Optics Variational Auto Encoder (QOVAE) and its
internal understanding of the quantum optics world. We demonstrate that the
QOVAE learns an interpretable representation of quantum optics experiments and
the relationship between experiment structure and entanglement. We show the
QOVAE is able to generate novel experiments for highly entangled quantum states
with specific distributions that match its training data. The QOVAE can learn
to generate specific entangled states and efficiently search the space of
experiments that produce highly entangled quantum states. Importantly, we are
able to interpret how the QOVAE structures its latent space, finding curious
patterns that we can explain in terms of quantum physics. The results
demonstrate how we can use and understand the internal representations of deep
generative models in a complex scientific domain. The QOVAE and the insights
from our investigations can be immediately applied to other physical systems.
Comment: Published in Nature Machine Intelligence,
https://doi.org/10.1038/s42256-022-00493-
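
For readers unfamiliar with the model class, the sketch below is a minimal
variational autoencoder in PyTorch over fixed-length vectors that stand in for
encoded experiment setups. The encoding, layer sizes, and loss weighting are
invented placeholders rather than the published QOVAE architecture.

    # Minimal VAE sketch: encode, reparameterize, decode, and compute the ELBO loss.
    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, dim=64, latent=8):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU())
            self.mu = nn.Linear(128, latent)
            self.logvar = nn.Linear(128, latent)
            self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
            return self.dec(z), mu, logvar

    def elbo_loss(x, recon, mu, logvar):
        rec = nn.functional.mse_loss(recon, x, reduction="sum")       # reconstruction term
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) # KL to the unit-Gaussian prior
        return rec + kld

    model = VAE()
    x = torch.randn(16, 64)          # placeholder batch of encoded experiment setups
    recon, mu, logvar = model(x)
    print(elbo_loss(x, recon, mu, logvar).item())

Searching for setups that produce highly entangled states would then amount to
exploring the learned latent space and decoding candidate experiments.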