Keeping it Simple: Language Models can learn Complex Molecular Distributions
Deep generative models of molecules have grown immensely in popularity;
trained on relevant datasets, these models are used to search through chemical
space. The downstream utility of generative models for the inverse design of
novel functional compounds depends on their ability to learn a training
distribution of molecules. The simplest example is a language model that
takes the form of a recurrent neural network and generates molecules using a
string representation. More sophisticated are graph generative models, which
sequentially construct molecular graphs and typically achieve state-of-the-art
results. However, recent work has shown that language models are more capable
than once thought, particularly in the low data regime. In this work, we
investigate the capacity of simple language models to learn distributions of
molecules. For this purpose, we introduce several challenging generative
modeling tasks by compiling especially complex distributions of molecules. On
each task, we evaluate the ability of language models as compared with two
widely used graph generative models. The results demonstrate that language
models are powerful generative models, capable of adeptly learning complex
molecular distributions, and they yield better performance than the graph
models. Language models can accurately generate distributions of the
highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular
distributions, as well as the largest molecules in PubChem.
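
To make the kind of model discussed above concrete, the following Python/PyTorch
sketch shows a toy character-level recurrent language model trained by
next-character prediction on SMILES strings and then sampled autoregressively.
The molecule list, hyperparameters, and training loop are illustrative
placeholders, not the datasets or models evaluated in the paper.

    # Toy character-level SMILES language model (illustrative placeholder data).
    import torch
    import torch.nn as nn

    smiles = ["CCO", "c1ccccc1", "CC(=O)O"]                      # placeholder training set
    chars = sorted({c for s in smiles for c in s}) + ["^", "$"]  # "^" start, "$" end token
    stoi = {c: i for i, c in enumerate(chars)}
    itos = {i: c for c, i in stoi.items()}

    class SmilesRNN(nn.Module):
        def __init__(self, vocab, emb=32, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.rnn = nn.GRU(emb, hidden, batch_first=True)
            self.head = nn.Linear(hidden, vocab)

        def forward(self, x, h=None):
            z, h = self.rnn(self.embed(x), h)
            return self.head(z), h

    model = SmilesRNN(len(chars))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):                                      # toy training loop
        for s in smiles:
            seq = torch.tensor([[stoi[c] for c in "^" + s + "$"]])
            logits, _ = model(seq[:, :-1])                       # predict the next character
            loss = loss_fn(logits.squeeze(0), seq[0, 1:])
            opt.zero_grad(); loss.backward(); opt.step()

    x, h, out = torch.tensor([[stoi["^"]]]), None, []            # sample a new string
    for _ in range(80):
        logits, h = model(x, h)
        x = torch.multinomial(torch.softmax(logits[0, -1], -1), 1).view(1, 1)
        if itos[x.item()] == "$":
            break
        out.append(itos[x.item()])
    print("".join(out))

A graph generative model, by contrast, would construct the molecular graph node
by node and edge by edge rather than emitting a flat character sequence.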
Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files
Language models are powerful tools for molecular design. Currently, the
dominant paradigm is to parse molecular graphs into linear string
representations on which models can easily be trained. This approach has been
very successful; however, it is limited to chemical structures that can be
completely represented by a graph, such as organic molecules, while materials
and biomolecular structures like protein binding sites require a more complete
representation that includes the relative positioning of their atoms in space.
In this work, we show that language models, without any architecture
modifications and trained using next-token prediction, can generate novel and
valid structures in three dimensions from substantially different
distributions of chemical structures. In particular, we demonstrate that
language models trained on sequences derived directly from chemical
file formats like XYZ files, Crystallographic Information files (CIFs), or
Protein Data Bank files (PDBs) can directly generate molecules, crystals, and
protein binding sites in three dimensions. Furthermore, despite being trained
on chemical file sequences, language models still achieve performance
comparable to state-of-the-art models that use graph and graph-derived string
representations, as well as other domain-specific 3D generative models. In
doing so, we demonstrate that it is not necessary to use simplified molecular
representations to train chemical language models: they are powerful
generative models capable of directly exploring chemical space in three
dimensions across very different structures.
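
As a rough illustration of the idea of training directly on chemical file
formats, the sketch below shows one plausible way to turn an XYZ-format
structure into a flat token sequence for ordinary next-token prediction. The
tokenizer, special tokens, and example molecule are assumptions made for
illustration and are not the tokenization scheme used in the paper.

    # Serialize an XYZ file into tokens a causal language model could be trained on.
    xyz = """3
    water
    O 0.000 0.000 0.117
    H 0.000 0.757 -0.471
    H 0.000 -0.757 -0.471"""

    def tokenize_xyz(text):
        tokens = []
        for line in text.splitlines():
            tokens.extend(line.split())      # element symbols and coordinate strings
            tokens.append("<eol>")           # keep the file's line structure explicit
        tokens.append("<eos>")
        return tokens

    tokens = tokenize_xyz(xyz)
    vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]
    print(tokens[:8])
    print(ids[:8])
    # A standard causal language model (e.g. a Transformer decoder) can then be
    # trained on such id sequences with next-token prediction and sampled to
    # produce new XYZ-style structures.

Training on CIF or PDB content would follow the same pattern, only with a
different vocabulary of tokens.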
Learning Interpretable Representations of Entanglement in Quantum Optics Experiments using Deep Generative Models
Quantum physics experiments produce interesting phenomena such as
interference or entanglement, which are core properties of numerous future
quantum technologies. The complex relationship between the setup structure of a
quantum experiment and its entanglement properties is essential to fundamental
research in quantum optics but is difficult to intuitively understand. We
present a deep generative model of quantum optics experiments where a
variational autoencoder is trained on a dataset of quantum optics experimental
setups. In a series of computational experiments, we investigate the learned
representation of our Quantum Optics Variational Auto Encoder (QOVAE) and its
internal understanding of the quantum optics world. We demonstrate that the
QOVAE learns an interpretable representation of quantum optics experiments and
the relationship between experiment structure and entanglement. We show the
QOVAE is able to generate novel experiments for highly entangled quantum states
with specific distributions that match its training data. The QOVAE can learn
to generate specific entangled states and efficiently search the space of
experiments that produce highly entangled quantum states. Importantly, we are
able to interpret how the QOVAE structures its latent space, finding curious
patterns that we can explain in terms of quantum physics. The results
demonstrate how we can use and understand the internal representations of deep
generative models in a complex scientific domain. The QOVAE and the insights
from our investigations can be immediately applied to other physical systems.
Comment: Published in Nature Machine Intelligence,
https://doi.org/10.1038/s42256-022-00493-
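
For readers unfamiliar with the model class, the sketch below is a minimal
variational autoencoder in PyTorch over fixed-length vectors that stand in for
encoded experiment setups. The encoding, layer sizes, and loss weighting are
invented placeholders rather than the published QOVAE architecture.

    # Minimal VAE sketch: encode, reparameterize, decode, and compute the ELBO loss.
    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, dim=64, latent=8):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU())
            self.mu = nn.Linear(128, latent)
            self.logvar = nn.Linear(128, latent)
            self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
            return self.dec(z), mu, logvar

    def elbo_loss(x, recon, mu, logvar):
        rec = nn.functional.mse_loss(recon, x, reduction="sum")       # reconstruction term
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) # KL to the unit-Gaussian prior
        return rec + kld

    model = VAE()
    x = torch.randn(16, 64)          # placeholder batch of encoded experiment setups
    recon, mu, logvar = model(x)
    print(elbo_loss(x, recon, mu, logvar).item())

Searching for setups that produce highly entangled states would then amount to
exploring the learned latent space and decoding candidate experiments.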