Chemical Properties from Graph Neural Network-Predicted Electron Densities
According to density functional theory, any chemical property can be inferred
from the electron density, making it the most informative attribute of an
atomic structure. In this work, we demonstrate the use of established physical
methods to obtain important chemical properties from model-predicted electron
densities. We introduce graph neural network architectural choices that provide
physically relevant and useful electron density predictions. Despite not being
trained to predict atomic charges, the model predicts atomic charges with an
order of magnitude lower error than a sum of atomic charge densities.
Similarly, the model predicts dipole moments with half the error of the sum of
atomic charge densities method. We demonstrate that larger data sets lead to
more useful predictions in these tasks. These results pave the way for an
alternative path in atomistic machine learning, where data-driven approaches
and existing physical methods are used in tandem to obtain a variety of
chemical properties in an explainable and self-consistent manner.
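As a minimal sketch (not the paper's code) of the kind of established physical post-processing the abstract refers to, the example below computes a molecular dipole moment from an electron density sampled on a real-space grid plus nuclear point charges. Array shapes, atomic units, and the function name are assumptions made for illustration.

import numpy as np

def dipole_moment(density, grid_coords, voxel_volume, nuclear_charges, nuclear_coords):
    """density: (N,) electron density at each grid point (e/bohr^3)
    grid_coords: (N, 3) Cartesian coordinates of the grid points (bohr)
    voxel_volume: volume element of the grid (bohr^3)
    nuclear_charges, nuclear_coords: (M,) and (M, 3) for the nuclei."""
    # Electronic contribution: -integral of r * rho(r), approximated by a Riemann sum.
    electronic = -np.einsum("i,ij->j", density, grid_coords) * voxel_volume
    # Nuclear contribution: sum over atoms of Z_A * R_A.
    nuclear = np.einsum("a,aj->j", nuclear_charges, nuclear_coords)
    return electronic + nuclear  # dipole vector in atomic units (e*bohr)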
Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
We propose fine-tuning large language models for generation of stable
materials. While unorthodox, fine-tuning large language models on text-encoded
atomistic data is simple to implement yet reliable, with around 90% of sampled
structures obeying physical constraints on atom positions and charges. Using
energy above hull calculations from both learned ML potentials and
gold-standard DFT calculations, we show that our strongest model (fine-tuned
LLaMA-2 70B) can generate materials predicted to be metastable at about twice
the rate (49% vs 28%) of CDVAE, a competing diffusion model. Because of text
prompting's inherent flexibility, our models can simultaneously be used for
unconditional generation of stable materials, infilling of partial structures,
and text-conditional generation. Finally, we show that language models' ability
to capture key symmetries of crystal structures improves with model scale,
suggesting that the biases of pretrained LLMs are surprisingly well-suited for
atomistic data.
Comment: ICLR 2024. Code available at:
https://github.com/facebookresearch/crystal-ll
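As a rough illustration of text-encoded atomistic data, the sketch below serializes a crystal structure (lattice lengths, lattice angles, and element symbols with fractional coordinates) into a plain string that a language model could be fine-tuned on. The exact format, field order, and rounding used in the paper may differ; this is an assumption-laden example, not the released encoding.

def structure_to_text(lattice_lengths, lattice_angles, species, frac_coords):
    """lattice_lengths, lattice_angles: three floats each; species: element symbols;
    frac_coords: list of (x, y, z) fractional coordinates."""
    lines = [" ".join(f"{x:.1f}" for x in lattice_lengths),
             " ".join(f"{x:.0f}" for x in lattice_angles)]
    for elem, (x, y, z) in zip(species, frac_coords):
        # One element symbol per line, followed by its fractional coordinates.
        lines.append(f"{elem}\n{x:.2f} {y:.2f} {z:.2f}")
    return "\n".join(lines)

# Example: a cubic perovskite-like cell with hypothetical values.
print(structure_to_text([3.9, 3.9, 3.9], [90, 90, 90],
                        ["Sr", "Ti", "O", "O", "O"],
                        [(0, 0, 0), (0.5, 0.5, 0.5),
                         (0.5, 0.5, 0), (0.5, 0, 0.5), (0, 0.5, 0.5)]))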
GemNet-OC: Developing Graph Neural Networks for Large and Diverse Molecular Simulation Datasets
Recent years have seen the advent of molecular simulation datasets that are
orders of magnitude larger and more diverse. These new datasets differ
substantially in four aspects of complexity: 1. chemical diversity (number of
different elements), 2. system size (number of atoms per sample), 3. dataset
size (number of data samples), and 4. domain shift (similarity of the training
and test sets). Despite these large differences, benchmarks on small and narrow
datasets remain the predominant method of demonstrating progress in graph
neural networks (GNNs) for molecular simulation, likely due to cheaper training
compute requirements. This raises the question -- does GNN progress on small
and narrow datasets translate to these more complex datasets? This work
investigates this question by first developing the GemNet-OC model based on the
large Open Catalyst 2020 (OC20) dataset. GemNet-OC outperforms the previous
state-of-the-art on OC20 by 16% while reducing training time by a factor of 10.
We then compare the impact of 18 model components and hyperparameter choices on
performance across multiple datasets. We find that the resulting model would be
drastically different depending on the dataset used for making model choices.
To isolate the source of this discrepancy we study six subsets of the OC20
dataset that individually test each of the above-mentioned four dataset
aspects. We find that results on the OC-2M subset correlate well with the full
OC20 dataset while being substantially cheaper to train on. Our findings
challenge the common practice of developing GNNs solely on small datasets, but also
highlight ways of achieving fast development cycles and generalizable results
via moderately-sized, representative datasets such as OC-2M and efficient
models such as GemNet-OC. Our code and pretrained model weights are
open-sourced.
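The claim that OC-2M results track full-OC20 results suggests a simple sanity check: rank a set of candidate models by validation error on the subset and on the full dataset, then measure the rank correlation. The sketch below does this with scipy; the error values are placeholders for illustration, not numbers from the paper.

from scipy.stats import spearmanr

# Hypothetical per-model validation errors on the cheap subset and the full dataset.
subset_errors = [0.31, 0.28, 0.35, 0.26, 0.30]
full_errors   = [0.45, 0.41, 0.52, 0.39, 0.44]

# A high rank correlation means model choices made on the subset transfer to the full set.
rho, pvalue = spearmanr(subset_errors, full_errors)
print(f"Spearman rank correlation: {rho:.2f} (p={pvalue:.3f})")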
From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction
Foundation models have been transformational in machine learning fields such
as natural language processing and computer vision. Similar success in atomic
property prediction has been limited due to the challenges of training
effective models across multiple chemical domains. To address this, we
introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training
strategy that simultaneously trains on multiple datasets from different
chemical domains, treating each dataset as a unique pre-training task within a
multi-task framework. Our combined training dataset consists of 120M
systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and
generalization by fine-tuning over a diverse set of downstream tasks and
datasets, including QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP
demonstrates an average improvement of 59% over training from scratch, and
matches or sets the state of the art on 34 out of 40 tasks. Our work highlights
the potential of pre-training strategies that utilize diverse data to advance
property prediction across chemical domains, especially for low-data tasks.
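A minimal sketch of the general idea of supervised multi-dataset pre-training, with a shared backbone and one prediction head per dataset, is given below. The module names, single-target heads, and summed L1 loss are assumptions for illustration, not JMP's actual architecture or training loop.

import torch
import torch.nn as nn

class MultiDatasetModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, dataset_names):
        super().__init__()
        self.backbone = backbone                        # shared representation
        self.heads = nn.ModuleDict({                    # one head per pre-training dataset
            name: nn.Linear(feat_dim, 1) for name in dataset_names
        })

    def forward(self, x, dataset_name: str):
        return self.heads[dataset_name](self.backbone(x))

def pretrain_step(model, optimizer, batches):
    """batches: dict mapping dataset name -> (inputs, targets) for one optimization step."""
    optimizer.zero_grad()
    # Treat each dataset as its own task and sum the per-task losses.
    loss = sum(nn.functional.l1_loss(model(x, name), y)
               for name, (x, y) in batches.items())
    loss.backward()
    optimizer.step()
    return loss.item()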
The Open DAC 2023 Dataset and Challenges for Sorbent Discovery in Direct Air Capture
New methods for carbon dioxide removal are urgently needed to combat global
climate change. Direct air capture (DAC) is an emerging technology to capture
carbon dioxide directly from ambient air. Metal-organic frameworks (MOFs) have
been widely studied as potentially customizable adsorbents for DAC. However,
discovering promising MOF sorbents for DAC is challenging because of the vast
chemical space to explore and the need to understand materials as functions of
humidity and temperature. We explore a computational approach benefiting from
recent innovations in machine learning (ML) and present a dataset named Open
DAC 2023 (ODAC23) consisting of more than 38M density functional theory (DFT)
calculations on more than 8,400 MOF materials containing adsorbed CO2 and/or
H2O. ODAC23 is by far the largest dataset of MOF adsorption calculations at
the DFT level of accuracy currently available. In addition to probing
properties of adsorbed molecules, the dataset is a rich source of information
on structural relaxation of MOFs, which will be useful in many contexts beyond
specific applications for DAC. A large number of MOFs with promising properties
for DAC are identified directly in ODAC23. We also trained state-of-the-art ML
models on this dataset to approximate calculations at the DFT level. This
open-source dataset and our initial ML models will provide an important
baseline for future efforts to identify MOFs for a wide range of applications,
including DAC.
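For context, the sketch below shows the standard adsorption-energy bookkeeping such a dataset supports, assuming the usual definition from total energies; the function and argument names are illustrative, not part of the ODAC23 tooling.

def adsorption_energy(e_combined: float, e_mof: float, e_adsorbate: float) -> float:
    """E_ads = E(MOF + adsorbate) - E(MOF) - E(adsorbate), all in eV.
    A negative value indicates energetically favorable adsorption."""
    return e_combined - e_mof - e_adsorbate

# Example with made-up energies (eV): a mildly favorable adsorption of about -0.5 eV.
print(adsorption_energy(-1052.7, -1040.1, -12.1))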
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials
Computational catalysis is playing an increasingly significant role in the
design of catalysts across a wide range of applications. A common task for many
computational methods is the need to accurately compute the adsorption energy
for an adsorbate and a catalyst surface of interest. Traditionally, the
identification of low energy adsorbate-surface configurations relies on
heuristic methods and researcher intuition. As the desire to perform
high-throughput screening increases, it becomes challenging to use heuristics
and intuition alone. In this paper, we demonstrate machine learning potentials
can be leveraged to identify low energy adsorbate-surface configurations more
accurately and efficiently. Our algorithm provides a spectrum of trade-offs
between accuracy and efficiency, with one balanced option finding the lowest
energy configuration 87.36% of the time, while achieving a 2000x speedup in
computation. To standardize benchmarking, we introduce the Open Catalyst Dense
dataset containing nearly 1,000 diverse surfaces and 100,000 unique
configurations.
Comment: 26 pages, 7 figures. Submitted to npj Computational Materials
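The sketch below illustrates the accuracy-efficiency trade-off the abstract describes: relax many candidate adsorbate placements with a cheap machine-learned potential, keep only the k lowest-energy candidates, and verify those with DFT. The relax_ml and evaluate_dft callables are hypothetical placeholders for whatever relaxation and single-point codes are available; a larger k trades speed for a higher chance of finding the true lowest-energy configuration.

def screen_configurations(configurations, relax_ml, evaluate_dft, k=5):
    """configurations: iterable of candidate adsorbate-surface structures.
    relax_ml(config) -> (relaxed_config, ml_energy)
    evaluate_dft(config) -> energy (same units as ml_energy)."""
    # Relax every candidate with the inexpensive ML potential.
    relaxed = [relax_ml(c) for c in configurations]
    # Shortlist the k most promising candidates according to the ML energies.
    shortlist = sorted(relaxed, key=lambda pair: pair[1])[:k]
    # Verify the shortlist at DFT accuracy and return the lowest-energy result.
    return min(((c, evaluate_dft(c)) for c, _ in shortlist), key=lambda pair: pair[1])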