From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction
Foundation models have been transformational in machine learning fields such
as natural language processing and computer vision. Similar success in atomic
property prediction has been limited due to the challenges of training
effective models across multiple chemical domains. To address this, we
introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training
strategy that simultaneously trains on multiple datasets from different
chemical domains, treating each dataset as a unique pre-training task within a
multi-task framework. Our combined training dataset consists of 120M
systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and
generalization by fine-tuning on a diverse set of downstream tasks and
datasets, including QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP
demonstrates an average improvement of 59% over training from scratch, and
matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the
potential of pre-training strategies that utilize diverse data to advance
property prediction across chemical domains, especially for low-data tasks.
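To make the multi-task setup above concrete, here is a minimal PyTorch sketch of one shared backbone with a separate prediction head per pre-training dataset; MultiTaskModel, pretrain_step, and the per-task batch iterators are hypothetical names for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared encoder with one regression head per pre-training dataset."""
    def __init__(self, backbone: nn.Module, feat_dim: int, task_names):
        super().__init__()
        self.backbone = backbone  # shared encoder, e.g. a GNN over atomic graphs
        # one head per chemical-domain dataset (OC20, OC22, ANI-1x, Transition-1x)
        self.heads = nn.ModuleDict({name: nn.Linear(feat_dim, 1) for name in task_names})

    def forward(self, batch, task: str) -> torch.Tensor:
        features = self.backbone(batch)      # (batch_size, feat_dim)
        return self.heads[task](features)    # per-dataset prediction

def pretrain_step(model, optimizer, task_iters, loss_fn=nn.L1Loss()):
    # One optimizer step mixing one batch from every pre-training task, so
    # gradients from all chemical domains update the shared backbone together.
    optimizer.zero_grad()
    total = 0.0
    for task, batches in task_iters.items():
        inputs, targets = next(batches)
        total = total + loss_fn(model(inputs, task), targets)
    total.backward()
    optimizer.step()
    return float(total)
```

Treating each dataset as its own head sidesteps the fact that labels from different chemical domains (different DFT settings, different targets) are not directly comparable, while still letting the backbone learn from all of them.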
Open Challenges in Developing Generalizable Large Scale Machine Learning Models for Catalyst Discovery
The development of machine learned potentials for catalyst discovery has
predominantly been focused on very specific chemistries and material
compositions. While effective in interpolating between available materials,
these approaches struggle to generalize across chemical space. The recent
curation of large-scale catalyst datasets has offered the opportunity to build
a universal machine learning potential, spanning chemical and composition
space. If accomplished, said potential could accelerate the catalyst discovery
process across a variety of applications (CO2 reduction, NH3 production, etc.)
without additional specialized training efforts that are currently required.
The release of the Open Catalyst 2020 (OC20) dataset has begun just that, pushing the
heterogeneous catalysis and machine learning communities towards building more
accurate and robust models. In this perspective, we discuss some of the
challenges and findings of recent developments on OC20. We examine the
performance of current models across different materials and adsorbates to
identify notably underperforming subsets. We then discuss some of the modeling
efforts surrounding energy conservation, approaches to finding and evaluating
local minima, and augmentation of off-equilibrium data. To complement the
community's ongoing developments, we end with an outlook to some of the
important challenges that have yet to be thoroughly explored for large-scale
catalyst discovery.
Comment: submitted to ACS Catalysis
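On the energy-conservation point above, one common remedy is to predict a scalar energy and differentiate it with respect to atomic positions, so the resulting forces are conservative by construction rather than coming from a separate direct-force head. A minimal sketch, assuming energy_model is any differentiable E(positions) regressor:

```python
import torch

def conservative_forces(energy_model, positions: torch.Tensor):
    """Return (energy, forces) with F = -dE/dr, conservative by construction."""
    positions = positions.detach().requires_grad_(True)
    energy = energy_model(positions).sum()  # scalar total energy
    # create_graph=True lets a force-matching loss backpropagate through the gradient
    forces = -torch.autograd.grad(energy, positions, create_graph=True)[0]
    return energy, forces
```

The trade-off discussed in this space is that gradient-based forces guarantee energy conservation (useful for molecular dynamics) but cost an extra backward pass compared to predicting forces directly.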
The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts
The development of machine learning models for electrocatalysts requires a
broad set of training data to enable their use across a wide variety of
materials. One class of materials that currently lacks sufficient training data
is oxides, which are critical for the development of Oxygen Evolution Reaction
(OER) catalysts. To address this, we developed the Open Catalyst 2022 (OC22)
dataset, consisting of 62,331 Density Functional Theory (DFT) relaxations
(~9,854,504 single point calculations) across a range of oxide materials,
coverages, and adsorbates. We define generalized total energy tasks that enable
property prediction beyond adsorption energies; we test baseline performance of
several graph neural networks; and we provide pre-defined dataset splits to
establish clear benchmarks for future efforts. In the most general task,
GemNet-OC sees a ~32% improvement in energy predictions when combining the
chemically dissimilar Open Catalyst 2020 Dataset (OC20) and OC22 datasets via
fine-tuning. Similarly, we achieved a ~19% improvement in total energy
predictions on OC20 and a ~9% improvement in force predictions on OC22 when
using joint training. We demonstrate the practical utility of a top performing
model by capturing literature adsorption energies and important OER scaling
relationships. We expect OC22 to provide an important benchmark for models
seeking to incorporate intricate long-range electrostatic and magnetic
interactions in oxide surfaces. The dataset and baseline models are open
sourced, and a public leaderboard has been made available to encourage
continued community developments on the total energy tasks and data.
Comment: 48 pages, 14 figures
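As a rough sketch of the fine-tuning recipe reported above (adapting an OC20-pre-trained model to OC22 total-energy targets); the checkpoint path and attribute names are illustrative placeholders, not the released OCP interface:

```python
import torch

# Load a model pre-trained on OC20 (hypothetical checkpoint path).
model = torch.load("gemnet_oc_oc20.pt")
model.output_head.reset_parameters()  # fresh head for OC22 total-energy targets

# Discriminative learning rates: keep pre-trained backbone features mostly
# intact while the new output head adapts faster to the oxide data.
optimizer = torch.optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},
    {"params": model.output_head.parameters(), "lr": 1e-4},
])
```

The reported gains from fine-tuning and joint training both follow this pattern: the chemically dissimilar OC20 data acts as broad pre-training signal, while OC22-specific optimization specializes the model to oxide surfaces.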