
    From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction

    Foundation models have been transformational in machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited due to the challenges of training effective models across multiple chemical domains. To address this, we introduce Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy that simultaneously trains on multiple datasets from different chemical domains, treating each dataset as a unique pre-training task within a multi-task framework. Our combined training dataset consists of ~120M systems from OC20, OC22, ANI-1x, and Transition-1x. We evaluate performance and generalization by fine-tuning on a diverse set of downstream tasks and datasets, including QM9, rMD17, MatBench, QMOF, SPICE, and MD22. JMP demonstrates an average improvement of 59% over training from scratch, and matches or sets state-of-the-art on 34 out of 40 tasks. Our work highlights the potential of pre-training strategies that utilize diverse data to advance property prediction across chemical domains, especially for low-data tasks.
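    The multi-task framing described above can be illustrated with a short, self-contained sketch: a shared encoder, one output head per pre-training dataset, and a summed per-task loss. The backbone, heads, dataset names, and loss choice below are hypothetical placeholders for illustration, not the JMP implementation.

```python
# Minimal sketch of joint multi-task pre-training across chemical domains (illustrative only).
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in for a graph neural network encoder shared by all pre-training tasks."""
    def __init__(self, in_dim=32, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.encoder(x)

backbone = SharedBackbone()
# One output head per pre-training dataset, treating each dataset as its own task.
heads = nn.ModuleDict({name: nn.Linear(128, 1) for name in ["oc20", "oc22", "ani1x", "transition1x"]})
optimizer = torch.optim.AdamW(list(backbone.parameters()) + list(heads.parameters()), lr=1e-4)

def pretraining_step(batches):
    """`batches` maps dataset name -> (features, energy targets); per-task losses are summed."""
    optimizer.zero_grad()
    loss = torch.zeros(())
    for name, (x, y) in batches.items():
        pred = heads[name](backbone(x)).squeeze(-1)
        loss = loss + nn.functional.l1_loss(pred, y)  # one L1 energy loss per dataset/task
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random stand-in batches, one per dataset.
fake_batches = {name: (torch.randn(16, 32), torch.randn(16)) for name in heads}
print(pretraining_step(fake_batches))
```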

    Open Challenges in Developing Generalizable Large Scale Machine Learning Models for Catalyst Discovery

    The development of machine learned potentials for catalyst discovery has predominantly been focused on very specific chemistries and material compositions. While effective in interpolating between available materials, these approaches struggle to generalize across chemical space. The recent curation of large-scale catalyst datasets has offered the opportunity to build a universal machine learning potential spanning chemical and composition space. If accomplished, such a potential could accelerate the catalyst discovery process across a variety of applications (CO2 reduction, NH3 production, etc.) without the additional specialized training efforts that are currently required. The release of the Open Catalyst 2020 (OC20) dataset has begun just that, pushing the heterogeneous catalysis and machine learning communities towards building more accurate and robust models. In this perspective, we discuss some of the challenges and findings of recent developments on OC20. We examine the performance of current models across different materials and adsorbates to identify notably underperforming subsets. We then discuss some of the modeling efforts surrounding energy conservation, approaches to finding and evaluating local minima, and the augmentation of off-equilibrium data. To complement the community's ongoing developments, we end with an outlook on some of the important challenges that have yet to be thoroughly explored for large-scale catalyst discovery.
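    The energy-conservation point above refers to obtaining forces as the negative gradient of a predicted potential energy with respect to atomic positions, rather than from an independent force head. A minimal sketch of that idea, using a toy placeholder energy function (not any model from the paper):

```python
# Illustrative energy-conserving force computation via automatic differentiation.
import torch

def toy_energy(positions):
    """Placeholder potential energy: sum of pairwise inverse distances (not a real ML potential)."""
    dists = torch.cdist(positions, positions)
    off_diagonal = ~torch.eye(len(positions), dtype=torch.bool)
    return (1.0 / dists[off_diagonal]).sum()

positions = torch.randn(8, 3, requires_grad=True)  # 8 atoms in 3D, toy coordinates
energy = toy_energy(positions)
# Differentiating the scalar energy yields forces that are conservative by construction.
forces = -torch.autograd.grad(energy, positions)[0]
print(energy.item(), forces.shape)
```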

    The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts

    The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of Oxygen Evolution Reaction (OER) catalysts. To address this, we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,331 Density Functional Theory (DFT) relaxations (~9,854,504 single point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test baseline performance of several graph neural networks; and we provide pre-defined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ~32% improvement in energy predictions when combining the chemically dissimilar Open Catalyst 2020 (OC20) and OC22 datasets via fine-tuning. Similarly, we achieve a ~19% improvement in total energy predictions on OC20 and a ~9% improvement in force predictions on OC22 when using joint training. We demonstrate the practical utility of a top-performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open sourced, and a public leaderboard has been made available to encourage continued community developments on the total energy tasks and data.
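    The generalized total energy tasks make adsorption energies recoverable by differencing total energies. A minimal sketch of that bookkeeping, with made-up energies and a hypothetical gas-phase reference (the values are not from OC22):

```python
# Sketch of deriving an adsorption energy from total-energy predictions (illustrative numbers only).
def adsorption_energy(e_adslab_total, e_clean_slab_total, e_adsorbate_ref):
    """E_ads = E(slab + adsorbate) - E(clean slab) - E(gas-phase adsorbate reference), all in eV."""
    return e_adslab_total - e_clean_slab_total - e_adsorbate_ref

# Hypothetical predicted total energies (eV) from a model trained on total-energy targets.
print(adsorption_energy(-312.4, -305.1, -6.8))  # -> -0.5 eV
```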