486 research outputs found

    Prediction of Atomization Energy Using Graph Kernel and Active Learning

    Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaptation, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. To apply the marginalized graph kernel, a spatial adjacency rule is first employed to convert molecules into graphs whose vertices and edges are labeled by elements and interatomic distances, respectively. We then derive formulas for the efficient evaluation of the kernel. Specific functional components for the marginalized graph kernel are proposed, and the effect of the associated hyperparameters on accuracy and predictive confidence is examined. We show that the graph kernel is particularly suitable for predicting extensive properties because its convolutional structure coincides with that of the covariance formula between sums of random variables. Using an active learning procedure, we demonstrate that the proposed method can achieve a mean absolute error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7 data set.
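    The core regression step described above can be sketched with a minimal Gaussian process posterior mean computed from precomputed kernel matrices. This is a generic sketch, not the paper's implementation: a simple RBF kernel stands in for the marginalized graph kernel, and the `noise` value is an illustrative assumption.

    ```python
    import numpy as np

    def gp_predict(K_train, K_cross, y_train, noise=1e-2):
        """Gaussian process posterior mean from precomputed kernels.

        K_train: (n, n) kernel between training molecules
        K_cross: (m, n) kernel between test and training molecules
        noise:   regularizing noise variance on the diagonal
        """
        n = K_train.shape[0]
        alpha = np.linalg.solve(K_train + noise * np.eye(n), y_train)
        return K_cross @ alpha

    # Toy data: an RBF kernel stands in for the graph kernel,
    # and a sum of coordinates mimics an extensive property.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X.sum(axis=1)

    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2)

    mean = gp_predict(rbf(X, X), rbf(X[:5], X), y)
    ```

    Because the kernel only enters through the Gram matrices, the same pipeline works for any positive-definite similarity between molecules, including graph kernels.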

    Hybrid localized graph kernel for machine learning energy-related properties of molecules and solids

    Nowadays, the coupling of electronic structure and machine learning techniques serves as a powerful tool to predict chemical and physical properties of a broad range of systems. With the aim of improving the accuracy of predictions, a large number of representations for molecules and solids for machine learning applications have been developed. In this work we propose a novel descriptor based on the notion of a molecular graph. While graphs are widely employed in classification problems in cheminformatics and bioinformatics, they are not often used in regression problems, especially for energy-related properties. Our method is based on a local decomposition of atomic environments and on the hybridization of two kernel functions: a graph kernel contribution that describes the chemical pattern and a Coulomb label contribution that encodes finer details of the local geometry. The accuracy of this new kernel method in energy predictions of molecular and condensed phase systems is demonstrated on the popular QM7 and BA10 datasets. These examples show that the hybrid localized graph kernel outperforms traditional approaches such as the smooth overlap of atomic positions (SOAP) and the Coulomb matrix.
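    The "local decomposition of atomic environments" mentioned above usually means that the kernel between two molecules is a sum of kernels between their per-atom environments. The sketch below illustrates that construction under the assumption of generic per-atom feature vectors and a user-supplied local kernel; the paper's actual graph/Coulomb hybridization is not reproduced here.

    ```python
    import numpy as np

    def molecular_kernel(env_feats_a, env_feats_b, k_local):
        """Kernel between two molecules as a sum over local-environment
        kernels -- the standard construction for extensive properties.

        env_feats_a, env_feats_b: (n_atoms, d) arrays of per-atom
            environment features (hypothetical descriptors)
        k_local: kernel function between two environment vectors
        """
        total = 0.0
        for fa in env_feats_a:
            for fb in env_feats_b:
                total += k_local(fa, fb)
        return total
    ```

    A hybrid scheme can then be obtained by letting `k_local` combine two contributions, e.g. a product or weighted sum of a chemical-pattern kernel and a geometry kernel.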

    Prediction of the Atomization Energy of Molecules Using Coulomb Matrix and Atomic Composition in a Bayesian Regularized Neural Networks

    Exact calculation of electronic properties of molecules is a fundamental step for rational compound and materials design. The intrinsically graph-like and non-vectorial nature of molecular data generates a unique and challenging machine learning problem. In this paper we embrace a learning-from-scratch approach where the quantum mechanical electronic properties of molecules are predicted directly from the raw molecular geometry, similar to some recent works. But, unlike these previous endeavors, our study suggests a benefit from combining the molecular geometry embedded in the Coulomb matrix with the atomic composition of molecules. Using the new combined features in a Bayesian regularized neural network, our results improve well-known results from the literature on the QM7 dataset from a mean absolute error of 3.51 kcal/mol down to 3.0 kcal/mol. (Comment: Under review ICANN 201)
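    The Coulomb matrix used as the geometry descriptor above has a standard definition (Rupp et al.): diagonal entries approximate the atomic self-energy as 0.5·Z^2.4, and off-diagonal entries are the Coulomb repulsion between nuclei. A minimal sketch:

    ```python
    import numpy as np

    def coulomb_matrix(Z, R):
        """Coulomb matrix from nuclear charges Z (n,) and Cartesian
        coordinates R (n, 3):

            M_ii = 0.5 * Z_i ** 2.4
            M_ij = Z_i * Z_j / |R_i - R_j|   for i != j
        """
        n = len(Z)
        M = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    M[i, j] = 0.5 * Z[i] ** 2.4
                else:
                    M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
        return M
    ```

    The combined features the paper proposes would then concatenate (a flattened, symmetry-handled form of) this matrix with per-element atom counts.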

    On minimizing the training set fill distance in machine learning regression

    For regression tasks one often leverages large datasets for training predictive machine learning models. However, using large datasets may not be feasible due to computational limitations or high data labelling costs. Therefore, suitably selecting small training sets from large pools of unlabelled data points is essential to maximize model performance while maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a data selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error, conditional on the location of the unlabelled data points, that depends linearly on the training set fill distance. For empirical validation, we perform experiments using two regression models on three datasets. We empirically show that selecting a training set by aiming to minimize the fill distance, thereby minimizing our derived bound, significantly reduces the maximum prediction error of various regression models, outperforming alternative sampling approaches by a large margin. Furthermore, we show that selecting training sets with FPS can also increase model stability for the specific case of Gaussian kernel regression approaches.
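    Farthest Point Sampling itself is a simple greedy procedure: starting from a seed point, repeatedly add the point farthest from the current selection, which greedily shrinks the fill distance. A minimal sketch on Euclidean feature vectors (the distance metric and seed choice are assumptions):

    ```python
    import numpy as np

    def farthest_point_sampling(X, k, start=0):
        """Select k indices from X (n, d) by greedy farthest-point
        sampling, tracking each point's distance to the nearest
        selected point."""
        selected = [start]
        d = np.linalg.norm(X - X[start], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(d))       # point farthest from selection
            selected.append(nxt)
            d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
        return selected
    ```

    The fill distance of the selection is then `d.max()` after the loop; each greedy step removes the current worst-covered point.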

    Uncertainty Quantification Using Neural Networks for Molecular Property Prediction

    Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where unanticipated imprecision wastes valuable time and resources. The need for UQ is especially acute for neural models, which are becoming increasingly standard yet are challenging to interpret. While several approaches to UQ have been proposed in the literature, there is no clear consensus on the comparative performance of these models. In this paper, we study this question in the context of regression tasks. We systematically evaluate several methods on five benchmark datasets using multiple complementary performance metrics. Our experiments show that none of the methods we tested is unequivocally superior to all others, and none produces a particularly reliable ranking of errors across multiple datasets. While we believe these results show that existing UQ methods are not sufficient for all common use cases and demonstrate the benefits of further research, we conclude with a practical recommendation as to which existing techniques seem to perform well relative to others.
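    One common baseline among the UQ methods such comparisons evaluate is the deep ensemble: train several models independently and use the spread of their predictions as an uncertainty estimate. A minimal sketch of that aggregation step (the models themselves are assumed to exist):

    ```python
    import numpy as np

    def ensemble_uncertainty(predictions):
        """Ensemble-based UQ: mean prediction and its spread across
        independently trained models.

        predictions: (n_models, n_samples) array of per-model outputs
        returns: (mean, std) per sample; std serves as the
                 uncertainty estimate
        """
        mean = predictions.mean(axis=0)
        std = predictions.std(axis=0)
        return mean, std
    ```

    Evaluations like the one above then check, e.g., whether samples with larger `std` actually incur larger errors (error ranking) and whether the stds are calibrated.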

    Ab initio machine learning in chemical compound space

    Chemical compound space (CCS), the set of all theoretically conceivable combinations of chemical elements and (meta-)stable geometries that make up matter, is colossal. The first-principles-based virtual sampling of this space, for example in search of novel molecules or materials which exhibit desirable properties, is therefore prohibitive for all but the smallest subsets and simplest properties. We review studies aimed at tackling this challenge using modern machine learning techniques based on (i) synthetic data, typically generated using quantum mechanics based methods, and (ii) model architectures inspired by quantum mechanics. Such Quantum mechanics based Machine Learning (QML) approaches combine the numerical efficiency of statistical surrogate models with an ab initio view on matter. They rigorously reflect the underlying physics in order to reach universality and transferability across CCS. While state-of-the-art approximations to quantum problems impose severe computational bottlenecks, recent QML based developments indicate the possibility of substantial acceleration without sacrificing the predictive power of quantum mechanics.

    On the Interplay of Subset Selection and Informed Graph Neural Networks

    Machine learning techniques paired with the availability of massive datasets dramatically enhance our ability to explore the chemical compound space by providing fast and accurate predictions of molecular properties. However, learning on large datasets is strongly limited by the availability of computational resources and can be infeasible in some scenarios. Moreover, the instances in the datasets may not yet be labelled, and generating the labels can be costly, as in the case of quantum chemistry computations. Thus, there is a need to select small training subsets from large pools of unlabelled data points and to develop reliable ML methods that can effectively learn from small training sets. This work focuses on predicting the atomization energy of molecules in the QM9 dataset. We investigate the advantages of employing domain-knowledge-based data sampling methods for efficient training set selection combined with informed ML techniques. In particular, we show how maximizing molecular diversity in the training set selection process increases the robustness of linear and nonlinear regression techniques such as kernel methods and graph neural networks. We also check the reliability of the predictions made by the graph neural network with a model-agnostic explainer based on the rate-distortion explanation framework.