486 research outputs found
Prediction of Atomization Energy Using Graph Kernel and Active Learning
Data-driven prediction of molecular properties presents unique challenges to
the design of machine learning methods concerning data
structure/dimensionality, symmetry adaptation, and confidence management. In this
paper, we present a kernel-based pipeline that can learn and predict the
atomization energy of molecules with high accuracy. The framework employs
Gaussian process regression to perform predictions based on the similarity
between molecules, which is computed using the marginalized graph kernel. To
apply the marginalized graph kernel, a spatial adjacency rule is first employed
to convert molecules into graphs whose vertices and edges are labeled by
elements and interatomic distances, respectively. We then derive formulas for
the efficient evaluation of the kernel. Specific functional components for the
marginalized graph kernel are proposed, while the effect of the associated
hyperparameters on accuracy and predictive confidence is examined. We show
that the graph kernel is particularly suitable for predicting extensive
properties because its convolutional structure coincides with that of the
covariance formula between sums of random variables. Using an active learning
procedure, we demonstrate that the proposed method can achieve a mean absolute
error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7
data set.
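The pipeline this abstract describes (Gaussian process regression over a kernel between molecules) can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' code: it substitutes a plain RBF kernel on toy feature vectors for the marginalized graph kernel, and all function names are hypothetical.

```python
import math

def rbf_kernel(x, y, length_scale=1.0):
    """Stand-in similarity; the paper uses a marginalized graph kernel here."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gpr_predict(X_train, y_train, x_star, kernel, noise=1e-6):
    """GP regression posterior mean: k_*^T (K + sigma^2 I)^{-1} y."""
    n = len(X_train)
    K = [[kernel(X_train[i], X_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, list(y_train))
    return sum(kernel(x_star, x) * a for x, a in zip(X_train, alpha))

# Tiny demo: with negligible noise the GP interpolates its training data.
X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.0, 4.0]
pred = gpr_predict(X, y, (1.0,), rbf_kernel)
```

Swapping `rbf_kernel` for a graph kernel changes only the similarity function; the GP machinery is unchanged, which is what makes the kernel-based pipeline modular.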
Machine Learning, Quantum Mechanics, and Chemical Compound Space
We review recent studies dealing with the generation of machine learning
models of molecular and solid properties. The models are trained and validated
using standard quantum chemistry results obtained for organic molecules and
materials selected from chemical space at random.
Hybrid localized graph kernel for machine learning energy-related properties of molecules and solids
Nowadays, the coupling of electronic structure methods and machine learning
techniques serves as a powerful tool to predict chemical and physical
properties of a broad range of systems. With the aim of improving the accuracy
of predictions, a large number of representations of molecules and solids for
machine learning applications have been developed. In this work we propose a
novel descriptor based on the notion of a molecular graph. While graphs are
widely employed in classification problems in cheminformatics and
bioinformatics, they are seldom used in regression problems, especially for
energy-related properties. Our method is based on a local decomposition of
atomic environments and on the hybridization of two kernel functions: a graph
kernel contribution that describes the chemical pattern and a Coulomb label
contribution that encodes finer details of the local geometry. The accuracy of
this new kernel method in energy predictions of molecular and condensed phase
systems is demonstrated by considering the popular QM7 and BA10 datasets. These
examples show that the hybrid localized graph kernel outperforms traditional
approaches such as the smooth overlap of atomic positions (SOAP)
and the Coulomb matrix.
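As a rough illustration of the hybridization idea (a chemical-pattern term multiplied by a finer geometric term over local atomic environments), here is a toy sketch. Both contributions are invented stand-ins: the paper's actual graph-kernel and Coulomb-label terms are more elaborate, and the environment format below is hypothetical.

```python
import math

def pattern_kernel(env_a, env_b):
    """Chemical-pattern part: Jaccard overlap of element sets (a toy
    stand-in for the graph-kernel contribution)."""
    shared = len(set(env_a["elements"]) & set(env_b["elements"]))
    total = len(set(env_a["elements"]) | set(env_b["elements"]))
    return shared / total if total else 0.0

def coulomb_label_kernel(env_a, env_b, gamma=0.5):
    """Geometry part: RBF on sorted Coulomb-style labels Z_i * Z_j / r_ij."""
    la = sorted(env_a["coulomb"], reverse=True)
    lb = sorted(env_b["coulomb"], reverse=True)
    m = max(len(la), len(lb))           # pad so the distance is well defined
    la += [0.0] * (m - len(la))
    lb += [0.0] * (m - len(lb))
    d2 = sum((a - b) ** 2 for a, b in zip(la, lb))
    return math.exp(-gamma * d2)

def hybrid_kernel(env_a, env_b):
    """Hybridization: multiply the pattern and geometry contributions."""
    return pattern_kernel(env_a, env_b) * coulomb_label_kernel(env_a, env_b)

env_a = {"elements": ("C", "H", "H"), "coulomb": [5.5, 0.5, 0.5]}
env_b = {"elements": ("O", "H"), "coulomb": [4.0]}
k_same = hybrid_kernel(env_a, env_a)   # identical environments -> 1.0
k_diff = hybrid_kernel(env_a, env_b)   # partial overlap -> between 0 and 1
```

Multiplying the two terms means an environment pair scores high only when both the chemical pattern and the local geometry agree, which is the intuition behind combining the two kernel functions.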
Prediction of the Atomization Energy of Molecules Using Coulomb Matrix and Atomic Composition in Bayesian Regularized Neural Networks
Exact calculation of electronic properties of molecules is a fundamental step
for intelligent and rational compound and materials design. The intrinsically
graph-like and non-vectorial nature of molecular data generates a unique and
challenging machine learning problem. In this paper we embrace a learning from
scratch approach where the quantum mechanical electronic properties of
molecules are predicted directly from the raw molecular geometry, similar to
some recent works. But, unlike these previous endeavors, our study suggests a
benefit from combining molecular geometry embedded in the Coulomb matrix with
the atomic composition of molecules. Using the new combined features in
Bayesian regularized neural networks, our results improve on well-known results
from the literature for the QM7 dataset, reducing the mean absolute error from
3.51 kcal/mol down to 3.0 kcal/mol.
Comment: Under review ICANN 201
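For reference, the Coulomb matrix representation this abstract combines with atomic composition has a simple closed form: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. The sketch below builds both feature blocks; the element list and helper names are illustrative, not the authors' code.

```python
import math

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i^2.4, M_ij = Z_i * Z_j / |R_i - R_j|."""
    n = len(Z)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * Z[i] ** 2.4
            else:
                M[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])
    return M

def composition_vector(Z, elements=(1, 6, 7, 8, 16)):
    """Atom counts per element type (H, C, N, O, S, the elements in QM7)."""
    return [sum(1 for z in Z if z == e) for e in elements]

def combined_features(Z, R):
    """Flattened upper-triangle Coulomb matrix plus composition counts,
    mirroring the combined representation the abstract proposes."""
    M = coulomb_matrix(Z, R)
    n = len(Z)
    upper = [M[i][j] for i in range(n) for j in range(i, n)]
    return upper + [float(c) for c in composition_vector(Z)]

# Water-like toy molecule: O at the origin, two H atoms (Angstrom positions).
Z = [8, 1, 1]
R = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
feats = combined_features(Z, R)
```

The composition block is what distinguishes the combined features from the plain Coulomb matrix: it supplies element counts that the matrix encodes only implicitly.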
On minimizing the training set fill distance in machine learning regression
For regression tasks one often leverages large datasets for training
predictive machine learning models. However, using large datasets may not be
feasible due to computational limitations or high data labelling costs.
Therefore, suitably selecting small training sets from large pools of
unlabelled data points is essential to maximize model performance while
maintaining efficiency. In this work, we study Farthest Point Sampling (FPS), a
data selection approach that aims to minimize the fill distance of the selected
set. We derive an upper bound for the maximum expected prediction error,
conditional to the location of the unlabelled data points, that linearly
depends on the training set fill distance. For empirical validation, we perform
experiments using two regression models on three datasets. We empirically show
that selecting a training set by aiming to minimize the fill distance, thereby
minimizing our derived bound, significantly reduces the maximum prediction
error of various regression models, outperforming alternative sampling
approaches by a large margin. Furthermore, we show that selecting training sets
with the FPS can also increase model stability for the specific case of
Gaussian kernel regression approaches.
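Farthest Point Sampling itself is a short greedy loop: repeatedly add the point farthest from the set selected so far, which greedily shrinks the fill distance the paper analyses. A minimal sketch (the seeding and distance choices below are mine, not fixed by the paper):

```python
import math

def farthest_point_sampling(points, k, dist=math.dist):
    """Greedy FPS: select k indices by repeatedly taking the point with the
    largest distance to its nearest already-selected point."""
    selected = [0]                     # seed with the first point (a choice)
    min_d = [dist(p, points[0]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = dist(p, points[nxt])   # update nearest-selected distances
            if d < min_d[i]:
                min_d[i] = d
    return selected

# Demo: 11 points on a line; FPS picks the endpoints, then the midpoint.
pts = [(float(i),) for i in range(11)]
chosen = farthest_point_sampling(pts, 3)
```

Maintaining `min_d` incrementally keeps each iteration linear in the pool size, so the whole selection costs O(n * k) distance evaluations.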
Uncertainty Quantification Using Neural Networks for Molecular Property Prediction
Uncertainty quantification (UQ) is an important component of molecular
property prediction, particularly for drug discovery applications where model
predictions direct experimental design and where unanticipated imprecision
wastes valuable time and resources. The need for UQ is especially acute for
neural models, which are becoming increasingly standard yet are challenging to
interpret. While several approaches to UQ have been proposed in the literature,
there is no clear consensus on the comparative performance of these models. In
this paper, we study this question in the context of regression tasks. We
systematically evaluate several methods on five benchmark datasets using
multiple complementary performance metrics. Our experiments show that none of
the methods we tested is unequivocally superior to all others, and none
produces a particularly reliable ranking of errors across multiple datasets.
While we believe these results show that existing UQ methods are not sufficient
for all common use-cases and demonstrate the benefits of further research, we
conclude with a practical recommendation as to which existing techniques seem
to perform well relative to others.
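One of the UQ families that comparisons like this typically include is the bootstrap ensemble, where the spread of member predictions serves as the uncertainty estimate. A minimal sketch with a linear base model (the setup is illustrative, not the paper's benchmark):

```python
import random
import statistics

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b with scalar inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def ensemble_predict(xs, ys, x_star, n_members=50, seed=0):
    """Bootstrap ensemble: return (mean prediction, spread-as-uncertainty)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_members):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        a, b = fit_linear([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_star + b)
    return statistics.mean(preds), statistics.stdev(preds)

# Noisy linear data: the ensemble recovers the trend and reports a spread.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]
mean, spread = ensemble_predict(xs, ys, 10.0)
```

Whether such spreads actually rank errors reliably is exactly the empirical question the abstract reports mixed results on; the mechanism itself is this simple.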
Ab initio machine learning in chemical compound space
Chemical compound space (CCS), the set of all theoretically conceivable
combinations of chemical elements and (meta-)stable geometries that make up
matter, is colossal. The first principles based virtual sampling of this space,
for example in search of novel molecules or materials which exhibit desirable
properties, is therefore prohibitive for all but the smallest sub-sets and
simplest properties. We review studies aimed at tackling this challenge using
modern machine learning techniques based on (i) synthetic data, typically
generated using quantum mechanics based methods, and (ii) model architectures
inspired by quantum mechanics. Such Quantum mechanics based Machine Learning
(QML) approaches combine the numerical efficiency of statistical surrogate
models with an {\em ab initio} view on matter. They rigorously reflect the
underlying physics in order to reach universality and transferability across
CCS. While state-of-the-art approximations to quantum problems impose severe
computational bottlenecks, recent QML based developments indicate the
possibility of substantial acceleration without sacrificing the predictive
power of quantum mechanics.
On the Interplay of Subset Selection and Informed Graph Neural Networks
Machine learning techniques paired with the availability of massive datasets
dramatically enhance our ability to explore the chemical compound space by
providing fast and accurate predictions of molecular properties. However,
learning on large datasets is strongly limited by the availability of
computational resources and can be infeasible in some scenarios. Moreover, the
instances in the datasets may not yet be labelled and generating the labels can
be costly, as in the case of quantum chemistry computations. Thus, there is a
need to select small training subsets from large pools of unlabelled data
points and to develop reliable ML methods that can effectively learn from small
training sets. This work focuses on predicting the atomization energy of
molecules in the QM9 dataset. We investigate the advantages of employing domain
knowledge-based data sampling methods for an efficient training set selection
combined with informed ML techniques. In particular, we show how maximizing
molecular diversity in the training set selection process increases the
robustness of linear and nonlinear regression techniques such as kernel methods
and graph neural networks. We also check the reliability of the predictions
made by the graph neural network with a model-agnostic explainer based on the
rate distortion explanation framework.