Feature Optimization for Atomistic Machine Learning Yields A Data-Driven Construction of the Periodic Table of the Elements
Machine-learning of atomic-scale properties amounts to extracting
correlations between structure, composition and the quantity that one wants to
predict. Representing the input structure in a way that best reflects such
correlations makes it possible to improve the accuracy of the model for a given
amount of reference data. When using a description of the structures that is
transparent and well-principled, optimizing the representation might reveal
insights into the chemistry of the data set. Here we show how one can
generalize the SOAP kernel to introduce a distance-dependent weight that
accounts for the multi-scale nature of the interactions, and a description of
correlations between chemical species. We show that this substantially improves
the performance of ML models of molecular and materials stability, while making
it easier to work with complex, multi-component systems and to extend SOAP to
coarse-grained intermolecular potentials. The element correlations that give
the best performing model show striking similarities with the conventional
periodic table of the elements, providing an inspiring example of how machine
learning can rediscover, and generalize, intuitive concepts that constitute the
foundations of chemistry.
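The two ingredients described above, a distance-dependent weight and a coupling between chemical species, can be illustrated with a minimal sketch in Python. The functional form of the radial scaling and the idea of reading element correlations off low-dimensional element embeddings are assumptions chosen here for illustration, not the exact choices made in the paper.

import numpy as np

# Assumed radial scaling: damps the contribution of atoms at distance r to
# reflect the multi-scale nature of interactions (r0 and m are placeholders).
def radial_weight(r, r0=2.0, m=4):
    return 1.0 / (1.0 + (r / r0) ** m)

# Assumed species coupling: the correlation between elements a and b is taken
# as the inner product of learned, low-dimensional element embeddings.
def species_coupling(embeddings, a, b):
    return float(np.dot(embeddings[a], embeddings[b]))

# Toy two-dimensional embeddings for a few elements, for illustration only.
embeddings = {"H": np.array([1.0, 0.0]),
              "C": np.array([0.2, 0.9]),
              "O": np.array([0.1, 1.0])}
print(radial_weight(3.0), species_coupling(embeddings, "C", "O"))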
Band gap prediction for large organic crystal structures with machine learning
Machine-learning models are capable of capturing the structure-property
relationship from a dataset of computationally demanding ab initio
calculations. Over the past two years, the Organic Materials Database (OMDB)
has hosted a growing number of calculated electronic properties of previously
synthesized organic crystal structures. The complexity of the organic crystals
contained within the OMDB, which have on average 82 atoms per unit cell, makes
this database a challenging platform for machine learning applications. In this
paper, the focus is on predicting the band gap, which represents one of the
basic properties of crystalline materials. With this aim, a consistent
dataset of 12 500 crystal structures and their corresponding DFT band gaps is
released, freely available for download at https://omdb.mathub.io/dataset. An
ensemble of two state-of-the-art models reaches a mean absolute error (MAE) of
0.388 eV, which corresponds to a percentage error of 13% for an average band
gap of 3.05 eV. Finally, the trained models are employed to predict the band
gap for 260 092 materials contained within the Crystallography Open Database
(COD) and made available online so that the predictions can be obtained for any
arbitrary crystal structure uploaded by a user.
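As a sanity check on the numbers quoted above, a minimal sketch of how such an ensemble error could be evaluated is given below; the abstract does not state how the two models are combined, so a simple average of their predictions is assumed here.

import numpy as np

def ensemble_band_gap_error(pred_a, pred_b, dft_gaps):
    """MAE and relative error of a two-model ensemble (simple average assumed)."""
    pred = 0.5 * (np.asarray(pred_a) + np.asarray(pred_b))
    dft_gaps = np.asarray(dft_gaps)
    mae = np.mean(np.abs(pred - dft_gaps))   # e.g. 0.388 eV
    rel = mae / np.mean(dft_gaps)            # 0.388 eV / 3.05 eV is roughly 13%
    return mae, rel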
Atom-Density Representations for Machine Learning
The applications of machine learning techniques to chemistry and materials
science become more numerous by the day. The main challenge is to devise
representations of atomic systems that are at the same time complete and
concise, so as to reduce the number of reference calculations that are needed
to predict the properties of different types of materials reliably. This has
led to a proliferation of alternative ways to convert an atomic structure into
an input for a machine-learning model. We introduce an abstract definition of
chemical environments that is based on a smoothed atomic density, using a
bra-ket notation to emphasize basis set independence and to highlight the
connections with some popular choices of representations for describing atomic
systems. The correlations between the spatial distribution of atoms and their
chemical identities are computed as inner products between these feature kets,
which can be given an explicit representation in terms of the expansion of the
atom density on orthogonal basis functions, which is equivalent to the smooth
overlap of atomic positions (SOAP) power spectrum, but also in real space,
corresponding to n-body correlations of the atom density. This formalism lays
the foundations for a more systematic tuning of the behavior of the
representations, by introducing operators that represent the correlations
between structure, composition, and the target properties. It provides a
unifying picture of recent developments in the field and indicates a way
forward towards more effective and computationally affordable machine-learning
schemes for molecules and materials.
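For readers unfamiliar with the construction, the link between the smoothed atom density and the SOAP power spectrum mentioned above can be summarized, up to normalization and with notation chosen here for illustration, by the standard expansion

\rho_i(\mathbf{r}) = \sum_{j \in \mathrm{env}(i)} g(\mathbf{r}-\mathbf{r}_{ij})
                   = \sum_{nlm} c^{(i)}_{nlm}\, R_n(r)\, Y_{lm}(\hat{\mathbf{r}}),
\qquad
p^{(i)}_{nn'l} = \sum_{m} \bigl(c^{(i)}_{nlm}\bigr)^{*} c^{(i)}_{n'lm},

with the kernel between two environments A and B given by the inner product of the corresponding feature kets, k(A,B) = \langle A | B \rangle.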
Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning
Computational study of molecules and materials from first principles is a cornerstone of physics, chemistry and materials science, but is limited by the cost of accurate and precise simulations. In settings involving many simulations, machine learning can reduce these costs, sometimes by orders of magnitude, by interpolating between reference simulations. This requires representations that describe any molecule or material and support interpolation. We review, discuss and benchmark state-of-the-art representations and relations between them, including smooth overlap of atomic positions, many-body tensor representation, and symmetry functions. For this, we use a unified mathematical framework based on many-body functions, group averaging and tensor products, and compare energy predictions for organic molecules, binary alloys and Al-Ga-In sesquioxides in numerical experiments controlled for data distribution, regression method and hyper-parameter optimization.
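As a concrete example of one of the representation families benchmarked above, a minimal sketch of a radial symmetry function is given below; the hyper-parameter values are placeholders and the cosine cutoff form is assumed here for illustration.

import numpy as np

def cosine_cutoff(r, r_c=6.0):
    """Smoothly switches neighbour contributions off at the cutoff radius r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_symmetry_function(r_ij, eta=0.5, r_s=2.0, r_c=6.0):
    """Gaussian-weighted, cutoff-damped sum over the neighbour distances of one atom."""
    r_ij = np.asarray(r_ij)
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cosine_cutoff(r_ij, r_c)))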
Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning
Classical intermolecular potentials typically require an extensive
parametrization procedure for any new compound considered. To do away with
prior parametrization, we propose a combination of physics-based potentials
with machine learning (ML), coined IPML, which is transferable across small
neutral organic and biologically-relevant molecules. ML models provide
on-the-fly predictions for environment-dependent local atomic properties:
electrostatic multipole coefficients (with significantly reduced errors compared
to previous reports), the population and decay rate of valence atomic
densities, and polarizabilities across conformations and chemical compositions
of H, C, N, and O atoms. These parameters enable accurate calculations of
intermolecular contributions---electrostatics, charge penetration, repulsion,
induction/polarization, and many-body dispersion. Unlike other potentials, this
model is transferable in its ability to handle new molecules and conformations
without explicit prior parametrization: All local atomic properties are
predicted from ML, leaving only eight global parameters---optimized once and
for all across compounds. We validate IPML on various gas-phase dimers at and
away from equilibrium separation, where we obtain mean absolute errors between
0.4 and 0.7 kcal/mol for several chemically and conformationally diverse
datasets representative of non-covalent interactions in biologically-relevant
molecules. We further focus on hydrogen-bonded complexes---essential but
challenging due to their directional nature---where datasets of DNA base pairs
and amino acids yield an extremely encouraging 1.4 kcal/mol error. Finally, and
as a first look, we consider IPML in denser systems: water clusters,
supramolecular host-guest complexes, and the benzene crystal.
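The intermolecular contributions listed above add up to the total interaction energy; written out, in notation chosen here rather than taken from the paper,

E_{\mathrm{inter}} = E_{\mathrm{elst}} + E_{\mathrm{pen}} + E_{\mathrm{rep}} + E_{\mathrm{ind/pol}} + E_{\mathrm{MBD}},

where the local atomic parameters entering each term are predicted on the fly by the ML models and only the eight global parameters are fitted once across compounds.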
Prediction of Atomization Energy Using Graph Kernel and Active Learning
Data-driven prediction of molecular properties presents unique challenges to
the design of machine learning methods concerning data
structure/dimensionality, symmetry adaptation, and confidence management. In this
paper, we present a kernel-based pipeline that can learn and predict the
atomization energy of molecules with high accuracy. The framework employs
Gaussian process regression to perform predictions based on the similarity
between molecules, which is computed using the marginalized graph kernel. To
apply the marginalized graph kernel, a spatial adjacency rule is first employed
to convert molecules into graphs whose vertices and edges are labeled by
elements and interatomic distances, respectively. We then derive formulas for
the efficient evaluation of the kernel. Specific functional components for the
marginalized graph kernel are proposed, while the effect of the associated
hyperparameters on accuracy and predictive confidence is examined. We show
that the graph kernel is particularly suitable for predicting extensive
properties because its convolutional structure coincides with that of the
covariance formula between sums of random variables. Using an active learning
procedure, we demonstrate that the proposed method can achieve a mean absolute
error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7
data set.
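A minimal sketch of the molecule-to-graph conversion described above is given below; the cutoff-based adjacency rule and its parameter value are assumptions chosen for illustration, and the marginalized graph kernel and Gaussian process machinery are not reproduced here.

import numpy as np

def molecule_to_graph(elements, coords, cutoff=3.0):
    """Convert a molecule into a labeled graph.

    Vertices carry element labels and edges carry interatomic distances; a
    simple distance cutoff (assumed) decides which atom pairs are connected.
    """
    coords = np.asarray(coords)
    vertices = list(elements)
    edges = {}
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            d = float(np.linalg.norm(coords[i] - coords[j]))
            if d < cutoff:
                edges[(i, j)] = d
    return vertices, edges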