Machine learning (ML) of quantum mechanical properties shows promise for
accelerating chemical discovery. For transition metal chemistry where accurate
calculations are computationally costly and available training data sets are
small, the molecular representation becomes a critical ingredient in ML model
predictive accuracy. We introduce a series of revised autocorrelation functions
(RACs) that encode relationships between the heuristic atomic properties (e.g.,
size, connectivity, and electronegativity) on a molecular graph. We alter the
starting point, scope, and nature of the quantities evaluated in standard ACs
to make these RACs amenable to inorganic chemistry. On an organic molecule set,
we first demonstrate superior standard AC performance to other
presently-available topological descriptors for ML model training, with mean
unsigned errors (MUEs) for atomization energies on set-aside test molecules as
low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs
on set-aside test molecules in spin-state splitting in comparison to 15-20x
higher errors from feature sets that encode whole-molecule structural
information. Systematic feature selection methods including univariate
filtering, recursive feature elimination, and direct optimization (e.g., random
forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4-5x
smaller than RAC-155 produce sub- to 1-kcal/mol spin-splitting MUEs, with good
transferability to metal-ligand bond length prediction (0.004-5 {\AA} MUE) and
redox potential on a smaller data set (0.2-0.3 eV MUE). Evaluation of feature
selection results across property sets reveals the relative importance of
local, electronic descriptors (e.g., electronegativity, atomic number) in
spin-splitting and distal, steric effects in redox potential and bond lengths.Comment: 43 double spaced pages, 11 figures, 4 table