
    Prediction of Atomization Energy Using Graph Kernel and Active Learning

    Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaption, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. To apply the marginalized graph kernel, a spatial adjacency rule is first employed to convert molecules into graphs whose vertices and edges are labeled by elements and interatomic distances, respectively. We then derive formulas for the efficient evaluation of the kernel. Specific functional components for the marginalized graph kernel are proposed, and the effects of the associated hyperparameters on accuracy and predictive confidence are examined. We show that the graph kernel is particularly suitable for predicting extensive properties because its convolutional structure coincides with that of the covariance formula between sums of random variables. Using an active learning procedure, we demonstrate that the proposed method can achieve a mean absolute error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7 data set.
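The link between a convolutional kernel and extensive properties can be illustrated with a small sketch. It is not the marginalized graph kernel from the abstract: it assumes a much simpler stand-in, a sum of a hypothetical base kernel `atom_kernel` over all atom pairs, where each atom is an `(element, feature)` tuple invented for illustration. The sum over pairs mirrors the identity Cov(ΣXᵢ, ΣYⱼ) = ΣᵢΣⱼ Cov(Xᵢ, Yⱼ), and the resulting kernel matrix is then used in plain Gaussian process regression.

```python
import numpy as np

def atom_kernel(a, b, length_scale=1.0):
    """Hypothetical base kernel between two atoms.

    Each atom is an (element, feature) pair; both the representation and
    the kernel form are assumptions made for this sketch only.
    """
    (elem_a, feat_a), (elem_b, feat_b) = a, b
    if elem_a != elem_b:
        return 0.0  # different elements are treated as uncorrelated
    return float(np.exp(-0.5 * ((feat_a - feat_b) / length_scale) ** 2))

def molecule_kernel(mol_a, mol_b):
    """Sum of base kernels over all atom pairs (a simple convolution kernel).

    This has the same shape as the covariance between two sums of
    random variables, one variable per atom.
    """
    return sum(atom_kernel(a, b) for a in mol_a for b in mol_b)

def gpr_predict(train_mols, y_train, test_mols, noise=1e-6):
    """Posterior mean of Gaussian process regression with the kernel above."""
    K = np.array([[molecule_kernel(a, b) for b in train_mols] for a in train_mols])
    K_star = np.array([[molecule_kernel(t, b) for b in train_mols] for t in test_mols])
    y = np.asarray(y_train, dtype=float)
    # Solve (K + noise * I) alpha = y instead of forming an explicit inverse.
    alpha = np.linalg.solve(K + noise * np.eye(len(train_mols)), y)
    return K_star @ alpha
```

Because the kernel is a sum over atoms, it is additive when a molecule is split into fragments: `molecule_kernel(A + B, C)` equals `molecule_kernel(A, C) + molecule_kernel(B, C)` for atom lists `A`, `B`, `C`. That additivity is what makes this family of kernels a natural fit for size-extensive targets such as atomization energy.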

    Smoothing Hazard Functions and Time-Varying Effects in Discrete Duration and Competing Risks Models

    State space or dynamic approaches to discrete or grouped duration data with competing risks or multiple terminating events allow simultaneous modelling and smooth estimation of hazard functions and time-varying effects in a flexible way. Full Bayesian or posterior mean estimation, using numerical integration techniques or Monte Carlo methods, can become computationally demanding or even infeasible for higher dimensions and larger data sets. Therefore, building on previous work on filtering and smoothing for multicategorical time series and longitudinal data, our approach uses posterior mode estimation. This means maximizing posterior densities or, equivalently, a penalized likelihood, which enforces smoothness of hazard functions and time-varying effects through a roughness penalty. Dropping the Bayesian smoothness prior and adopting a nonparametric viewpoint, one might also start directly from maximizing this penalized likelihood. We show how Fisher scoring smoothing iterations can be carried out efficiently by iteratively applying linear Kalman filtering and smoothing to a working model. This algorithm can be combined with an EM-type procedure to estimate unknown smoothing parameters or hyperparameters. The methods are applied to a large set of unemployment duration data with one and, in a further analysis, multiple terminating events from the German Socio-Economic Panel (GSOEP).
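The penalized likelihood described in the abstract can be made concrete with a minimal sketch. Everything specific here is an assumption, not the paper's model: a single terminating event, no covariates, a logit link for the discrete hazard, and a first-order random walk penalty with variance `q`. The function name `penalized_loglik` and its arguments are hypothetical. It only evaluates the criterion; the Fisher scoring and Kalman smoothing iterations that maximize it are not shown.

```python
import numpy as np

def penalized_loglik(beta, events, at_risk, q=1.0):
    """Penalized log-likelihood for a discrete-time hazard (sketch).

    beta[t]    : log-odds of the hazard in interval t (assumed logit link)
    events[t]  : number of terminations observed in interval t
    at_risk[t] : number of subjects still at risk at the start of interval t
    q          : assumed random-walk variance; smaller q means more smoothing
    """
    beta = np.asarray(beta, dtype=float)
    events = np.asarray(events, dtype=float)
    at_risk = np.asarray(at_risk, dtype=float)

    # Numerically stable log h_t and log(1 - h_t) for h_t = 1 / (1 + exp(-beta_t)).
    log_hazard = -np.logaddexp(0.0, -beta)
    log_survive = -np.logaddexp(0.0, beta)
    loglik = np.sum(events * log_hazard + (at_risk - events) * log_survive)

    # Roughness penalty on first differences of beta (first-order random walk).
    penalty = np.sum(np.diff(beta) ** 2) / (2.0 * q)
    return loglik - penalty
```

Under these assumptions a constant `beta` incurs no penalty, while a path that jumps between intervals is penalized in proportion to its squared first differences, which is how the criterion trades fit against smoothness of the hazard.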