Uncertainty-Aware Mixed-Variable Machine Learning for Materials Design
Data-driven design shows the promise of accelerating materials discovery but
is challenging due to the prohibitive cost of searching the vast design space
of chemistry, structure, and synthesis methods. Bayesian Optimization (BO)
employs uncertainty-aware machine learning models to select promising designs
to evaluate, hence reducing the cost. However, BO with mixed numerical and
categorical variables, which is of particular interest in materials design, has
not been well studied. In this work, we survey frequentist and Bayesian
approaches to uncertainty quantification of machine learning with mixed
variables. We then conduct a systematic comparative study of their performances
in BO using a popular representative model from each group, the random
forest-based Lolo model (frequentist) and the latent variable Gaussian process
model (Bayesian). We examine the efficacy of the two models in the optimization
of mathematical functions, as well as properties of structural and functional
materials, where we observe performance differences as related to problem
dimensionality and complexity. By investigating the machine learning models'
predictive and uncertainty estimation capabilities, we provide interpretations
of the observed performance differences. Our results provide practical guidance
on choosing between frequentist and Bayesian uncertainty-aware machine learning
models for mixed-variable BO in materials design.
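The loop this abstract describes, an uncertainty-aware surrogate proposing promising designs one at a time, can be sketched in a few lines. Everything below is an illustrative stand-in: the objective and candidate pool are toy constructions, and a bootstrap ensemble of nearest-neighbour regressors plays the role of the frequentist uncertainty-aware surrogate (it is not the Lolo or LVGP model compared in the paper):

```python
import random

random.seed(0)

# Hypothetical mixed-variable objective: one categorical "chemistry" choice
# and one numeric processing variable, standing in for a materials property.
def objective(cat, x):
    offset = {"A": 0.0, "B": 0.5, "C": -0.3}[cat]
    return -(x - 0.6) ** 2 + offset

# Finite candidate pool over the mixed design space (x on a grid in [0, 1]).
candidates = [(c, x / 20.0) for c in "ABC" for x in range(21)]

def encode(cat, x):
    # One-hot encode the categorical variable alongside the numeric one.
    return [1.0 if cat == c else 0.0 for c in "ABC"] + [x]

def knn_predict(train, point, k=3):
    # Tiny k-nearest-neighbour regressor on the encoded space.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(encode(*d), point)), y)
        for d, y in train
    )
    return sum(y for _, y in dists[:k]) / min(k, len(dists))

def ensemble_mean_std(train, point, n_models=10):
    # Frequentist uncertainty: spread across bootstrap-resampled models.
    preds = []
    for _ in range(n_models):
        boot = [random.choice(train) for _ in train]
        preds.append(knn_predict(boot, point))
    mean = sum(preds) / len(preds)
    std = (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5
    return mean, std

# Seed with a few random designs, then pick the rest by upper confidence bound.
train = [(d, objective(*d)) for d in random.sample(candidates, 5)]
for _ in range(15):
    seen = [d for d, _ in train]

    def ucb(d):
        mean, std = ensemble_mean_std(train, encode(*d))
        return mean + std  # exploit the mean, explore where the spread is large

    nxt = max((d for d in candidates if d not in seen), key=ucb)
    train.append((nxt, objective(*nxt)))

best = max(train, key=lambda t: t[1])
print(best)  # best (design, value) pair found within the 20-evaluation budget
```

The point of the sketch is the budget: the loop evaluates only 20 of the 63 candidates, with the ensemble spread steering evaluations toward uncertain regions.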
Rapid Design of Top-Performing Metal-Organic Frameworks with Qualitative Representations of Building Blocks
Data-driven materials design often encounters challenges where systems
require or possess qualitative (categorical) information. Metal-organic
frameworks (MOFs) are an example of such material systems. The representation
of MOFs through different building blocks makes it a challenge for designers to
incorporate qualitative information into design optimization. Furthermore, the
large number of potential building blocks leads to a combinatorial challenge,
with millions of possible MOFs that could be explored through time-consuming
physics-based approaches. In this work, we integrated Latent Variable Gaussian
Process (LVGP) and Multi-Objective Batch-Bayesian Optimization (MOBBO) to
identify top-performing MOFs adaptively, autonomously, and efficiently without
any human intervention. Our approach provides three main advantages: (i) no
specific physical descriptors are required and only building blocks that
construct the MOFs are used in global optimization through qualitative
representations, (ii) the method is application and property independent, and
(iii) the latent variable approach provides an interpretable model of
qualitative building blocks with physical justification. To demonstrate the
effectiveness of our method, we considered a design space with more than 47,000
MOF candidates. By searching only ~1% of the design space, LVGP-MOBBO was able
to identify all MOFs on the Pareto front and more than 97% of the 50
top-performing designs for the CO2 working capacity and CO2/N2
selectivity properties. Finally, we compared our approach with the Random
Forest algorithm and demonstrated its efficiency, interpretability, and
robustness.

Comment: 35 pages total. The first 29 pages belong to the main manuscript and
the remaining 6 are the supplementary information. 13 figures total: 9 figures
are in the main manuscript and 4 figures are in the supplementary information.
1 table is in the supplementary information.
CGBayesNets: Conditional Gaussian Bayesian Network Learning and Inference with Mixed Discrete and Continuous Data
Bayesian Networks (BN) have been a popular predictive modeling formalism in bioinformatics, but their application in modern genomics has been slowed by an inability to cleanly handle domains with mixed discrete and continuous variables. Existing free BN software packages either discretize continuous variables, which can lead to information loss, or do not include inference routines, which makes prediction with the BN impossible. We present CGBayesNets, a BN package focused on prediction of a clinical phenotype from mixed discrete and continuous variables, which fills these gaps. CGBayesNets implements Bayesian likelihood and inference algorithms for the conditional Gaussian Bayesian network (CGBN) formalism, one appropriate for predicting an outcome of interest from, e.g., multimodal genomic data. We provide four different network learning algorithms, each making a different tradeoff between computational cost and network likelihood. CGBayesNets provides a full suite of functions for model exploration and verification, including cross-validation, bootstrapping, and AUC manipulation. We highlight several results obtained previously with CGBayesNets, including predictive models of wood properties from tree genomics, leukemia subtype classification from mixed genomic data, and robust prediction of intensive care unit mortality outcomes from metabolomic profiles. We also provide detailed example analysis on public metabolomic and gene expression datasets. CGBayesNets is implemented in MATLAB and available as MATLAB source code, under an open-source license and anonymous download at http://www.cgbayesnets.com
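The building block of the CGBN formalism described above is the conditional Gaussian node: a continuous child whose Gaussian mean is linear in its continuous parents, with parameters switched by the configuration of its discrete parents. A minimal sketch, with illustrative parameter values that are not from CGBayesNets itself:

```python
import math

# Discrete parent state -> (intercept, slope, sigma) for the continuous child.
# These numbers are placeholders; a CGBN learns them from data.
params = {
    "low":  (0.0, 1.0, 0.5),
    "high": (2.0, 0.5, 0.5),
}

def cg_logpdf(y, state, x, params):
    """Log-density of y given one discrete and one continuous parent."""
    b0, b1, sigma = params[state]
    mu = b0 + b1 * x
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def posterior(y, x, params):
    # Bayes-rule inference over the discrete state given (x, y),
    # assuming a uniform prior over states (log-sum-exp for stability).
    logs = {s: cg_logpdf(y, s, x, params) for s in params}
    m = max(logs.values())
    w = {s: math.exp(l - m) for s, l in logs.items()}
    z = sum(w.values())
    return {s: wi / z for s, wi in w.items()}

post = posterior(y=2.4, x=1.0, params=params)
print(max(post, key=post.get))  # "high": its mean 2.5 is closest to y = 2.4
```

This direction (discrete phenotype inferred from continuous measurements) is the prediction task the package targets; no discretization of x or y is needed.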
Fully Bayesian inference for latent variable Gaussian process models
Real engineering and scientific applications often involve one or more
qualitative inputs. Standard Gaussian processes (GPs), however, cannot directly
accommodate qualitative inputs. The recently introduced latent variable
Gaussian process (LVGP) overcomes this issue by first mapping each qualitative
factor to underlying latent variables (LVs), and then uses any standard GP
covariance function over these LVs. The LVs are estimated similarly to the
other GP hyperparameters through maximum likelihood estimation, and then
plugged into the prediction expressions. However, this plug-in approach will
not account for uncertainty in estimation of the LVs, which can be significant
especially with limited training data. In this work, we develop a fully
Bayesian approach for the LVGP model and for visualizing the effects of the
qualitative inputs via their LVs. We also develop approximations for scaling up
LVGPs and fully Bayesian inference for the LVGP hyperparameters. We conduct
numerical studies comparing plug-in inference against fully Bayesian inference
over a few engineering models and material design applications. In contrast to
previous studies on standard GP modeling that have largely concluded that a
fully Bayesian treatment offers limited improvements, our results show that for
LVGP modeling it offers significant improvements in prediction accuracy and
uncertainty quantification over the plug-in approach.
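The latent-variable mapping the abstract describes can be sketched directly: each level of a qualitative factor gets coordinates in a low-dimensional latent space, and a standard GP kernel is applied over the numeric inputs concatenated with those coordinates. The level names and latent coordinates below are placeholders; in an LVGP they are estimated by maximum likelihood (the plug-in approach) or sampled (the fully Bayesian treatment):

```python
import math

# Hypothetical 2-D latent coordinates for three levels of a qualitative factor.
latent = {
    "materialA": (0.0, 0.0),
    "materialB": (1.2, 0.1),
    "materialC": (1.1, 0.3),  # near B, so it behaves similarly under the GP
}

def rbf(u, v, lengthscale=1.0):
    # Standard squared-exponential covariance between two real vectors.
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2 * lengthscale ** 2))

def mixed_kernel(x1, cat1, x2, cat2):
    # Joint input = numeric features concatenated with latent coordinates,
    # so one ordinary GP kernel covers the mixed-variable input.
    return rbf(tuple(x1) + latent[cat1], tuple(x2) + latent[cat2])

# Levels mapped to nearby latent points get high covariance ...
k_bc = mixed_kernel([0.5], "materialB", [0.5], "materialC")
# ... while distant levels get low covariance.
k_ab = mixed_kernel([0.5], "materialA", [0.5], "materialB")
print(k_bc, k_ab)
```

The paper's point is about the estimated latent coordinates themselves: with little data their positions are uncertain, and the plug-in approach above treats them as fixed, which is what the fully Bayesian treatment corrects.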
Black-box Mixed-Variable Optimisation using a Surrogate Model that Satisfies Integer Constraints
A challenging problem in both engineering and computer science is that of
minimising a function for which we have no mathematical formulation available,
that is expensive to evaluate, and that contains continuous and integer
variables, for example in automatic algorithm configuration. Surrogate-based
algorithms are very suitable for this type of problem, but most existing
techniques are designed with only continuous or only discrete variables in
mind. Mixed-Variable ReLU-based Surrogate Modelling (MVRSM) is a
surrogate-based algorithm that uses a linear combination of rectified linear
units, defined in such a way that (local) optima satisfy the integer
constraints. This method outperforms the state of the art on several synthetic
benchmarks with up to 238 continuous and integer variables, and achieves
competitive performance on two real-life benchmarks: XGBoost hyperparameter
tuning and Electrostatic Precipitator optimisation.

Comment: Ann Math Artif Intell (2020).
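The integer-satisfying property claimed above can be illustrated directly: a linear combination of ReLU basis functions whose kinks sit only at integer locations is piecewise linear between integers, so its minima over an integer variable lie at the breakpoints and can be found by checking those alone. The weights below are arbitrary illustrative values, not the output of MVRSM's fitting procedure:

```python
def relu(t):
    return max(0.0, t)

# Integer breakpoints for a single integer decision variable on [0, 5].
knots = [0, 1, 2, 3, 4, 5]
weights = [1.5, -2.0, -1.0, 0.5, 2.0, 1.0]  # made-up surrogate weights
bias = 4.0

def surrogate(z):
    # Piecewise-linear in z, with kinks only at the integer knots, so every
    # local optimum on [0, 5] falls on an integer (or interval endpoint).
    return bias + sum(w * relu(z - k) for w, k in zip(weights, knots))

# Minimizing over the integer variable reduces to scanning the knots.
zmin = min(knots, key=surrogate)
print(zmin, surrogate(zmin))  # 4 2.5
```

Between consecutive integers the surrogate is linear, so no interior point of any unit interval can beat both of its integer endpoints; that is why no rounding step is needed when the surrogate is optimized.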