Combining predictions from linear models when training and test inputs differ
Methods for combining predictions from different models in a supervised
learning setting must somehow estimate/predict the quality of a model's
predictions at unknown future inputs. Many of these methods (often implicitly)
make the assumption that the test inputs are identical to the training inputs,
which is seldom reasonable. Because they fail to take into account that
prediction will generally be harder for test inputs that did not occur in the
training set, these methods tend to select overly complex models. Based on a novel,
unbiased expression for KL divergence, we propose XAIC and its special case
FAIC as versions of AIC intended for prediction that use different degrees of
knowledge of the test inputs. Both methods substantially differ from and may
outperform all the known versions of AIC even when the training and test inputs
are iid, and are especially useful for deterministic inputs and under covariate
shift. Our experiments on linear models suggest that if the test and training
inputs differ substantially, then XAIC and FAIC predictively outperform AIC,
BIC and several other methods including Bayesian model averaging.
Comment: 12 pages, 2 figures. To appear in Proceedings of the 30th Conference
on Uncertainty in Artificial Intelligence (UAI2014). This version includes
the supplementary material (regularity assumptions, proofs).
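For orientation, here is a minimal sketch of the kind of AIC-based linear model selection this paper benchmarks against: fitting polynomial regressions of increasing degree and scoring each by classical AIC. The XAIC/FAIC criteria themselves additionally exploit (partial) knowledge of the test inputs; the code below implements only the standard baseline, and all names are illustrative.

```python
import numpy as np

def gaussian_aic(X, y):
    """Classical AIC for a linear-Gaussian model fit by least squares:
    -2 * (maximized log-likelihood) + 2 * (free parameters), counting
    the k coefficients plus the noise variance."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)  # ML estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * (k + 1)

# Toy example: choose a polynomial degree by AIC.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x + 0.3 * rng.normal(size=50)  # true model has degree 1
for degree in range(1, 6):
    X = np.vander(x, degree + 1)  # columns x^degree, ..., x, 1
    print(degree, round(gaussian_aic(X, y), 2))
```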
Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It
We empirically show that Bayesian inference can be inconsistent under
misspecification in simple linear regression problems, both in a model
averaging/selection and in a Bayesian ridge regression setting. We use the
standard linear model, which assumes homoskedasticity, whereas the data are
heteroskedastic, and observe that the posterior puts its mass on ever more
high-dimensional models as the sample size increases. To remedy the problem, we
equip the likelihood in Bayes' theorem with an exponent called the learning
rate, and we propose the Safe Bayesian method to learn the learning rate from
the data. SafeBayes tends to select small learning rates as soon the standard
posterior is not `cumulatively concentrated', and its results on our data are
quite encouraging.Comment: 70 pages, 20 figure
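In symbols, the generalized posterior at the heart of the proposal can be written as follows (a standard tempered-Bayes formulation; the notation here is ours, not taken from the paper):

```latex
% Generalized (tempered) posterior with learning rate \eta:
\pi_\eta(\theta \mid x_1, \dots, x_n)
  \;\propto\;
  \pi(\theta)\,\prod_{i=1}^{n} p_\theta(x_i)^{\eta}
% \eta = 1 recovers standard Bayes; \eta < 1 downweights the likelihood,
% which regularizes more strongly under misspecification.
```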
Better predictions when models are wrong or underspecified
Many statistical methods rely on models of reality in order to learn from data and to make predictions about future data. By necessity, these models usually do not match reality exactly, but are either wrong (none of the hypotheses in the model provides an accurate description of reality) or underspecified (the hypotheses in the model describe only part of the data). In this thesis, we discuss three scenarios involving models that are wrong or underspecified. In each case, we find that standard statistical methods may fail, sometimes dramatically, and present different methods that continue to perform well even if the models are wrong or underspecified. The first two of these scenarios involve regression problems and investigate AIC (Akaike's Information Criterion) and Bayesian statistics. The third scenario has the famous Monty Hall problem as a special case, and considers the question of how we can update our belief about an unknown outcome given new evidence when the precise relation between outcome and evidence is unknown.
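Since the Monty Hall problem is named as a special case, a quick enumeration of the standard version may help fix ideas; the thesis's point is precisely that the answer changes when the host's protocol (the relation between outcome and evidence) is not known. The protocol assumed below is the textbook one: the host always opens a non-chosen door hiding a goat.

```python
from fractions import Fraction

# Car uniform over doors 1-3; player picks door 1; host opens a goat
# door other than door 1, choosing uniformly when two are available.
joint = {}  # (car location, door opened) -> probability
for car in (1, 2, 3):
    for opened in (2, 3):
        if opened == car:
            continue  # host never reveals the car
        p_open = Fraction(1, 2) if car == 1 else Fraction(1)
        joint[(car, opened)] = Fraction(1, 3) * p_open

# Posterior over the car's location given that the host opened door 3:
evidence = sum(p for (car, opened), p in joint.items() if opened == 3)
posterior = {car: joint.get((car, 3), 0) / evidence for car in (1, 2)}
print(posterior)  # {1: 1/3, 2: 2/3}: switching doubles the win chance
```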
Adapting AIC to conditional model selection
In statistical settings such as regression and time series, we can
condition on observed information when predicting the data of
interest. For example, a regression model explains the dependent
variable $Y$ in terms of the independent variable $X$. When we ask
such a model to predict the value of $Y$ corresponding to some given
value $x$ of $X$, that prediction's accuracy will vary with $x$.
Existing methods for
model selection do not take this variability into account, which
often causes them to select inferior models.
One widely used method for model selection is AIC (Akaike's
Information Criterion), which is based on estimates of
the KL divergence from the true distribution to each model. We
propose an adaptation of AIC that takes the observed information
into account when estimating the KL divergence, thereby getting rid
of a bias in AIC's estimate.
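For reference, the classical criterion being adapted has the familiar form below (standard AIC, not this paper's modified estimator):

```latex
% Classical AIC for a model with k free parameters and maximized
% likelihood \hat{L}:
\mathrm{AIC} \;=\; -2 \log \hat{L} \;+\; 2k
% Up to constants, this is an asymptotically unbiased estimate of the
% expected KL divergence from the true distribution to the fitted
% model. The adaptation proposed here conditions that estimate on the
% observed inputs rather than averaging over them.
```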
Graphical Representations for Algebraic Constraints of Linear Structural Equations Models
The observational characteristics of a linear structural equation model can
be effectively described by polynomial constraints on the observed covariance
matrix. However, these polynomials can be exponentially large, making them
impractical for many purposes. In this paper, we present a graphical notation
for many of these polynomial constraints. The expressive power of this notation
is investigated both theoretically and empirically.
Comment: To appear in the proceedings of the 11th International Conference on
Probabilistic Graphical Models (PGM 2022).
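A classic instance of such a polynomial constraint is the tetrad: if four observed variables share a single latent cause, the observed covariances satisfy sigma_12 * sigma_34 - sigma_13 * sigma_24 = 0. Below is a numerical check of this standard example (it is not the paper's notation or graphical calculus):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# One-factor linear SEM: latent L, observed X_i = b_i * L + noise.
L = rng.normal(size=n)
b = np.array([0.9, 0.7, 1.2, 0.5])
X = b[:, None] * L + 0.4 * rng.normal(size=(4, n))

S = np.cov(X)  # 4 x 4 sample covariance of the observed variables
tetrad = S[0, 1] * S[2, 3] - S[0, 2] * S[1, 3]
print(round(tetrad, 4))  # approximately 0, as the latent model implies
```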
Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it
We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic (though, significantly, there are no outliers). As sample size increases, the posterior puts its mass on worse and worse models of ever higher dimension. This is caused by hypercompression, the phenomenon that the posterior puts its mass on distributions that have much larger KL divergence from the ground truth than their average, i.e. the Bayes predictive distribution. To remedy the problem, we equip the likelihood in Bayes' theorem with an exponent called the learning rate, and we propose the SafeBayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates, and regularizes more, as soon as hypercompression takes place. Its results on our data are quite encouraging.
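Here is a minimal sketch of the learning-rate idea on a toy Bernoulli model class. Note the simplification: the variant below scores each candidate eta by the cumulative log loss of its sequential predictions, whereas the paper's SafeBayes criterion uses the cumulative posterior-expected log loss; all names and the model grid are illustrative.

```python
import numpy as np

def tempered_posterior(cum_loglik, eta, prior):
    """Posterior over a parameter grid, likelihood raised to eta."""
    logpost = np.log(prior) + eta * cum_loglik
    w = np.exp(logpost - logpost.max())
    return w / w.sum()

def sequential_log_loss(x, thetas, eta, prior):
    """Cumulative log loss of the eta-posterior predictive, computed
    prequentially: predict each point before updating on it."""
    cum_loglik = np.zeros(len(thetas))
    loss = 0.0
    for xi in x:
        w = tempered_posterior(cum_loglik, eta, prior)
        p = thetas if xi == 1 else 1 - thetas  # Bernoulli likelihoods
        loss -= np.log(w @ p)                  # predictive log loss
        cum_loglik += np.log(p)
    return loss

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=500)
thetas = np.linspace(0.01, 0.99, 99)             # Bernoulli parameter grid
prior = np.full(len(thetas), 1.0 / len(thetas))  # uniform prior
losses = {eta: sequential_log_loss(x, thetas, eta, prior)
          for eta in (0.25, 0.5, 1.0)}
print(min(losses, key=losses.get))  # eta with the best sequential loss
```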
Causal Entropy and Information Gain for Measuring Causal Control
Artificial intelligence models and methods commonly lack causal
interpretability. Despite advances in interpretable machine learning
(IML), these methods frequently assign importance to features that lack causal
influence on the outcome variable. Selecting causally relevant features among
those identified as relevant by these methods, or even before model training,
would offer a solution. Feature selection methods utilizing information
theoretical quantities have been successful in identifying statistically
relevant features. However, the information theoretical quantities they are
based on do not incorporate causality, rendering them unsuitable for such
scenarios. To address this challenge, this article proposes information
theoretical quantities that incorporate the causal structure of the system,
which can be used to evaluate causal importance of features for some given
outcome variable. Specifically, we introduce causal versions of entropy and
mutual information, termed causal entropy and causal information gain, which
are designed to assess how much control a feature provides over the outcome
variable. These newly defined quantities capture changes in the entropy of a
variable resulting from interventions on other variables. Fundamental results
connecting these quantities to the existence of causal effects are derived. The
use of causal information gain in feature selection is demonstrated,
highlighting its superiority over standard mutual information in revealing
which features provide control over a chosen outcome variable. Our
investigation paves the way for the development of methods with improved
interpretability in domains involving causation.
Comment: 16 pages. Accepted at the third XI-ML workshop of ECAI 2023. To
appear in the Springer CCIS book series.
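On a toy discrete structural causal model the quantities described here can be computed directly. The sketch below assumes the natural reading of the abstract: causal entropy of Y given do(X) averages the post-intervention entropy H(Y | do(X=x)) over a distribution on interventions, and causal information gain is the resulting entropy reduction; the precise definitions are in the paper.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy SCM: binary X -> Y with Y = X XOR noise, P(noise = 1) = 0.1.
def p_y_do_x(x, eps=0.1):
    """Distribution of Y after the intervention do(X = x)."""
    return np.array([1 - eps, eps]) if x == 0 else np.array([eps, 1 - eps])

pi = {0: 0.5, 1: 0.5}  # uniform distribution over interventions on X

# Causal entropy: average post-intervention entropy of Y.
h_causal = sum(pi[x] * entropy(p_y_do_x(x)) for x in (0, 1))

# Causal information gain: entropy of Y under the intervention mixture
# minus the causal entropy, i.e. the control X exerts on Y.
p_y = sum(pi[x] * p_y_do_x(x) for x in (0, 1))
gain = entropy(p_y) - h_causal
print(round(h_causal, 3), round(gain, 3))  # ~0.469 and ~0.531 bits
```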
Efficient algorithms for minimax decisions under tree-structured incompleteness
When decisions must be based on incomplete (coarsened) observations and the coarsening mechanism is unknown, a minimax approach offers the best guarantees on the decision maker's expected loss. Recent work has derived mathematical conditions characterizing minimax optimal decisions, but also found that computing such decisions is a difficult problem in general. This problem is equivalent to that of maximizing a certain conditional entropy expression. In this work, we present a highly efficient algorithm for the case where the coarsening mechanism can be represented by a tree, whose vertices are outcomes and whose edges are coarse observations.
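To make the setup concrete: the sketch below represents a small tree whose vertices are outcomes and whose edges are the coarse observations, and evaluates the conditional entropy H(outcome | observation) for one candidate randomized coarsening mechanism. This is only the objective from the maximization problem; the efficient algorithm that is the paper's contribution is not reproduced here, and all names are illustrative.

```python
import numpy as np
from collections import defaultdict

def conditional_entropy(p_outcome, mechanism):
    """H(outcome | observation) under a randomized coarsening mechanism.

    p_outcome: dict vertex -> probability of that outcome.
    mechanism: dict vertex -> dict edge -> prob of reporting that edge;
               each edge is a frozenset of the two vertices it connects.
    """
    joint = defaultdict(float)  # (edge, vertex) -> probability
    for v, pv in p_outcome.items():
        for e, q in mechanism[v].items():
            joint[(e, v)] += pv * q
    p_edge = defaultdict(float)
    for (e, _), p in joint.items():
        p_edge[e] += p
    return -sum(p * np.log2(p / p_edge[e])
                for (e, _), p in joint.items() if p > 0)

# Path graph a - b - c; the two edges are the coarse observations.
ab, bc = frozenset("ab"), frozenset("bc")
p_outcome = {"a": 0.5, "b": 0.25, "c": 0.25}
mechanism = {"a": {ab: 1.0},
             "b": {ab: 0.5, bc: 0.5},  # "b" may be reported either way
             "c": {bc: 1.0}}
print(round(conditional_entropy(p_outcome, mechanism), 3))
```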