Is this model reliable for everyone? Testing for strong calibration
In a well-calibrated risk prediction model, the average predicted probability
is close to the true event rate for any given subgroup. Such models are
reliable across heterogeneous populations and satisfy strong notions of
algorithmic fairness. However, the task of auditing a model for strong
calibration is well-known to be difficult -- particularly for machine learning
(ML) algorithms -- due to the sheer number of potential subgroups. As such,
common practice is to only assess calibration with respect to a few predefined
subgroups. Recent developments in goodness-of-fit testing offer potential
solutions but are not designed for settings with weak signal or where the
poorly calibrated subgroup is small, as they either overly subdivide the data
or fail to divide the data at all. We introduce a new testing procedure based
on the following insight: if we can reorder observations by their expected
residuals, there should be a change in the association between the predicted
and observed residuals along this sequence if a poorly calibrated subgroup
exists. This lets us reframe the problem of calibration testing into one of
changepoint detection, for which powerful methods already exist. We begin with
introducing a sample-splitting procedure where a portion of the data is used to
train a suite of candidate models for predicting the residual, and the
remaining data are used to perform a score-based cumulative sum (CUSUM) test.
To further improve power, we then extend this adaptive CUSUM test to
incorporate cross-validation, while maintaining Type I error control under
minimal assumptions. Compared to existing methods, the proposed procedure
consistently achieved higher power in simulation studies and more than doubled
the power when auditing a mortality risk prediction model.
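To make the reordering idea concrete, the sketch below illustrates the sample-splitting variant under simplifying assumptions that are not from the paper: a single candidate residual model (a gradient boosting regressor), a simple standardized max-CUSUM statistic, and a permutation null rather than the paper's score-based test or cross-validated extension.

```python
# Minimal sketch of a sample-splitting CUSUM-style calibration audit.
# Assumptions (hypothetical, not the paper's exact procedure): one candidate
# residual model, a standardized max-CUSUM statistic, and a permutation null.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def cusum_calibration_audit(X, p_hat, y, n_perm=500, seed=0):
    """Scan for a poorly calibrated subgroup by reordering held-out residuals."""
    rng = np.random.default_rng(seed)
    resid = y - p_hat                                  # observed residuals
    idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.5,
                                      random_state=seed)
    # Learn to predict residuals on one half of the data ...
    model = GradientBoostingRegressor().fit(X[idx_tr], resid[idx_tr])
    # ... then order the held-out half from most to least suspect.
    order = np.argsort(-model.predict(X[idx_te]))
    r = resid[idx_te][order]

    def max_cusum(x):
        s = np.cumsum(x - x.mean())
        return np.max(np.abs(s)) / (x.std() * np.sqrt(len(x)))

    observed = max_cusum(r)
    null = np.array([max_cusum(rng.permutation(r)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value
```

If a poorly calibrated subgroup exists, observations predicted to have large residuals cluster at the start of the ordering, so the cumulative sum drifts early and the changepoint statistic exceeds what the permutation null produces.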
M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling
We address the online combinatorial choice of weight multipliers for
multi-objective optimization of many loss terms parameterized by neural networks
via a probabilistic graphical model (PGM) for the joint model parameter and
multiplier evolution process, with a hypervolume based likelihood promoting
multi-objective descent. The corresponding parameter and multiplier estimation
as a sequential decision process is then cast into an optimal control problem,
where the multi-objective descent goal is dispatched hierarchically into a
series of constraint optimization sub-problems. The subproblem constraint
automatically adapts itself according to Pareto dominance and serves as the
setpoint for the low level multiplier controller to schedule loss landscapes
via output feedback of each loss term. Our method is multiplier-free and
operates at the timescale of epochs, thereby saving substantial computational
resources compared with full-training-cycle multiplier tuning. It also
circumvents the excessive memory requirements and heavy computational burden of
existing multi-objective deep learning methods. We applied it to domain-invariant variational auto-encoding with six loss terms on the PACS domain generalization task and observed robust performance across a range of controller hyperparameters and multiplier initial conditions, outperforming other multiplier scheduling methods. We provide a modular implementation of our method that admits extension to custom definitions of many loss terms.
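The output-feedback idea can be illustrated with a deliberately simplified epoch-level controller. This is a hypothetical sketch only: the actual M-HOF-Opt controller is derived from the PGM and optimal-control formulation above, whereas here the setpoint is just the loss vector at the last Pareto-non-dominated epoch and the update is a fixed-gain multiplicative step.

```python
# Hypothetical illustration of epoch-level output-feedback multiplier scheduling.
# Not the paper's controller: only the setpoint-vs-observed-loss feedback loop is shown.
import numpy as np

class MultiplierController:
    def __init__(self, n_losses, gain=0.1):
        self.mult = np.ones(n_losses)   # loss-term weights, updated once per epoch
        self.setpoint = None            # per-loss targets from the last Pareto-improving epoch
        self.gain = gain

    def update(self, losses):
        losses = np.asarray(losses, dtype=float)
        if self.setpoint is None or np.all(losses <= self.setpoint):
            # Pareto improvement: accept the new loss vector as the setpoint.
            self.setpoint = losses.copy()
        else:
            # Output feedback: raise weights on losses above their setpoints,
            # relax weights on losses below them.
            err = losses - self.setpoint
            self.mult *= np.exp(self.gain * np.sign(err))
        return self.mult / self.mult.sum()   # normalized weights for the next epoch
```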
Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
After a machine learning (ML)-based system is deployed, monitoring its
performance is important to ensure the safety and effectiveness of the
algorithm over time. When an ML algorithm interacts with its environment, the
algorithm can affect the data-generating mechanism and be a major source of
bias when evaluating its standalone performance, an issue known as
performativity. Although prior work has shown how to validate models in the
presence of performativity using causal inference techniques, there has been
little work on how to monitor models in the presence of performativity. Unlike
the setting of model validation, there is much less agreement on which
performance metrics to monitor. Different monitoring criteria impact how
interpretable the resulting test statistic is, what assumptions are needed for
identifiability, and the speed of detection. When this choice is further
coupled with the decision to use observational versus interventional data, ML
deployment teams are faced with a multitude of monitoring options. The aim of
this work is to highlight the relatively under-appreciated complexity of
designing a monitoring strategy and how causal reasoning can provide a
systematic framework for choosing between these options. As a motivating
example, we consider an ML-based risk prediction algorithm for predicting
unplanned readmissions. Bringing together tools from causal inference and
statistical process control, we consider six monitoring procedures (three
candidate monitoring criteria and two data sources) and investigate their
operating characteristics in simulation studies. Results from this case study
emphasize the seemingly simple (and obvious) fact that not all monitoring
systems are created equal, which has real-world impacts on the design and
documentation of ML monitoring systems.
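For readers unfamiliar with statistical process control, the fragment below shows the kind of chart underlying such monitoring procedures: a one-sided CUSUM over per-batch scores for a chosen criterion. It is a generic illustration under assumed parameter names (drift allowance, control limit); the paper's procedures additionally involve choosing among criteria and between observational and interventional data, and handling performativity via causal identification, none of which is shown here.

```python
# Generic one-sided CUSUM chart over per-batch performance scores
# (e.g. observed minus predicted risk). Illustrative only; the causal
# identification step needed under performativity is omitted.
import numpy as np

def cusum_monitor(scores, drift_allowance=0.05, control_limit=5.0):
    """Return the first batch index at which the chart signals, or None."""
    g = 0.0
    for t, s in enumerate(scores):
        g = max(0.0, g + s - drift_allowance)   # accumulate evidence of degradation
        if g > control_limit:
            return t
    return None
```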
Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees.
Objective: After deploying a clinical prediction model, subsequently collected data can be used to fine-tune its predictions and adapt to temporal shifts. Because model updating carries risks of over-updating/fitting, we study online methods with performance guarantees.
Materials and Methods: We introduce two procedures for continual recalibration or revision of an underlying prediction model: Bayesian logistic regression (BLR) and a Markov variant that explicitly models distribution shifts (MarBLR). We perform empirical evaluation via simulations and a real-world study predicting Chronic Obstructive Pulmonary Disease (COPD) risk. We derive "Type I and II" regret bounds, which guarantee the procedures are noninferior to a static model and competitive with an oracle logistic reviser in terms of the average loss.
Results: Both procedures consistently outperformed the static model and other online logistic revision methods. In simulations, the average estimated calibration index (aECI) of the original model was 0.828 (95% CI, 0.818-0.938). Online recalibration using BLR and MarBLR improved the aECI towards the ideal value of zero, attaining 0.265 (95% CI, 0.230-0.300) and 0.241 (95% CI, 0.216-0.266), respectively. When performing more extensive logistic model revisions, BLR and MarBLR increased the average area under the receiver-operating characteristic curve (aAUC) from 0.767 (95% CI, 0.765-0.769) to 0.800 (95% CI, 0.798-0.802) and 0.799 (95% CI, 0.797-0.801), respectively, in stationary settings and protected against substantial model decay. In the COPD study, BLR and MarBLR dynamically combined the original model with a continually refitted gradient boosted tree to achieve aAUCs of 0.924 (95% CI, 0.913-0.935) and 0.925 (95% CI, 0.914-0.935), compared to the static model's aAUC of 0.904 (95% CI, 0.892-0.916).
Discussion: Despite its simplicity, BLR is highly competitive with MarBLR. MarBLR outperforms BLR when its prior better reflects the data.
Conclusions: BLR and MarBLR can improve the transportability of clinical prediction models and maintain their performance over time.
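As a rough illustration of logistic recalibration on the logit scale, the sketch below performs per-batch MAP refits under a Gaussian prior centred at the identity recalibration (intercept 0, slope 1). These are simplifying assumptions, not the paper's BLR or MarBLR: the actual procedures maintain a posterior sequentially (with a Markov prior on drift in MarBLR) and carry the regret guarantees described above.

```python
# Minimal sketch of Bayesian-style logistic recalibration on the logit scale.
# Assumptions (hypothetical): Gaussian prior centred at the identity map and
# per-batch MAP refits; not the paper's sequential BLR/MarBLR posteriors.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

class LogisticRecalibrator:
    def __init__(self, prior_var=1.0):
        self.theta = np.array([0.0, 1.0])    # (intercept, slope) = identity recalibration
        self.prior_var = prior_var

    def _neg_log_post(self, theta, z, y):
        eta = theta[0] + theta[1] * z
        nll = np.sum(np.logaddexp(0.0, eta) - y * eta)             # logistic log-loss
        prior = np.sum((theta - np.array([0.0, 1.0])) ** 2) / (2 * self.prior_var)
        return nll + prior

    def update(self, p_orig, y):
        """Refit on a new batch of original-model probabilities and outcomes."""
        z = logit(np.clip(p_orig, 1e-6, 1 - 1e-6))
        self.theta = minimize(self._neg_log_post, self.theta, args=(z, y)).x
        return self

    def predict(self, p_orig):
        z = logit(np.clip(p_orig, 1e-6, 1 - 1e-6))
        return expit(self.theta[0] + self.theta[1] * z)
```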
Group SLOPE – Adaptive Selection of Groups of Predictors
Sorted L-One Penalized Estimation (SLOPE; Bogdan et al. 2013, 2015) is a relatively new convex optimization procedure, which allows for adaptive selection of regressors under sparse high-dimensional designs. Here, we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, group SLOPE (gSLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with an implementation of our method is available on The Comprehensive R Archive Network.
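The gSLOPE penalty itself is simple to state: a sorted-L1 penalty applied to the (decreasingly sorted) norms of the coefficient groups. The sketch below evaluates it under one common convention for group-size scaling (multiplying each group norm by the square root of its size); the exact scaling and lambda sequence that yield the gFDR guarantee are those given in the paper and implemented in grpSLOPE, not reproduced here.

```python
# Minimal sketch of evaluating the group SLOPE penalty: sorted-L1 applied to
# scaled group norms. The sqrt(group size) scaling is an assumed convention,
# not necessarily the exact choice used by grpSLOPE.
import numpy as np

def gslope_penalty(beta, groups, lambdas):
    """beta: coefficient vector; groups: list of index arrays;
    lambdas: a non-increasing sequence, one value per group."""
    norms = np.array([np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups])
    norms_sorted = np.sort(norms)[::-1]      # largest group norm meets largest lambda
    return float(np.dot(np.sort(lambdas)[::-1], norms_sorted))
```

Because larger group norms are matched with larger penalty weights, the penalty is adaptive: groups that stand out most are penalized most heavily, which is what drives the gFDR control described above.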