Is this model reliable for everyone? Testing for strong calibration
In a well-calibrated risk prediction model, the average predicted probability
is close to the true event rate for any given subgroup. Such models are
reliable across heterogeneous populations and satisfy strong notions of
algorithmic fairness. However, the task of auditing a model for strong
calibration is well-known to be difficult -- particularly for machine learning
(ML) algorithms -- due to the sheer number of potential subgroups. As such,
common practice is to only assess calibration with respect to a few predefined
subgroups. Recent developments in goodness-of-fit testing offer potential
solutions but are not designed for settings with weak signal or where the
poorly calibrated subgroup is small, as they either overly subdivide the data
or fail to divide the data at all. We introduce a new testing procedure based
on the following insight: if we can reorder observations by their expected
residuals, there should be a change in the association between the predicted
and observed residuals along this sequence if a poorly calibrated subgroup
exists. This lets us reframe the problem of calibration testing into one of
changepoint detection, for which powerful methods already exist. We begin with
introducing a sample-splitting procedure where a portion of the data is used to
train a suite of candidate models for predicting the residual, and the
remaining data are used to perform a score-based cumulative sum (CUSUM) test.
To further improve power, we then extend this adaptive CUSUM test to
incorporate cross-validation, while maintaining Type I error control under
minimal assumptions. Compared to existing methods, the proposed procedure
consistently achieved higher power in simulation studies and more than doubled
the power when auditing a mortality risk prediction model.
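To make the reordering idea concrete, the sketch below illustrates the sample-splitting variant under simplifying assumptions that are not from the paper: a single candidate residual model (a gradient boosting regressor), a simple standardized max-CUSUM statistic, and a permutation null rather than the paper's score-based test or cross-validated extension.

```python
# Minimal sketch of a sample-splitting CUSUM-style calibration audit.
# Assumptions (hypothetical, not the paper's exact procedure): one candidate
# residual model, a standardized max-CUSUM statistic, and a permutation null.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def cusum_calibration_audit(X, p_hat, y, n_perm=500, seed=0):
    """Scan for a poorly calibrated subgroup by reordering held-out residuals."""
    rng = np.random.default_rng(seed)
    resid = y - p_hat                                  # observed residuals
    idx_tr, idx_te = train_test_split(np.arange(len(y)), test_size=0.5,
                                      random_state=seed)
    # Learn to predict residuals on one half of the data ...
    model = GradientBoostingRegressor().fit(X[idx_tr], resid[idx_tr])
    # ... then order the held-out half from most to least suspect.
    order = np.argsort(-model.predict(X[idx_te]))
    r = resid[idx_te][order]

    def max_cusum(x):
        s = np.cumsum(x - x.mean())
        return np.max(np.abs(s)) / (x.std() * np.sqrt(len(x)))

    observed = max_cusum(r)
    null = np.array([max_cusum(rng.permutation(r)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value
```

If a poorly calibrated subgroup exists, observations predicted to have large residuals cluster at the start of the ordering, so the cumulative sum drifts early and the changepoint statistic exceeds what the permutation null produces.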
M-HOF-Opt: Multi-Objective Hierarchical Output Feedback Optimization via Multiplier Induced Loss Landscape Scheduling
We address the online combinatorial choice of weight multipliers for
multi-objective optimization of many loss terms parameterized by neural networks
via a probabilistic graphical model (PGM) for the joint model parameter and
multiplier evolution process, with a hypervolume based likelihood promoting
multi-objective descent. The corresponding parameter and multiplier estimation
as a sequential decision process is then cast into an optimal control problem,
where the multi-objective descent goal is dispatched hierarchically into a
series of constraint optimization sub-problems. The subproblem constraint
automatically adapts itself according to Pareto dominance and serves as the
setpoint for the low level multiplier controller to schedule loss landscapes
via output feedback of each loss term. Our method is multiplier-free and
operates at the timescale of epochs, thereby saving substantial computational
resources compared with full-training-cycle multiplier tuning. It also
circumvents the excessive memory requirements and heavy computational burden of
existing multi-objective deep learning methods. We applied it to domain-invariant variational auto-encoding with six loss terms on the PACS domain generalization task and observed robust performance across a range of controller hyperparameters and multiplier initial conditions, outperforming other multiplier scheduling methods. We provide a modular implementation of our method that admits extension to custom definitions of many loss terms.
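The output-feedback idea can be illustrated with a deliberately simplified epoch-level controller. This is a hypothetical sketch only: the actual M-HOF-Opt controller is derived from the PGM and optimal-control formulation above, whereas here the setpoint is just the loss vector at the last Pareto-non-dominated epoch and the update is a fixed-gain multiplicative step.

```python
# Hypothetical illustration of epoch-level output-feedback multiplier scheduling.
# Not the paper's controller: only the setpoint-vs-observed-loss feedback loop is shown.
import numpy as np

class MultiplierController:
    def __init__(self, n_losses, gain=0.1):
        self.mult = np.ones(n_losses)   # loss-term weights, updated once per epoch
        self.setpoint = None            # per-loss targets from the last Pareto-improving epoch
        self.gain = gain

    def update(self, losses):
        losses = np.asarray(losses, dtype=float)
        if self.setpoint is None or np.all(losses <= self.setpoint):
            # Pareto improvement: accept the new loss vector as the setpoint.
            self.setpoint = losses.copy()
        else:
            # Output feedback: raise weights on losses above their setpoints,
            # relax weights on losses below them.
            err = losses - self.setpoint
            self.mult *= np.exp(self.gain * np.sign(err))
        return self.mult / self.mult.sum()   # normalized weights for the next epoch
```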
Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens
After a machine learning (ML)-based system is deployed, monitoring its
performance is important to ensure the safety and effectiveness of the
algorithm over time. When an ML algorithm interacts with its environment, the
algorithm can affect the data-generating mechanism and be a major source of
bias when evaluating its standalone performance, an issue known as
performativity. Although prior work has shown how to validate models in the
presence of performativity using causal inference techniques, there has been
little work on how to monitor models in the presence of performativity. Unlike
the setting of model validation, there is much less agreement on which
performance metrics to monitor. Different monitoring criteria impact how
interpretable the resulting test statistic is, what assumptions are needed for
identifiability, and the speed of detection. When this choice is further
coupled with the decision to use observational versus interventional data, ML
deployment teams are faced with a multitude of monitoring options. The aim of
this work is to highlight the relatively under-appreciated complexity of
designing a monitoring strategy and how causal reasoning can provide a
systematic framework for choosing between these options. As a motivating
example, we consider an ML-based risk prediction algorithm for predicting
unplanned readmissions. Bringing together tools from causal inference and
statistical process control, we consider six monitoring procedures (three
candidate monitoring criteria and two data sources) and investigate their
operating characteristics in simulation studies. Results from this case study
emphasize the seemingly simple (and obvious) fact that not all monitoring
systems are created equal, which has real-world impacts on the design and
documentation of ML monitoring systems.
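For readers unfamiliar with statistical process control, the fragment below shows the kind of chart underlying such monitoring procedures: a one-sided CUSUM over per-batch scores for a chosen criterion. It is a generic illustration under assumed parameter names (drift allowance, control limit); the paper's procedures additionally involve choosing among criteria and between observational and interventional data, and handling performativity via causal identification, none of which is shown here.

```python
# Generic one-sided CUSUM chart over per-batch performance scores
# (e.g. observed minus predicted risk). Illustrative only; the causal
# identification step needed under performativity is omitted.
import numpy as np

def cusum_monitor(scores, drift_allowance=0.05, control_limit=5.0):
    """Return the first batch index at which the chart signals, or None."""
    g = 0.0
    for t, s in enumerate(scores):
        g = max(0.0, g + s - drift_allowance)   # accumulate evidence of degradation
        if g > control_limit:
            return t
    return None
```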
Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees.
Objective: After deploying a clinical prediction model, subsequently collected data can be used to fine-tune its predictions and adapt to temporal shifts. Because model updating carries risks of over-updating/fitting, we study online methods with performance guarantees.
Materials and Methods: We introduce two procedures for continual recalibration or revision of an underlying prediction model: Bayesian logistic regression (BLR) and a Markov variant that explicitly models distribution shifts (MarBLR). We perform empirical evaluation via simulations and a real-world study predicting Chronic Obstructive Pulmonary Disease (COPD) risk. We derive "Type I and II" regret bounds, which guarantee the procedures are noninferior to a static model and competitive with an oracle logistic reviser in terms of the average loss.
Results: Both procedures consistently outperformed the static model and other online logistic revision methods. In simulations, the average estimated calibration index (aECI) of the original model was 0.828 (95% CI, 0.818-0.938). Online recalibration using BLR and MarBLR improved the aECI towards the ideal value of zero, attaining 0.265 (95% CI, 0.230-0.300) and 0.241 (95% CI, 0.216-0.266), respectively. When performing more extensive logistic model revisions, BLR and MarBLR increased the average area under the receiver-operating characteristic curve (aAUC) from 0.767 (95% CI, 0.765-0.769) to 0.800 (95% CI, 0.798-0.802) and 0.799 (95% CI, 0.797-0.801), respectively, in stationary settings and protected against substantial model decay. In the COPD study, BLR and MarBLR dynamically combined the original model with a continually refitted gradient boosted tree to achieve aAUCs of 0.924 (95% CI, 0.913-0.935) and 0.925 (95% CI, 0.914-0.935), compared to the static model's aAUC of 0.904 (95% CI, 0.892-0.916).
Discussion: Despite its simplicity, BLR is highly competitive with MarBLR. MarBLR outperforms BLR when its prior better reflects the data.
Conclusions: BLR and MarBLR can improve the transportability of clinical prediction models and maintain their performance over time.
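As a rough illustration of logistic recalibration on the logit scale, the sketch below performs per-batch MAP refits under a Gaussian prior centred at the identity recalibration (intercept 0, slope 1). These are simplifying assumptions, not the paper's BLR or MarBLR: the actual procedures maintain a posterior sequentially (with a Markov prior on drift in MarBLR) and carry the regret guarantees described above.

```python
# Minimal sketch of Bayesian-style logistic recalibration on the logit scale.
# Assumptions (hypothetical): Gaussian prior centred at the identity map and
# per-batch MAP refits; not the paper's sequential BLR/MarBLR posteriors.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

class LogisticRecalibrator:
    def __init__(self, prior_var=1.0):
        self.theta = np.array([0.0, 1.0])    # (intercept, slope) = identity recalibration
        self.prior_var = prior_var

    def _neg_log_post(self, theta, z, y):
        eta = theta[0] + theta[1] * z
        nll = np.sum(np.logaddexp(0.0, eta) - y * eta)             # logistic log-loss
        prior = np.sum((theta - np.array([0.0, 1.0])) ** 2) / (2 * self.prior_var)
        return nll + prior

    def update(self, p_orig, y):
        """Refit on a new batch of original-model probabilities and outcomes."""
        z = logit(np.clip(p_orig, 1e-6, 1 - 1e-6))
        self.theta = minimize(self._neg_log_post, self.theta, args=(z, y)).x
        return self

    def predict(self, p_orig):
        z = logit(np.clip(p_orig, 1e-6, 1 - 1e-6))
        return expit(self.theta[0] + self.theta[1] * z)
```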
Group SLOPE – Adaptive Selection of Groups of Predictors
Sorted L-One Penalized Estimation (SLOPE; Bogdan et al. 2013, 2015) is a relatively new convex optimization procedure, which allows for adaptive selection of regressors under sparse high-dimensional designs. Here, we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, group SLOPE (gSLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with an implementation of our method is available on The Comprehensive R Archive Network.
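The gSLOPE penalty itself is simple to state: a sorted-L1 penalty applied to the (decreasingly sorted) norms of the coefficient groups. The sketch below evaluates it under one common convention for group-size scaling (multiplying each group norm by the square root of its size); the exact scaling and lambda sequence that yield the gFDR guarantee are those given in the paper and implemented in grpSLOPE, not reproduced here.

```python
# Minimal sketch of evaluating the group SLOPE penalty: sorted-L1 applied to
# scaled group norms. The sqrt(group size) scaling is an assumed convention,
# not necessarily the exact choice used by grpSLOPE.
import numpy as np

def gslope_penalty(beta, groups, lambdas):
    """beta: coefficient vector; groups: list of index arrays;
    lambdas: a non-increasing sequence, one value per group."""
    norms = np.array([np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups])
    norms_sorted = np.sort(norms)[::-1]      # largest group norm meets largest lambda
    return float(np.dot(np.sort(lambdas)[::-1], norms_sorted))
```

Because larger group norms are matched with larger penalty weights, the penalty is adaptive: groups that stand out most are penalized most heavily, which is what drives the gFDR control described above.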