50 research outputs found
Supersparse Linear Integer Models for Optimized Medical Scoring Systems
Scoring systems are linear classification models that only require users to
add, subtract and multiply a few small numbers in order to make a prediction.
These models are in widespread use by the medical community, but are difficult
to learn from data because they need to be accurate and sparse, have coprime
integer coefficients, and satisfy multiple operational constraints. We present
a new method for creating data-driven scoring systems called a Supersparse
Linear Integer Model (SLIM). SLIM scoring systems are built by solving an
integer program that directly encodes measures of accuracy (the 0-1 loss) and
sparsity (the ℓ0-seminorm) while restricting coefficients to coprime
integers. SLIM can seamlessly incorporate a wide range of operational
constraints related to accuracy and sparsity, and can produce highly tailored
models without parameter tuning. We provide bounds on the testing and training
accuracy of SLIM scoring systems, and present a new data reduction technique
that can improve scalability by eliminating a portion of the training data
beforehand. Our paper includes results from a collaboration with the
Massachusetts General Hospital Sleep Laboratory, where SLIM was used to create
a highly tailored scoring system for sleep apnea screening.
Comment: This version reflects our findings on SLIM as of January 2016
(arXiv:1306.5860 and arXiv:1405.4047 are out-of-date). The final published
version of this article is available at http://www.springerlink.co
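For context, applying a scoring system of the kind SLIM produces takes only a few integer additions and a threshold comparison. The feature names, point values, and threshold in this sketch are invented for illustration, not taken from the paper:

```python
# A scoring system: a few small integer points per binary feature, plus a
# decision threshold. All values below are hypothetical.
SCORE_CARD = {
    "age_geq_60": 4,
    "bmi_geq_30": 2,
    "snoring": 2,
    "hypertension": 3,
}
THRESHOLD = 5  # predict positive if the total score exceeds the threshold

def predict(patient):
    """Add up the points for the features the patient has, then threshold."""
    score = sum(pts for feat, pts in SCORE_CARD.items() if patient.get(feat))
    return score, score > THRESHOLD

score, flagged = predict({"age_geq_60": True, "snoring": True})
print(score, flagged)  # 6 True -> flagged for screening
```

The appeal of such models is that a clinician can compute the prediction by hand; SLIM's contribution is learning the point values directly by integer programming rather than by rounding a logistic regression.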
The Markov chain Monte Carlo approach to importance sampling in stochastic programming
Thesis: S.M., Massachusetts Institute of Technology, Computation for Design and Optimization Program, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 85-87).
Stochastic programming models are large-scale optimization problems that are used to facilitate decision-making under uncertainty. Optimization algorithms for such problems need to evaluate the expected future costs of current decisions, often referred to as the recourse function. In practice, this calculation is computationally difficult, as it involves the evaluation of a multidimensional integral whose integrand is an optimization problem. Accordingly, the recourse function is estimated using quadrature rules or Monte Carlo methods. Although Monte Carlo methods present numerous computational benefits over quadrature rules, they require a large number of samples to produce accurate results when they are embedded in an optimization algorithm. We present an importance sampling framework for multistage stochastic programming that can produce accurate estimates of the recourse function using a fixed number of samples. Our framework uses Markov chain Monte Carlo and kernel density estimation algorithms to create a non-parametric importance sampling distribution that can form lower-variance estimates of the recourse function. We demonstrate the increased accuracy and efficiency of our approach using numerical experiments in which we solve variants of the newsvendor problem. Our results show that even a simple implementation of our framework produces highly accurate estimates of the optimal solution and optimal cost for stochastic programming models, especially those with high-variance, multimodal, or rare-event distributions.
by Berk Ustun. S.M.
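The variance-reduction idea can be sketched with plain importance sampling for a newsvendor-style expected cost. The Gaussian demand model, cost function, and hand-picked proposal below are illustrative stand-ins for the MCMC/KDE-constructed distribution developed in the thesis:

```python
import math
import random

random.seed(0)

def newsvendor_cost(order, demand, price=5.0, unit_cost=3.0):
    """Illustrative recourse cost: purchase cost minus revenue from sales."""
    return unit_cost * order - price * min(order, demand)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def is_estimate(order, n=20000, mu=100.0, sigma=20.0, q_mu=80.0, q_sigma=30.0):
    """Importance-sampling estimate of E[cost(order, D)] for D ~ N(mu, sigma):
    sample demand from a wider proposal q and reweight each cost by p(d)/q(d)."""
    total = 0.0
    for _ in range(n):
        d = random.gauss(q_mu, q_sigma)
        w = normal_pdf(d, mu, sigma) / normal_pdf(d, q_mu, q_sigma)
        total += w * newsvendor_cost(order, d)
    return total / n

estimate = is_estimate(order=100)  # unbiased estimate of the expected cost
```

A well-chosen proposal concentrates samples where the cost varies most; the thesis constructs such a proposal non-parametrically with MCMC and kernel density estimation rather than fixing it by hand.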
When Personalization Harms: Reconsidering the Use of Group Attributes in Prediction
Machine learning models are often personalized with categorical attributes
that are protected, sensitive, self-reported, or costly to acquire. In this
work, we show that models personalized with group attributes can reduce
performance at the group level. We propose formal conditions to ensure the
"fair use" of group attributes in prediction tasks, namely collective
preference guarantees, verifiable by training one additional model, that each
group that provides personal data receives a tailored gain in performance in
return.
We present sufficient conditions to ensure fair use in empirical risk
minimization and characterize failure modes that lead to fair use violations
due to standard practices in model development and deployment. We present a
comprehensive empirical study of fair use in clinical prediction tasks. Our
results demonstrate the prevalence of fair use violations in practice and
illustrate simple interventions to mitigate their harm.
Comment: ICML 2023 Ora
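The group-level comparison behind these conditions can be sketched in a few lines. The per-group accuracies below are made-up numbers, not results from the paper:

```python
# Hypothetical per-group accuracies for a model trained with the group
# attribute ("personalized") versus one trained without it ("generic").
acc_personalized = {"group_a": 0.84, "group_b": 0.71}
acc_generic = {"group_a": 0.80, "group_b": 0.76}

def fair_use_violations(personalized, generic):
    """A group suffers a fair use violation if reporting its attribute
    yields worse performance than withholding it would have."""
    return [g for g in personalized if personalized[g] < generic[g]]

print(fair_use_violations(acc_personalized, acc_generic))  # ['group_b']
```

Here group_b would be better served by the generic model, so personalizing with its group attribute violates fair use in this illustrative sense.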
Prediction without Preclusion: Recourse Verification with Reachable Sets
Machine learning models are often used to decide who will receive a loan, a
job interview, or a public benefit. Standard techniques to build these models
use features about people but overlook their actionability. In turn, models can
assign predictions that are fixed, meaning that consumers who are denied loans,
interviews, or benefits may be permanently locked out from access to credit,
employment, or assistance. In this work, we introduce a formal testing
procedure, which we call recourse verification, to flag models that assign
fixed predictions. We develop machinery to reliably determine if a given model can
provide recourse to its decision subjects from a set of user-specified
actionability constraints. We demonstrate how our tools can ensure recourse and
adversarial robustness in real-world datasets and use them to study the
infeasibility of recourse in real-world lending datasets. Our results highlight
how models can inadvertently assign fixed predictions that permanently bar
access, and we provide tools to design algorithms that account for
actionability when developing models.
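The verification idea can be illustrated by brute-force search over a user's feasible actions. The linear credit model, feature set, and actionability constraints below are all hypothetical:

```python
from itertools import product

# Hypothetical linear credit model: score > 0 means approval.
weights = {"income": 2.0, "n_accounts": 1.0, "age": 0.5}
bias = -10.0

# Actionability: each feature lists the changes the applicant could make.
# Immutable features (e.g., age) admit only the zero change.
actions = {"income": [0, 1, 2], "n_accounts": [0, 1], "age": [0]}

def predict(x):
    return sum(weights[f] * v for f, v in x.items()) + bias > 0

def has_recourse(x):
    """Return True if some feasible action profile flips a denial to approval."""
    if predict(x):
        return True  # already approved
    for deltas in product(*(actions[f] for f in x)):
        x2 = {f: x[f] + d for f, d in zip(x, deltas)}
        if predict(x2):
            return True
    return False

denied = {"income": 3, "n_accounts": 1, "age": 4}   # recourse exists
stuck = {"income": 0, "n_accounts": 0, "age": 4}    # fixed prediction
print(has_recourse(denied), has_recourse(stuck))  # True False
```

Enumerating actions is only tractable for tiny discrete action sets; the reachable-set machinery in the paper is what makes this certification reliable at realistic scale.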