14,476 research outputs found

Conformal Prediction for Federated Uncertainty Quantification Under Label Shift

Author: Makni Mehdi
Moulines Eric
Panov Maxim
Plassier Vincent
Rubashevskii Aleksandr
Publication date: 08/06/2023

Federated Learning (FL) is a machine learning framework where many clients collaboratively train models while keeping the training data decentralized. Despite recent advances in FL, uncertainty quantification (UQ) remains only partially addressed. Among UQ methods, conformal prediction (CP) approaches provide distribution-free guarantees under minimal assumptions. We develop a new federated conformal prediction method based on quantile regression that takes privacy constraints into account. This method uses importance weighting to address the label shift between agents and provides theoretical guarantees for both valid coverage of the prediction sets and differential privacy. Extensive experimental studies demonstrate that this method outperforms current competitors. Comment: ICML 202
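
As an illustration of how importance weighting can correct conformal prediction sets for label shift, here is a minimal, centralized sketch of weighted split conformal prediction. It is not the authors' federated, differentially private procedure. The function names, the score `1 - probability`, and the assumption that the label weights (target label frequency divided by calibration label frequency) are already known are all choices made for this example.

```python
import numpy as np

def weighted_quantile(scores, weights, test_weight, alpha):
    """(1 - alpha) quantile of the weighted calibration scores,
    with an extra point mass at +inf carrying the test point's weight."""
    order = np.argsort(scores)
    sorted_scores = scores[order]
    sorted_weights = weights[order]
    total = sorted_weights.sum() + test_weight
    cumulative = np.cumsum(sorted_weights) / total
    # First position where the weighted CDF reaches 1 - alpha.
    idx = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    return sorted_scores[idx] if idx < len(sorted_scores) else np.inf

def prediction_set(probs_x, cal_scores, cal_labels, label_weights, alpha=0.1):
    """Labels whose nonconformity score falls below the weighted quantile.

    probs_x       : predicted class probabilities for one test input
    cal_scores    : nonconformity scores of the calibration points
    cal_labels    : true labels of the calibration points
    label_weights : assumed-known ratio p_target(y) / p_calibration(y)
    """
    cal_weights = label_weights[cal_labels]
    selected = []
    for y in range(len(probs_x)):
        # Under label shift the test point's weight depends on the candidate label.
        q = weighted_quantile(cal_scores, cal_weights, label_weights[y], alpha)
        if 1.0 - probs_x[y] <= q:
            selected.append(y)
    return selected

cal_scores = np.array([0.1, 0.2, 0.3, 0.4])
cal_labels = np.array([0, 0, 1, 1])
label_weights = np.array([1.0, 1.0])  # no shift in this toy example
print(prediction_set(np.array([0.9, 0.1]), cal_scores, cal_labels, label_weights, alpha=0.5))
```

With all weights equal, the sketch reduces to ordinary split conformal prediction. In practice the label weights would have to be estimated, which this example does not attempt.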

arXiv.org e-Print Archive

Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

Author: Geambasu Roxana
Huang Tzu-Kuo
Lecuyer Mathias
Sen Siddhartha
Spahn Riley
Publication date: 21/05/2017

Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
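
For readers unfamiliar with count featurization, the following is a minimal sketch of the general idea: a categorical value is replaced by the label counts observed for it, so a model can later train on compact count features instead of the full raw history. This is not Pyramid's implementation. The `CountTable` class, its methods, and the binary-label assumption are hypothetical choices made for this example.

```python
from collections import defaultdict

class CountTable:
    """Per-value label counts for one categorical feature (binary labels assumed)."""

    def __init__(self):
        # value -> [number of negative labels, number of positive labels]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, value, label):
        """Record one observation; label must be 0 or 1."""
        self.counts[value][label] += 1

    def featurize(self, value):
        """Return [negatives, positives, positive rate] for a value.
        Unseen values map to zeros without being added to the table."""
        negatives, positives = self.counts.get(value, (0, 0))
        total = negatives + positives
        rate = positives / total if total else 0.0
        return [negatives, positives, rate]

table = CountTable()
for ad_id, clicked in [("ad1", 1), ("ad1", 1), ("ad1", 0), ("ad2", 0)]:
    table.update(ad_id, clicked)

print(table.featurize("ad1"))     # counts and click rate for a seen value
print(table.featurize("unseen"))  # zeros for a value never observed
```

Once the table is built, the raw observations are no longer needed to compute these features, which is the property that training set minimization relies on.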

arXiv.org e-Print Archive

Crossref

Use of record-linkage to handle non-response and improve alcohol consumption estimates in health survey data: a study protocol

Author: Gorman E.
Gray L.
Katikireddi S. V.
Leyland A. H.
McCartney G.
Rutherford L.
White I. R.
Publication venue: 'BMJ'
Publication date: 01/01/2013

Introduction: Reliable estimates of health-related behaviours, such as levels of alcohol consumption in the population, are required to formulate and evaluate policies. National surveys provide such data; validity depends on generalisability, but this is threatened by declining response levels. Attempts to address bias arising from non-response are typically limited to survey weights based on sociodemographic characteristics, which do not capture differential health and related behaviours within categories. This project aims to explore and address non-response bias in health surveys with a focus on alcohol consumption. Methods and analysis: The Scottish Health Surveys (SHeS) aim to provide estimates representative of the Scottish population living in private households. Survey data of consenting participants (92% of the achieved sample) have been record-linked to routine hospital admission (Scottish Morbidity Records (SMR)) and mortality (from National Records of Scotland (NRS)) data for surveys conducted in 1995, 1998, 2003, 2008, 2009 and 2010 (total adult sample size around 40 000), with maximum follow-up of 16 years. Also available are census information and SMR/NRS data for the general population. Comparisons of alcohol-related mortality and hospital admission rates in the linked SHeS-SMR/NRS with those in the general population will be made. Survey data will be augmented by quantification of differences to refine alcohol consumption estimates through the application of multiple imputation or inverse probability weighting. The resulting corrected estimates of population alcohol consumption will enable superior policy evaluation. An advanced weighting procedure will be developed for wider use. 
Ethics and dissemination: Ethics approval for SHeS has been given by the National Health Service (NHS) Multi-Centre Research Ethics Committee and use of linked data has been approved by the Privacy Advisory Committee to the Board of NHS National Services Scotland and Registrar General. Funding has been granted by the MRC. The outputs will include four or five public health and statistical methodological international journal and conference papers.
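
The inverse probability weighting mentioned in the protocol can be illustrated with a small sketch: each respondent is weighted by the inverse of their estimated probability of responding, so that groups who respond less often count for more. The function name, the toy values, and the response probabilities below are invented for illustration and are not survey data. The protocol does not specify how the probabilities would be estimated.

```python
def ipw_mean(values, response_probs):
    """Weighted mean in which each respondent's weight is 1 / P(response)."""
    if len(values) != len(response_probs):
        raise ValueError("values and response_probs must have the same length")
    if any(not 0.0 < p <= 1.0 for p in response_probs):
        raise ValueError("response probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in response_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy example: weekly alcohol units for two respondents (hypothetical numbers).
units = [10.0, 20.0]
probs = [0.5, 0.25]  # the heavier drinker is assumed less likely to respond
print(ipw_mean(units, probs))     # weighted estimate, pulled towards 20
print(sum(units) / len(units))    # unweighted mean for comparison
```

If heavier drinkers are less likely to respond, the weighted estimate exceeds the unweighted one, which is the direction of correction the protocol anticipates.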

What is wrong with non-respondents? Alcohol-, drug- and smoking related mortality and morbidity in a 12-year follow up study of respondents and non-respondents in the Danish Health and Morbidity Survey

Author: Christensen Anne Illemann
Ekholm Ola
Glümer Charlotte
Gray Linsay
Juel Knud
Publication venue: 'Wiley'
Publication date: 02/06/2015

Aim: Response rates in health surveys have diminished over the last two decades, making it difficult to obtain reliable information on health and health-related risk factors in different population groups. This study compared cause-specific mortality and morbidity among survey respondents and different types of non-respondents to estimate alcohol-, drug- and smoking-related mortality and morbidity among non-respondents. Design: Prospective follow-up study of respondents and non-respondents in two cross-sectional health surveys. Setting: Denmark. Participants: A total sample of 39,540 Danish citizens aged 16 or older. Measurements: Register-based information on cause-specific mortality and morbidity at the individual level was obtained for respondents (n=28,072) and different types of non-respondents (refusals n=8,954; illness/disabled n=731; uncontactable n=1,593). Cox proportional hazards models were used to examine differences in alcohol-, drug- and smoking-related mortality and morbidity, respectively, in a 12-year follow-up period. Findings: Overall, non-response was associated with a significantly increased hazard ratio of 1.56 (95% CI: 1.36–1.78) for alcohol-related morbidity, 1.88 (95% CI: 1.38–2.57) for alcohol-related mortality, 1.55 (95% CI: 1.27–1.88) for drug-related morbidity, 3.04 (95% CI: 1.57–5.89) for drug-related mortality and 1.15 (95% CI: 1.03–1.29) for smoking-related morbidity. The hazard ratio for smoking-related mortality also tended to be higher among non-respondents compared with respondents although no significant association was evident (HR: 1.14; 95% CI: 0.95–1.36). Uncontactable and ill/disabled non-respondents generally had a higher hazard ratio of alcohol-, drug- and smoking-related mortality and morbidity compared with refusal non-respondents.
Conclusion: Health survey non-respondents in Denmark have an increased hazard ratio of alcohol-, drug-, and smoking-related mortality and morbidity compared with respondents, which may indicate more unfavourable health behaviours among non-respondents.
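
The significance statements in the findings follow from whether each 95% confidence interval excludes a hazard ratio of 1. The short sketch below applies that check to the figures quoted in the abstract. The variable and function names are chosen for this example, and the check is the usual rule of thumb at the 5% level rather than a re-analysis of the study.

```python
# Hazard ratios and 95% confidence intervals as reported in the abstract above
# (non-respondents versus respondents): (HR, lower bound, upper bound).
reported = {
    "alcohol-related morbidity": (1.56, 1.36, 1.78),
    "alcohol-related mortality": (1.88, 1.38, 2.57),
    "drug-related morbidity": (1.55, 1.27, 1.88),
    "drug-related mortality": (3.04, 1.57, 5.89),
    "smoking-related morbidity": (1.15, 1.03, 1.29),
    "smoking-related mortality": (1.14, 0.95, 1.36),
}

def excludes_one(lower, upper):
    """True when the interval lies entirely on one side of HR = 1."""
    return lower > 1.0 or upper < 1.0

significant = {
    outcome: excludes_one(lower, upper)
    for outcome, (_hr, lower, upper) in reported.items()
}

for outcome, flag in significant.items():
    verdict = "significant" if flag else "not significant"
    print(f"{outcome}: {verdict} at the 5% level")
```

Only the interval for smoking-related mortality spans 1, which matches the abstract's statement that no significant association was evident for that outcome.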

PubMed Central

Enlighten

Explaining Data-Driven Decisions made by AI Systems: The Counterfactual Approach

Author: Fernández-Loría Carlos
Han Xintian
Provost Foster
Publication date: 13/10/2021

We examine counterfactual explanations for explaining the decisions made by model-based AI systems. The counterfactual approach we consider defines an explanation as a set of the system's data inputs that causally drives the decision (i.e., changing the inputs in the set changes the decision) and is irreducible (i.e., changing any subset of the inputs does not change the decision). We (1) demonstrate how this framework may be used to provide explanations for decisions made by general, data-driven AI systems that may incorporate features with arbitrary data types and multiple predictive models, and (2) propose a heuristic procedure to find the most useful explanations depending on the context. We then contrast counterfactual explanations with methods that explain model predictions by weighting features according to their importance (e.g., SHAP, LIME) and present two fundamental reasons why we should carefully consider whether importance-weight explanations are well-suited to explain system decisions. Specifically, we show that (i) features that have a large importance weight for a model prediction may not affect the corresponding decision, and (ii) importance weights are insufficient to communicate whether and how features influence decisions. We demonstrate this with several concise examples and three detailed case studies that compare the counterfactual approach with SHAP to illustrate various conditions under which counterfactual explanations explain data-driven decisions better than importance weights.
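
The definition of a counterfactual explanation given above (a set of inputs whose change flips the decision, with no proper subset doing so) can be sketched as an exhaustive search. This is a brute-force illustration, not the heuristic procedure the authors propose. Modelling "changing" an input as replacing it with a baseline value, along with the function name and the toy loan rule, are assumptions made for this example.

```python
from itertools import combinations

def counterfactual_explanations(decide, instance, baseline):
    """Return every irreducible set of features whose replacement by the
    baseline value changes the decision. Exponential in the feature count."""
    original = decide(instance)
    features = list(instance)
    found = []
    # Visit subsets by increasing size so smaller explanations are found first.
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            chosen = set(subset)
            # Skip sets that contain an already found explanation:
            # they would not be irreducible.
            if any(previous < chosen for previous in found):
                continue
            changed = {**instance, **{f: baseline[f] for f in subset}}
            if decide(changed) != original:
                found.append(chosen)
    return found

# Toy decision rule (hypothetical): approve if income is high or a guarantor exists.
def approve(applicant):
    return applicant["income"] >= 50 or applicant["has_guarantor"]

applicant = {"income": 60, "has_guarantor": True, "debt": 5}
baseline = {"income": 0, "has_guarantor": False, "debt": 0}
print(counterfactual_explanations(approve, applicant, baseline))
```

In the toy case neither income nor the guarantor alone drives the approval, because changing either one leaves the other sufficient. Only the pair forms an explanation, and debt appears in none.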

arXiv.org e-Print Archive

AIS Electronic Library (AISeL)