Certifying and removing disparate impact
What does it mean for an algorithm to be biased? In U.S. law, unintentional
bias is encoded via disparate impact, which occurs when a selection process has
widely different outcomes for different groups, even as it appears to be
neutral. This legal determination hinges on a definition of a protected class
(ethnicity, gender, religious practice) and an explicit description of the
process.
When the process is implemented using computers, determining disparate impact
(and hence bias) is harder. It might not be possible to disclose the process.
In addition, even if the process is open, it might be hard to elucidate in a
legal setting how the algorithm makes its decisions. Instead of requiring
access to the algorithm, we propose making inferences based on the data the
algorithm uses.
We make four contributions to this problem. First, we link the legal notion
of disparate impact to a measure of classification accuracy that, while known,
has received relatively little attention. Second, we propose a test for
disparate impact based on analyzing the information leakage of the protected
class from the other data attributes. Third, we describe methods by which data
might be made unbiased. Finally, we present empirical evidence supporting the
effectiveness of our test for disparate impact and our approach for both
masking bias and preserving relevant information in the data. Interestingly,
our approach resembles some actual selection practices that have recently
received legal scrutiny.Comment: Extended version of paper accepted at 2015 ACM SIGKDD Conference on
Knowledge Discovery and Data Minin
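The abstract describes two measurable quantities: the disparate-impact ratio of outcome rates across groups, and how much the remaining attributes "leak" the protected class. The sketch below is a minimal illustration of both ideas, assuming a simple cross-validated classifier as the leakage measure (the paper itself ties disparate impact to a specific classification-accuracy measure, here approximated by plain accuracy); all variable names and the synthetic data are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def disparate_impact_ratio(outcome, protected):
    """Ratio of positive-outcome rates: protected group vs. the rest."""
    rate_protected = outcome[protected == 1].mean()
    rate_other = outcome[protected == 0].mean()
    return rate_protected / rate_other

def protected_class_leakage(X, protected):
    """Cross-validated accuracy of predicting the protected class from X.
    High predictability suggests the other attributes encode the class."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, protected, cv=5).mean()

# Synthetic example: the protected attribute is correlated with feature 0,
# which also drives the selection outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
protected = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
outcome = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0.3).astype(int)

print("disparate impact ratio:", disparate_impact_ratio(outcome, protected))
print("protected-class predictability:", protected_class_leakage(X, protected))

A ratio well below 1 (e.g., under the 4/5ths threshold used in U.S. practice) together with high predictability of the protected class would flag the data for closer scrutiny.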
Predicting sample size required for classification performance
Background: Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.
Methods: We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness-of-fit measures. As a control, we used an un-weighted fitting method.
Results: A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean average and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
Conclusions: This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine the annotation sample size needed for supervised machine learning.
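A minimal sketch of the core step described in the Methods paragraph: fit an inverse power law to a few observed learning-curve points with weighted nonlinear least squares, then extrapolate to larger annotation budgets. The functional form y = a - b * n^(-c), the choice of weights, and the example numbers are assumptions for illustration, not the paper's exact settings.

import numpy as np
from scipy.optimize import curve_fit

def inverse_power_law(n, a, b, c):
    # Performance approaches the asymptote a as the training-set size n grows.
    return a - b * np.power(n, -c)

# Observed learning-curve points from a small annotated training set (illustrative).
n_obs = np.array([50, 100, 150, 200, 250, 300], dtype=float)
acc_obs = np.array([0.71, 0.76, 0.79, 0.81, 0.82, 0.83])

# Weight later points more heavily (smaller sigma = larger weight),
# since estimates from larger samples are typically less noisy.
sigma = 1.0 / np.sqrt(n_obs)

params, cov = curve_fit(inverse_power_law, n_obs, acc_obs,
                        p0=(0.9, 1.0, 0.5), sigma=sigma, maxfev=10000)

# Predict performance at larger annotation budgets.
for n in (500, 1000, 2000):
    print(n, round(inverse_power_law(n, *params), 3))

The parameter covariance returned by curve_fit can additionally be propagated to form a confidence interval around each extrapolated point, which is the role the fitted model's confidence interval plays in the abstract.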