As AI-based decision systems proliferate, their successful operationalization
requires balancing multiple desiderata: predictive performance, disparity
across groups, safeguarding sensitive group attributes (e.g., race), and
engineering cost. We present a holistic framework for evaluating and
contextualizing fairness interventions with respect to the above desiderata.
The two key points of practical consideration are \emph{where} (pre-, in-, or
post-processing) and \emph{how} (in what way the sensitive group data is used)
the intervention is introduced. We demonstrate our framework with a case study
on predictive parity. In it, we first propose a novel method for achieving
predictive parity fairness without using group data at inference time via
distributionally robust optimization. Then, we showcase the effectiveness of
these methods in a benchmarking study of close to 400 variations across two
major model types (XGBoost and neural networks), ten datasets, and over twenty
unique methodologies. Methodological insights derived from our empirical study
inform the practical design of ML workflows with fairness as a central concern.
We find that predictive parity is difficult to achieve without using group
data, and that, despite requiring group data during model training (but not at
inference), the distributionally robust methods we develop provide significant
Pareto improvements. Moreover, a plain XGBoost model often Pareto-dominates
neural networks with fairness interventions, highlighting the importance of
model inductive bias.
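
For concreteness, predictive parity is commonly formalized as equal positive
predictive value across groups. The notation here ($Y$ for the true label,
$\hat{Y}$ for the predicted label, $A$ for the sensitive group attribute) is
assumed for illustration and is not fixed by the text above:
\[
  \Pr(Y = 1 \mid \hat{Y} = 1, A = a) = \Pr(Y = 1 \mid \hat{Y} = 1, A = b)
  \quad \mbox{for all groups } a, b.
\]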