Strong Data Processing Inequalities for Input Constrained Additive Noise Channels
This paper quantifies the intuitive observation that adding noise reduces
available information by means of non-linear strong data processing
inequalities. Consider the random variables $U \to X \to Y$ forming a Markov
chain, where $Y = X + Z$ with $X$ and $Z$ real-valued, independent, and $X$
bounded in $p$-norm. It is shown that $I(U;Y) \le F_I(I(U;X))$ with
$F_I(t) < t$ whenever $t > 0$, if and only if $Z$ has a density whose support
is not disjoint from any translate of itself. A related question is to
characterize for what couplings $(U, X)$ the mutual information $I(U;Y)$ is
close to the maximum possible. To that end we show that in order to saturate
the channel, i.e., for $I(U;Y)$ to approach capacity, it is mandatory that
$I(U;X) \to \infty$ (under suitable conditions on the channel). A key
ingredient for this result is a deconvolution
lemma which shows that post-convolution total variation distance bounds the
pre-convolution Kolmogorov-Smirnov distance. Explicit bounds are provided for
the special case of the additive Gaussian noise channel with quadratic cost
constraint. These bounds are shown to be order-optimal. For this case
simplified proofs are provided leveraging Gaussian-specific tools such as the
connection between information and estimation (I-MMSE) and Talagrand's
information-transportation inequality.
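One of the Gaussian-specific tools mentioned is the I-MMSE relation of Guo, Shamai, and Verdú, which for the Gaussian channel reads (mutual information in nats):

```latex
% I-MMSE: for standard Gaussian noise N independent of X,
\frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,
  I\!\left(X;\, \sqrt{\mathrm{snr}}\, X + N\right)
  = \tfrac{1}{2}\, \mathrm{mmse}\!\left(X \mid \sqrt{\mathrm{snr}}\, X + N\right),
% where mmse is the minimum mean-square error of estimating X
% from the noisy observation.
```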
Correspondence Analysis Using Neural Networks
Correspondence analysis (CA) is a multivariate statistical tool used to
visualize and interpret data dependencies. CA has found applications in fields
ranging from epidemiology to social sciences. However, current methods used to
perform CA do not scale to large, high-dimensional datasets. By re-interpreting
the objective in CA using an information-theoretic tool called the principal
inertia components, we demonstrate that performing CA is equivalent to solving
a functional optimization problem over the space of finite variance functions
of two random variables. We show that this optimization problem, in turn, can be
efficiently approximated by neural networks. The resulting formulation, called
the correspondence analysis neural network (CA-NN), enables CA to be performed
at an unprecedented scale. We validate the CA-NN on synthetic data, and
demonstrate how it can be used to perform CA on a variety of datasets,
including food recipes, wine compositions, and images. Our results outperform
traditional methods used in CA, indicating that CA-NN can serve as a new,
scalable tool for interpretability and visualization of complex dependencies
between random variables.
Comment: Accepted to AISTATS 2019. Overlaps with arXiv:1806.0844
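For reference, classical CA, the baseline the CA-NN is designed to scale past, reduces to an SVD of the standardized residuals of the contingency table, with squared singular values giving the principal inertias. A minimal NumPy sketch of that classical baseline (not the CA-NN itself; all names are illustrative):

```python
import numpy as np

def classical_ca(N, k=2):
    """Classical correspondence analysis of a contingency table N
    via SVD of the standardized residual matrix."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates of rows and columns (top-k dimensions)
    F = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
    G = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]
    return F, G, s[:k] ** 2              # squared singular values = principal inertias

# toy usage on a small contingency table
N = np.array([[16, 4, 2], [3, 12, 5], [1, 6, 14]], dtype=float)
F, G, inertias = classical_ca(N)
print(inertias)
```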
On the Direction of Discrimination: An Information-Theoretic Analysis of Disparate Impact in Machine Learning
In the context of machine learning, disparate impact refers to a form of
systematic discrimination whereby the output distribution of a model depends on
the value of a sensitive attribute (e.g., race or gender). In this paper, we
propose an information-theoretic framework to analyze the disparate impact of a
binary classification model. We view the model as a fixed channel, and quantify
disparate impact as the divergence in output distributions over two groups. Our
aim is to find a correction function that can perturb the input distributions
of each group to align their output distributions. We present an optimization
problem that can be solved to obtain a correction function that will make the
output distributions statistically indistinguishable. We derive closed-form
expressions to efficiently compute the correction function, and demonstrate the
benefits of our framework on a recidivism prediction problem based on the
ProPublica COMPAS dataset.
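The quantity the correction function targets can be made concrete for a black-box binary classifier. A small sketch, using total variation as one concrete choice of divergence between the groups' output distributions (the abstract does not commit to a specific divergence; all names are illustrative):

```python
import numpy as np

def output_distribution(model, X):
    """Empirical distribution of a binary classifier's output over a group."""
    preds = model(X)                        # array of 0/1 predictions
    p1 = preds.mean()
    return np.array([1.0 - p1, p1])

def disparate_impact_tv(model, X_group_a, X_group_b):
    """Total variation between output distributions; 0 means aligned outputs."""
    pa = output_distribution(model, X_group_a)
    pb = output_distribution(model, X_group_b)
    return 0.5 * np.abs(pa - pb).sum()

# toy usage: a fixed threshold model and two synthetic groups
model = lambda X: (X[:, 0] > 0.5).astype(int)
rng = np.random.default_rng(0)
X_a = rng.uniform(0.0, 1.0, size=(1000, 2))
X_b = rng.uniform(0.2, 1.2, size=(1000, 2))
print(disparate_impact_tv(model, X_a, X_b))
```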
A Tunable Measure for Information Leakage
A tunable measure for information leakage called \textit{maximal
$\alpha$-leakage} is introduced. This measure quantifies the maximal gain of an
adversary in refining a tilted version of its prior belief of any (potentially
random) function of a dataset conditioned on a disclosed dataset. The choice
of $\alpha$ determines the specific adversarial action, ranging from refining a
belief for $\alpha = 1$ to guessing the best posterior for $\alpha = \infty$,
and for these extremal values this measure simplifies to mutual information
(MI) and maximal leakage (MaxL), respectively. For all other values of $\alpha$
this measure is shown to be the Arimoto channel capacity. Several properties of this
measure are proven including: (i) quasi-convexity in the mapping between the
original and disclosed datasets; (ii) data processing inequalities; and (iii) a
composition property.
Comment: 7 pages. This paper is the extended version of the conference paper
"A Tunable Measure for Information Leakage" accepted by ISIT 201
On the Robustness of Information-Theoretic Privacy Measures and Mechanisms
Consider a data publishing setting for a dataset composed by both private and
non-private features. The publisher uses an empirical distribution, estimated
from $n$ i.i.d. samples, to design a privacy mechanism that is then applied to
fresh samples. In this paper, we study the discrepancy between the
privacy-utility guarantees for the empirical distribution, used to design the
privacy mechanism, and those for the true distribution, experienced by the
privacy mechanism in practice. We first show that, for any privacy mechanism,
these discrepancies vanish at speed $O(1/\sqrt{n})$ with high probability.
These bounds follow from our main technical results regarding the Lipschitz
continuity of the considered information leakage measures. Then we prove that
the optimal privacy mechanisms for the empirical distribution approach the
corresponding mechanisms for the true distribution as the sample size
increases, thereby establishing the statistical consistency of the optimal
privacy mechanisms. Finally, we introduce and study uniform privacy mechanisms
which, by construction, provide privacy to all the distributions within a
neighborhood of the estimated distribution and, thereby, guarantee privacy for
the true distribution with high probability.
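The $O(1/\sqrt{n})$ speed is the familiar rate at which an empirical distribution approaches the truth. A quick illustrative simulation (not the paper's leakage measures, just the underlying estimation rate):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])            # true distribution over 3 symbols

for n in [100, 1_000, 10_000, 100_000]:
    counts = rng.multinomial(n, p)
    p_hat = counts / n                   # empirical distribution from n i.i.d. samples
    tv = 0.5 * np.abs(p_hat - p).sum()   # total variation distance
    print(f"n={n:>6}  TV={tv:.4f}  sqrt(n)*TV={np.sqrt(n) * tv:.3f}")
```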
Privacy Under Hard Distortion Constraints
We study the problem of data disclosure with privacy guarantees, wherein the
utility of the disclosed data is ensured via a \emph{hard distortion}
constraint. Unlike average distortion, hard distortion provides a deterministic
guarantee of fidelity. For the privacy measure, we use a tunable information
leakage measure, namely \textit{maximal $\alpha$-leakage}, and formulate the
privacy-utility tradeoff problem. The resulting solution highlights that under
a hard distortion constraint, the nature of the solution remains unchanged for
both local and non-local privacy requirements. More precisely, we show that
both the optimal mechanism and the optimal tradeoff are invariant for any
$\alpha \in (1,\infty]$; i.e., the tunable leakage measure only behaves as either
of the two extrema: mutual information for $\alpha = 1$ and maximal leakage for
$\alpha \in (1,\infty]$.
Comment: 5 pages, 1 figure
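The contrast the abstract draws can be written compactly. With $d$ a distortion function, $X$ the original data, and $\hat{X}$ the disclosed data (generic symbols, not necessarily the paper's):

```latex
% average distortion: a guarantee only in expectation
\mathbb{E}\left[ d(X, \hat{X}) \right] \le D
\qquad\text{vs.}\qquad
% hard distortion: a deterministic, per-realization guarantee
\Pr\left( d(X, \hat{X}) > D \right) = 0 .
```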
Robustness of Maximal $\alpha$-Leakage to Side Information
Maximal $\alpha$-leakage is a tunable measure of information leakage based on
the accuracy of guessing an arbitrary function of private data from public
data. The parameter $\alpha$ determines the loss function used to measure the
accuracy of a belief, ranging from log-loss at $\alpha = 1$ to the probability
of error at $\alpha = \infty$. To study the effect of side information on this
measure, we introduce and define conditional maximal $\alpha$-leakage. We show
that, for a chosen mapping (channel) from the actual (viewed as private) data
to the released (public) data and some side information, the conditional
maximal $\alpha$-leakage is the supremum (over all side information) of the
conditional Arimoto channel capacity where the conditioning is on the side
information. We prove that if the side information is conditionally independent
of the public data given the private data, the side information cannot increase
the information leakage.
Comment: This paper has been accepted by ISIT 201
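The closing claim has the shape of a conditional data processing statement; informally (the notation below is illustrative, not necessarily the paper's):

```latex
% if side information Z is conditionally independent of the public data Y
% given the private data X, i.e. Z - X - Y form a Markov chain,
% then conditioning on Z cannot increase the leakage:
Z \,\text{--}\, X \,\text{--}\, Y
\quad\Longrightarrow\quad
\mathcal{L}^{\max}_{\alpha}(X \to Y \mid Z) \;\le\; \mathcal{L}^{\max}_{\alpha}(X \to Y).
```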
Hypothesis Testing under Mutual Information Privacy Constraints in the High Privacy Regime
Hypothesis testing is a statistical inference framework for determining the
true distribution among a set of possible distributions for a given dataset.
Privacy restrictions may require the curator of the data or the respondents
themselves to share data with the test only after applying a randomizing
privacy mechanism. This work considers mutual information (MI) as the privacy
metric for measuring leakage. In addition, motivated by the Chernoff-Stein
lemma, the relative entropy between pairs of distributions of the output
(generated by the privacy mechanism) is chosen as the utility metric. For these
metrics, the goal is to find the optimal privacy-utility trade-off (PUT) and
the corresponding optimal privacy mechanism for both binary and m-ary
hypothesis testing. Focusing on the high privacy regime, Euclidean
information-theoretic approximations of the binary and m-ary PUT problems are
developed. The solutions for the approximation problems clarify that an
MI-based privacy metric preserves the privacy of the source symbols in inverse
proportion to their likelihoods.
Comment: 13 pages, 7 figures. The paper is submitted to "Transactions on
Information Forensics & Security". Compared to the paper arXiv:1607.00533
"Hypothesis Testing in the High Privacy Limit", the overlapping content is
the results for binary hypothesis testing with a zero error exponent, and the
extended content is the results for both m-ary hypothesis testing and binary
hypothesis testing with a nonzero error exponent.
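The Chernoff-Stein lemma says the best type-II error exponent in binary hypothesis testing is the relative entropy between the two hypotheses, which motivates the utility metric. Schematically, the binary PUT then takes the following form, with $W$ the privacy mechanism and $P_0, P_1$ the candidate source distributions (a schematic formulation, not necessarily the paper's exact one):

```latex
% maximize the output relative entropy (utility, via Chernoff-Stein)
% subject to a mutual-information privacy constraint under both sources
\max_{W}\; D\!\left( P_0 W \,\|\, P_1 W \right)
\quad\text{s.t.}\quad
I(P_i, W) \le \epsilon, \quad i = 0, 1,
% where P W denotes the output distribution induced by pushing P through W;
% the high privacy regime corresponds to small \epsilon.
```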
Optimized Data Pre-Processing for Discrimination Prevention
Non-discrimination is a recognized objective in algorithmic decision making.
In this paper, we introduce a novel probabilistic formulation of data
pre-processing for reducing discrimination. We propose a convex optimization
for learning a data transformation with three goals: controlling
discrimination, limiting distortion in individual data samples, and preserving
utility. We characterize the impact of limited sample size in accomplishing
this objective, and apply two instances of the proposed optimization to
datasets, including one on real-world criminal recidivism. The results
demonstrate that all three criteria can be simultaneously achieved and also
reveal interesting patterns of bias in American society.
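A toy instance of such a convex program is sketched below in cvxpy, under heavy simplifying assumptions (binary outcome, two groups, a randomized per-group remapping of the outcome; discrimination is controlled via near-equal positive rates, distortion via the probability of flipping an outcome; all numbers are made up, and this is not the paper's exact formulation):

```python
import cvxpy as cp
import numpy as np

# toy statistics: P(Y=1 | D=d) for groups d = 0, 1, and group proportions
p_y1 = np.array([0.3, 0.6])          # base rates per group
p_d = np.array([0.5, 0.5])           # group proportions
eps = 0.02                           # discrimination-control tolerance

# decision variable: T[d, y] = P(Yhat = 1 | Y = y, D = d), a randomized remapping
T = cp.Variable((2, 2))

# transformed positive rate per group
rate = [T[d, 1] * p_y1[d] + T[d, 0] * (1 - p_y1[d]) for d in range(2)]

# expected distortion = probability of flipping the original outcome
flip = sum(p_d[d] * ((1 - T[d, 1]) * p_y1[d] + T[d, 0] * (1 - p_y1[d]))
           for d in range(2))

constraints = [T >= 0, T <= 1,
               rate[0] - rate[1] <= eps,     # discrimination control:
               rate[1] - rate[0] <= eps]     # near-equal positive rates

prob = cp.Problem(cp.Minimize(flip), constraints)
prob.solve()
print("flip probability:", prob.value)
print("remapping:", np.round(T.value, 3))
```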
Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions
When the performance of a machine learning model varies over groups defined
by sensitive attributes (e.g., gender or ethnicity), the performance disparity
can be expressed in terms of the probability distributions of the input and
output variables over each group. In this paper, we exploit this fact to reduce
the disparate impact of a fixed classification model over a population of
interest. Given a black-box classifier, we aim to eliminate the performance gap
by perturbing the distribution of input variables for the disadvantaged group.
We refer to the perturbed distribution as a counterfactual distribution, and
characterize its properties for common fairness criteria. We introduce a
descent algorithm to learn a counterfactual distribution from data. We then
discuss how the estimated distribution can be used to build a data preprocessor
that can reduce disparate impact without training a new model. We validate our
approach through experiments on real-world datasets, showing that it can repair
different forms of disparity without a significant drop in accuracy.
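As a flavor of the approach, the sketch below uses a much cruder stand-in for the paper's descent algorithm: gradient descent on sample weights for the disadvantaged group until the black-box classifier's weighted positive rate matches the other group's (everything here is illustrative, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)

# a black-box classifier and two synthetic groups with a performance gap
model = lambda X: (X[:, 0] + 0.3 * X[:, 1] > 1.0).astype(float)
X_adv = rng.normal(1.0, 0.5, size=(2000, 2))   # advantaged group
X_dis = rng.normal(0.6, 0.5, size=(2000, 2))   # disadvantaged group

target = model(X_adv).mean()                   # positive rate to match
preds = model(X_dis)

# learn sample weights (a crude proxy for perturbing the input
# distribution) by gradient descent on the squared positive-rate gap
logits = np.zeros(len(X_dis))
for _ in range(2000):
    w = np.exp(logits)
    w /= w.sum()                               # softmax weights form a distribution
    gap = w @ preds - target
    grad = gap * w * (preds - w @ preds)       # d/d logits of 0.5 * gap**2
    logits -= 100.0 * grad

w = np.exp(logits)
w /= w.sum()
print("original rate:", preds.mean())
print("reweighted rate:", w @ preds, "target:", target)
```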