Testing with Non-identically Distributed Samples

Garg, Shivam; Pabbaraju, Chirag; Shiragur, Kirankumar; Valiant, Gregory

Publication date: 18 November 2023

Abstract

We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing
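
The sampling model in the abstract can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: it builds T arbitrary distributions over a support of size k, draws c samples from each, and estimates the average distribution by pooling all samples into one empirical histogram. The helper names (random_distribution, draw_samples, empirical_average, tv_distance) and the parameter values are assumptions chosen for illustration, and the pooled histogram is simply the natural plug-in estimator for the learning task, not a procedure taken from the paper.

```python
import random
from collections import Counter


def random_distribution(k, rng):
    """Hypothetical helper: an arbitrary probability vector over {0, ..., k-1}."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def average_distribution(dists):
    """Coordinate-wise mean of the T distributions, i.e. p_avg."""
    T = len(dists)
    k = len(dists[0])
    return [sum(p[j] for p in dists) / T for j in range(k)]


def draw_samples(dists, c, rng):
    """Draw c independent samples from each distribution.

    Returns one list of c samples per distribution, so the grouping
    (which samples came from the same p_i) is preserved.
    """
    k = len(dists[0])
    return [rng.choices(range(k), weights=p, k=c) for p in dists]


def empirical_average(samples, k):
    """Plug-in estimate of p_avg: pool every sample and normalize the counts."""
    counts = Counter(x for group in samples for x in group)
    n = sum(counts.values())
    return [counts[j] / n for j in range(k)]


def tv_distance(p, q):
    """Total variation distance, equal to half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Illustrative parameters (assumptions, not values from the paper).
rng = random.Random(0)
k, T, c = 10, 20000, 1

dists = [random_distribution(k, rng) for _ in range(T)]
p_avg = average_distribution(dists)
samples = draw_samples(dists, c, rng)
p_hat = empirical_average(samples, k)

print("TV distance between p_avg and its pooled estimate:",
      tv_distance(p_avg, p_hat))
```

Because each pooled sample is an unbiased draw for its own distribution, the pooled histogram is unbiased for the average distribution even with c = 1, which is consistent with the learning claim above. The sketch keeps the per-distribution grouping in `samples` since the abstract's final result concerns testers that discard exactly that grouping.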

arXiv.org e-Print Archive

oai:arXiv.org:2311.11194

Last time updated on 07/05/2024