Testing with Non-identically Distributed Samples

Abstract

We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing
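The sampling model described above can be illustrated with a small simulation. The following is a minimal sketch, not the testers or estimators analyzed in this work: it draws $c$ samples from each of $T$ heterogeneous distributions, pools them into an empirical estimate of $\textbf{p}_{\mathrm{avg}}$, and measures the TV distance to the true average. All function names (`random_distribution`, `draw_samples`, and so on) and the way the $\textbf{p}_i$ are generated are assumptions made for illustration only.

```python
import random


def random_distribution(k, rng):
    """Hypothetical helper: an arbitrary distribution over {0, ..., k-1}."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_samples(dists, c, rng):
    """Draw c independent samples from each p_i, kept grouped by source."""
    support = range(len(dists[0]))
    return [rng.choices(support, weights=p, k=c) for p in dists]


def average_distribution(dists):
    """p_avg: the coordinate-wise mean of p_1, ..., p_T."""
    k, T = len(dists[0]), len(dists)
    return [sum(p[j] for p in dists) / T for j in range(k)]


def empirical_distribution(groups, k):
    """Pool all samples (ignoring grouping) and normalize the counts."""
    counts = [0] * k
    n = 0
    for group in groups:
        for x in group:
            counts[x] += 1
            n += 1
    return [cnt / n for cnt in counts]


def tv_distance(p, q):
    """Total variation distance, i.e. half the l1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    rng = random.Random(0)
    k, T, c = 5, 5000, 1
    dists = [random_distribution(k, rng) for _ in range(T)]
    groups = draw_samples(dists, c, rng)
    p_avg = average_distribution(dists)
    p_hat = empirical_distribution(groups, k)
    print("TV(p_hat, p_avg) =", tv_distance(p_hat, p_avg))
```

The samples are returned grouped by source distribution because, for $c \ge 2$, which samples share a source is exactly the information that a multiset-only tester discards; the pooled estimate here ignores that grouping, which is adequate for learning $\textbf{p}_{\mathrm{avg}}$ but is not a uniformity tester.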
