This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P \ast \mathcal{N}_\sigma$, for $\mathcal{N}_\sigma \triangleq \mathcal{N}(0, \sigma^2 \mathrm{I}_d)$, by $\hat{P}_n \ast \mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($W_1$) converges at the rate $e^{O(d)} n^{-1/2}$, in remarkable contrast to a typical $n^{-1/d}$ rate for unsmoothed $W_1$ (and $d \ge 3$). For the KL divergence, squared 2-Wasserstein distance ($W_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)} n^{-1}$, but only if $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $W_2^2$, while the $\chi^2$-divergence becomes infinite, a curious dichotomy.
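For concreteness, $\hat{P}_n \ast \mathcal{N}_\sigma$ is simply a uniform mixture of $n$ Gaussians $\mathcal{N}(X_i, \sigma^2 \mathrm{I}_d)$ centered at the samples $X_1, \dots, X_n$. The following minimal Python sketch (helper name and the choice of a standard Gaussian $P$ are purely illustrative assumptions; the results above hold for general $P$) evaluates its log-density at a point.

```python
import numpy as np
from scipy.special import logsumexp

def smoothed_empirical_logpdf(y, samples, sigma):
    """Log-density of the Gaussian mixture P_hat_n * N_sigma at a point y."""
    n, d = samples.shape
    # log of each mixture component N(y; X_i, sigma^2 I_d)
    log_comp = (-0.5 * np.sum((y - samples) ** 2, axis=1) / sigma ** 2
                - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    # uniform mixture weights 1/n
    return logsumexp(log_comp) - np.log(n)

rng = np.random.default_rng(0)
d, n, sigma = 3, 1000, 1.0
samples = rng.standard_normal((n, d))            # X_1, ..., X_n ~ P (illustrative choice)
y = samples[0] + sigma * rng.standard_normal(d)  # a point Y = X + Z in the support region
print(smoothed_empirical_logpdf(y, samples, sigma))
```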
As a main application we consider estimating the differential entropy $h(P \ast \mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution $P$ is unknown, but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P \ast \mathcal{N}_\sigma)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results, we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)} n^{-1/2}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach over general-purpose
differential entropy estimators are provided.
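The plug-in estimator admits an equally short sketch. A standard way to evaluate $h(\hat{P}_n \ast \mathcal{N}_\sigma)$ numerically, assumed here for illustration rather than dictated by the text above, is Monte Carlo integration: draw fresh points from the Gaussian mixture $\hat{P}_n \ast \mathcal{N}_\sigma$ and average $-\log$ of its density.

```python
import numpy as np
from scipy.special import logsumexp

def plug_in_entropy(samples, sigma, m=5_000, rng=None):
    """Monte Carlo estimate (in nats) of h(P_hat_n * N_sigma), the differential
    entropy of the uniform Gaussian mixture centered at the n samples.

    This is an assumed numerical scheme, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape
    # Draw Y_j ~ P_hat_n * N_sigma: pick a sample uniformly, add N(0, sigma^2 I_d) noise.
    idx = rng.integers(n, size=m)
    ys = samples[idx] + sigma * rng.standard_normal((m, d))
    # Log-density of the mixture at each Y_j, via pairwise squared distances (m x n).
    sq = (ys ** 2).sum(1)[:, None] - 2.0 * ys @ samples.T + (samples ** 2).sum(1)[None, :]
    log_comp = -0.5 * sq / sigma ** 2 - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    log_q = logsumexp(log_comp, axis=1) - np.log(n)
    # h = E[-log q(Y)] under Y ~ q, estimated by the sample mean.
    return -log_q.mean()

# Usage: n = 1000 samples from an (illustrative) standard Gaussian P in d = 3.
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, 3))
print(plug_in_entropy(samples, sigma=1.0, rng=rng))
```

Sampling the Monte Carlo points from the mixture itself keeps the estimate a plain average of $-\log$ mixture densities, with no importance weights needed.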