Estimation of KL divergence: optimal minimax rate
The problem of estimating the Kullback-Leibler divergence D(P||Q) between two unknown distributions P and Q is studied, under the assumption that the alphabet size k of the distributions can scale to infinity. The estimation is based on m independent samples drawn from P and n independent samples drawn from Q. It is first shown that there exists no consistent estimator that guarantees asymptotically small worst-case quadratic risk over the set of all pairs of distributions. A restricted set that contains pairs of distributions, with density ratio bounded by a function f(k), is further considered. An augmented plug-in estimator is proposed, and is shown to be consistent if and only if m has an order greater than k∨log^2(f(k)), and n has an order greater than kf(k). Moreover, the minimax quadratic risk is characterized to be within a constant factor of (k/(m log k)+kf(k)/(n log k))^2+log^2(f(k))/m+f(k)/n, if m and n exceed constant factors of k/log(k) and kf(k)/log(k), respectively. The lower bound on the minimax quadratic risk is established by employing a generalized Le Cam's method. A minimax optimal estimator is then constructed by combining the polynomial approximation and plug-in approaches.
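The plug-in idea behind the estimator can be sketched in a few lines. The snippet below is a minimal illustration of an augmented plug-in estimate, where a pseudo-count is added to the Q histogram so empty bins do not make the density ratio blow up; the function name and the pseudo-count value c are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def augmented_plugin_kl(samples_p, samples_q, k, c=0.5):
    """Plug-in estimate of D(P||Q) on the alphabet {0, ..., k-1}.

    A pseudo-count c is added to the Q histogram so that empty bins
    cannot make the ratio p/q blow up; this mirrors the augmentation
    idea but is a sketch, not the paper's exact estimator.
    """
    m, n = len(samples_p), len(samples_q)
    p_hat = np.bincount(samples_p, minlength=k) / m
    q_hat = (np.bincount(samples_q, minlength=k) + c) / (n + c * k)
    support = p_hat > 0  # use the convention 0 * log(0/q) = 0
    return float(np.sum(p_hat[support] * np.log(p_hat[support] / q_hat[support])))
```

For P = Q the estimate concentrates near zero as m and n grow, while the bias terms in the abstract's risk bound explain why a careful polynomial-approximation correction is needed for minimax optimality.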
Feature Learning in Image Hierarchies using Functional Maximal Correlation
This paper proposes the Hierarchical Functional Maximal Correlation Algorithm
(HFMCA), a methodology that characterizes dependencies across two
hierarchical levels in multiview systems. By framing view similarities as
dependencies and ensuring contrastivity by imposing orthonormality, HFMCA
achieves faster convergence and increased stability in self-supervised
learning. HFMCA defines and measures dependencies within image hierarchies,
from pixels and patches to full images. We find that the network topology for
approximating orthonormal basis functions aligns with a vanilla CNN, enabling
the decomposition of density ratios between neighboring layers of feature maps.
This approach provides powerful interpretability, revealing the resemblance
between supervision and self-supervision through the lens of internal
representations.
SGLD-Based Information Criteria and the Over-Parameterized Regime
Double-descent refers to the unexpected drop in test loss of a learning
algorithm beyond an interpolating threshold with over-parameterization, which
is not predicted by information criteria in their classical forms due to the
limitations of the standard asymptotic approach. We update these analyses using
the information risk minimization framework and provide Akaike Information
Criterion (AIC) and Bayesian Information Criterion (BIC) for models learned by
stochastic gradient Langevin dynamics (SGLD). Notably, the AIC and BIC penalty
terms for SGLD correspond to specific information measures, i.e., symmetrized
KL information and KL divergence. We extend this information-theoretic analysis
to over-parameterized models by characterizing the SGLD-based BIC for the
random feature model in the regime where the number of parameters and the
number of samples tend to infinity with their ratio held fixed. Our experiments
demonstrate that the refined SGLD-based BIC can track the double-descent curve,
providing meaningful guidance for model selection and revealing new insights
into the behavior of SGLD learning algorithms in the over-parameterized regime.
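The SGLD update underlying these criteria adds Gaussian noise to a gradient step so the iterates sample from (an approximation of) the posterior. The sketch below uses a standard-Gaussian target for illustration, an assumption for the demo rather than the paper's random feature model:

```python
import numpy as np

def sgld(grad_log_post, theta0, step=1e-3, n_iter=1000, rng=None):
    """Minimal stochastic gradient Langevin dynamics iteration.

    Update: theta <- theta + (step/2) * grad_log_post(theta)
                    + sqrt(step) * standard normal noise.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta = (theta
                 + 0.5 * step * grad_log_post(theta)
                 + np.sqrt(step) * rng.standard_normal(theta.shape))
    return theta
```

Running many independent coordinates against the log-density gradient of N(0, 1), i.e. `lambda t: -t`, drives the empirical mean toward 0 and the variance toward 1, which is the sampling behavior (rather than pure optimization) that the SGLD-based information criteria exploit.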
Tighter Expected Generalization Error Bounds via Convexity of Information Measures
Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.
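The convexity step behind the tightening can be checked numerically: by joint convexity of total variation, the distance from the averaged distribution to a reference never exceeds the average of the individual distances. The Dirichlet-sampled distributions below are purely illustrative stand-ins for the per-sample joint distributions, not the paper's construction.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between discrete distributions p and q."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
k = 6
Ps = rng.dirichlet(np.ones(k), size=5)  # stand-ins for per-sample distributions
q = rng.dirichlet(np.ones(k))           # a common reference distribution

bound_avg = tv(Ps.mean(axis=0), q)           # bound from the averaged distribution
bound_ind = np.mean([tv(p, q) for p in Ps])  # average of individual-sample bounds
assert bound_avg <= bound_ind + 1e-12        # convexity: averaged bound is tighter
```

The same inequality holds for Wasserstein distance, which is why the average-joint-distribution bounds in the paper dominate their individual-sample counterparts.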