Estimation of KL divergence: optimal minimax rate
The problem of estimating the Kullback-Leibler divergence D(P||Q) between two unknown distributions P and Q is studied, under the assumption that the alphabet size k of the distributions can scale to infinity. The estimation is based on m independent samples drawn from P and n independent samples drawn from Q. It is first shown that there exists no consistent estimator that guarantees asymptotically small worst-case quadratic risk over the set of all pairs of distributions. A restricted set that contains pairs of distributions, with density ratio bounded by a function f(k), is further considered. An augmented plug-in estimator is proposed, and is shown to be consistent if and only if m has an order greater than k∨log^2(f(k)), and n has an order greater than kf(k). Moreover, the minimax quadratic risk is characterized to be within a constant factor of (k/(m log k)+kf(k)/(n log k))^2+log^2(f(k))/m+f(k)/n, if m and n exceed constant factors of k/log(k) and kf(k)/log(k), respectively. The lower bound on the minimax quadratic risk is established by employing a generalized Le Cam's method. A minimax optimal estimator is then constructed by combining the polynomial approximation and plug-in approaches.
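The plug-in idea behind the estimator can be sketched in a few lines. The snippet below is a minimal illustration of an augmented plug-in estimate, where a pseudo-count is added to the Q histogram so empty bins do not make the density ratio blow up; the function name and the pseudo-count value c are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def augmented_plugin_kl(samples_p, samples_q, k, c=0.5):
    """Plug-in estimate of D(P||Q) on the alphabet {0, ..., k-1}.

    A pseudo-count c is added to the Q histogram so that empty bins
    cannot make the ratio p/q blow up; this mirrors the augmentation
    idea but is a sketch, not the paper's exact estimator.
    """
    m, n = len(samples_p), len(samples_q)
    p_hat = np.bincount(samples_p, minlength=k) / m
    q_hat = (np.bincount(samples_q, minlength=k) + c) / (n + c * k)
    support = p_hat > 0  # use the convention 0 * log(0/q) = 0
    return float(np.sum(p_hat[support] * np.log(p_hat[support] / q_hat[support])))
```

For P = Q the estimate concentrates near zero as m and n grow, while the bias terms in the abstract's risk bound explain why a careful polynomial-approximation correction is needed for minimax optimality.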
Feature Learning in Image Hierarchies using Functional Maximal Correlation
This paper proposes the Hierarchical Functional Maximal Correlation Algorithm
(HFMCA), a methodology that characterizes dependencies across two
hierarchical levels in multiview systems. By framing view similarities as
dependencies and ensuring contrastivity by imposing orthonormality, HFMCA
achieves faster convergence and increased stability in self-supervised
learning. HFMCA defines and measures dependencies within image hierarchies,
from pixels and patches to full images. We find that the network topology for
approximating orthonormal basis functions aligns with a vanilla CNN, enabling
the decomposition of density ratios between neighboring layers of feature maps.
This approach provides powerful interpretability, revealing the resemblance
between supervision and self-supervision through the lens of internal
representations.
SGLD-Based Information Criteria and the Over-Parameterized Regime
Double-descent refers to the unexpected drop in test loss of a learning
algorithm beyond an interpolating threshold with over-parameterization, which
is not predicted by information criteria in their classical forms due to the
limitations of the standard asymptotic approach. We update these analyses using
the information risk minimization framework and provide Akaike Information
Criterion (AIC) and Bayesian Information Criterion (BIC) for models learned by
stochastic gradient Langevin dynamics (SGLD). Notably, the AIC and BIC penalty
terms for SGLD correspond to specific information measures, i.e., symmetrized
KL information and KL divergence. We extend this information-theoretic analysis
to over-parameterized models by characterizing the SGLD-based BIC for the
random feature model in the regime where the number of parameters and the
number of samples tend to infinity with their ratio held fixed. Our experiments
demonstrate that the refined SGLD-based BIC can track the double-descent curve,
providing meaningful guidance for model selection and revealing new insights
into the behavior of SGLD learning algorithms in the over-parameterized regime.
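The SGLD update underlying these criteria adds Gaussian noise to a gradient step so the iterates sample from (an approximation of) the posterior. The sketch below uses a standard-Gaussian target for illustration, an assumption for the demo rather than the paper's random feature model:

```python
import numpy as np

def sgld(grad_log_post, theta0, step=1e-3, n_iter=1000, rng=None):
    """Minimal stochastic gradient Langevin dynamics iteration.

    Update: theta <- theta + (step/2) * grad_log_post(theta)
                    + sqrt(step) * standard normal noise.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta = (theta
                 + 0.5 * step * grad_log_post(theta)
                 + np.sqrt(step) * rng.standard_normal(theta.shape))
    return theta
```

Running many independent coordinates against the log-density gradient of N(0, 1), i.e. `lambda t: -t`, drives the empirical mean toward 0 and the variance toward 1, which is the sampling behavior (rather than pure optimization) that the SGLD-based information criteria exploit.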
Tighter Expected Generalization Error Bounds via Convexity of Information Measures
Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.
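The convexity step behind the tightening can be checked numerically: by joint convexity of total variation, the distance from the averaged distribution to a reference never exceeds the average of the individual distances. The Dirichlet-sampled distributions below are purely illustrative stand-ins for the per-sample joint distributions, not the paper's construction.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between discrete distributions p and q."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
k = 6
Ps = rng.dirichlet(np.ones(k), size=5)  # stand-ins for per-sample distributions
q = rng.dirichlet(np.ones(k))           # a common reference distribution

bound_avg = tv(Ps.mean(axis=0), q)           # bound from the averaged distribution
bound_ind = np.mean([tv(p, q) for p in Ps])  # average of individual-sample bounds
assert bound_avg <= bound_ind + 1e-12        # convexity: averaged bound is tighter
```

The same inequality holds for Wasserstein distance, which is why the average-joint-distribution bounds in the paper dominate their individual-sample counterparts.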