Faculty of Science, School of Mathematics and Statistics
Abstract
Deep neural networks (DNNs) have become central to modern machine learning due to their strong
empirical performance. However, their theoretical understanding—especially regarding
generalization—remains limited. This thesis advances the theory of deep ReLU networks through two
lenses: pairwise learning tasks and gradient descent methods.
For pairwise learning, we study generalization in non-parametric estimation without relying on
restrictive convexity or VC-class assumptions. We establish sharp oracle inequalities for empirical
minimizers under general hypothesis spaces and Lipschitz pairwise losses. Applied to pairwise least
squares regression, our bounds match known minimax rates up to logarithmic factors. A key innovation is the construction of a structured deep ReLU network that approximates the true predictor, yielding a target hypothesis space with controlled complexity. This framework handles problems beyond the reach of existing theories.
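As a schematic of the setting (the notation here is illustrative rather than quoted from the thesis), a pairwise loss $\ell$ assigns a risk to each predictor $f$, and the empirical minimizer is taken over all pairs in the sample:
$$\mathcal{E}(f)=\mathbb{E}\big[\ell\big(f(X,X'),Y,Y'\big)\big],\qquad f_{\mathbf z}=\arg\min_{f\in\mathcal H}\ \frac{2}{n(n-1)}\sum_{1\le i<j\le n}\ell\big(f(X_i,X_j),Y_i,Y_j\big).$$
The oracle inequalities bound the excess risk $\mathcal{E}(f_{\mathbf z})-\inf_{f}\mathcal{E}(f)$ in terms of the complexity of $\mathcal H$; for pairwise least squares regression one common convention takes $\ell(t,y,y')=\big(t-(y-y')\big)^2$.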
For metric and similarity learning, we exploit the structure of the true metric. By deriving its explicit form under the hinge loss, we approximate it with structured deep ReLU networks and analyze the excess generalization error by bounding the approximation and estimation errors separately. An optimal excess risk rate is achieved, which to our knowledge is the first such analysis for metric and similarity learning. We also explore
extensions to general losses.
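The error analysis rests on the standard decomposition (written schematically; $f_{\mathbf z}$ denotes the empirical minimizer, $f_{\mathcal H}$ a risk minimizer over the structured deep ReLU class $\mathcal H$, and $f^*$ the optimal metric or similarity function):
$$\mathcal{E}(f_{\mathbf z})-\mathcal{E}(f^*)\;\le\;\underbrace{\mathcal{E}(f_{\mathbf z})-\mathcal{E}_{\mathbf z}(f_{\mathbf z})+\mathcal{E}_{\mathbf z}(f_{\mathcal H})-\mathcal{E}(f_{\mathcal H})}_{\text{estimation error}}\;+\;\underbrace{\mathcal{E}(f_{\mathcal H})-\mathcal{E}(f^*)}_{\text{approximation error}}.$$
The approximation term is controlled by how well the structured network class captures the derived form of $f^*$, and the estimation term by the capacity of that class.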
For gradient descent methods, we study gradient descent (GD) and stochastic gradient descent (SGD) for overparameterized deep ReLU networks in the neural tangent kernel (NTK) regime. Prior work mainly covers shallow networks; we fill this gap by establishing the first minimax-optimal generalization rates for GD and SGD with deep architectures. Under polynomial width scaling, our results show that these methods can match the generalization performance of kernel methods.
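As a brief reminder of the regime (a standard formulation, not quoted from the thesis), for a deep ReLU network $f(x;\theta)$ of width $m$, the empirical neural tangent kernel at initialization $\theta_0$ is
$$K_m(x,x')=\big\langle\nabla_\theta f(x;\theta_0),\,\nabla_\theta f(x';\theta_0)\big\rangle,$$
and in the NTK regime the GD/SGD iterates stay close to kernel regression with the limiting kernel $K=\lim_{m\to\infty}K_m$, which is why the rates above can be benchmarked against those of kernel methods.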