More Communication Does Not Result in Smaller Generalization Error in Federated Learning
We study the generalization error of statistical learning models in a
Federated Learning (FL) setting. Specifically, there are K devices or
clients, each holding its own independent dataset of size n. Individual
models, learned locally via Stochastic Gradient Descent, are aggregated
(averaged) by a central server into a global model and then sent back to the
devices. We consider multiple (say R) rounds of model aggregation and study
the effect of R on the generalization error of the final aggregated model. We
establish an upper bound on the generalization error that accounts explicitly
for the effect of R (in addition to the number of participating devices K and
the dataset size n). It is observed that, for fixed n, the bound increases
with R, suggesting that the generalization of such learning algorithms is
negatively affected by more frequent communication with the parameter server.
Combined with the fact that the empirical risk generally decreases for larger
values of R, this indicates that R might be a parameter to optimize in order
to reduce the population risk of FL algorithms. The results of this paper,
which extend straightforwardly to the heterogeneous data setting, are also
illustrated through numerical examples.
Comment: Extended version of paper accepted at ISIT 202
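The setting the abstract describes (K clients run local SGD on their own datasets, a server averages the models, and the cycle repeats for R rounds) is the familiar FedAvg template. A minimal NumPy sketch of that loop, with toy values for K, n, and R (all names, hyperparameters, and data here are illustrative assumptions, not from the paper):

```python
import numpy as np

def federated_averaging(client_data, rounds, local_steps=10, lr=0.1, seed=0):
    """Minimal FedAvg-style loop: each of K clients runs local SGD on its own
    dataset of size n; the server averages the K models for R rounds."""
    rng = np.random.default_rng(seed)
    dim = client_data[0][0].shape[1]
    w_global = np.zeros(dim)
    for _ in range(rounds):                       # R aggregation rounds
        local_models = []
        for X, y in client_data:                  # one local model per client
            w = w_global.copy()
            for _ in range(local_steps):          # local SGD on squared loss
                i = rng.integers(len(y))
                w -= lr * (X[i] @ w - y[i]) * X[i]
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)  # server-side averaging
    return w_global

# Toy setting: K = 4 clients, each holding n = 50 samples from one linear model.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = X @ w_true + 0.01 * rng.normal(size=50)
    clients.append((X, y))
w = federated_averaging(clients, rounds=20)
```

In this homogeneous linear toy the averaged model recovers the shared ground truth; the paper's point is about how the choice of `rounds` trades empirical risk against the generalization bound.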
Generalization error bounds for iterative learning algorithms with bounded updates
This paper explores the generalization characteristics of iterative learning
algorithms with bounded updates for non-convex loss functions, employing
information-theoretic techniques. Our key contribution is a novel bound for the
generalization error of these algorithms with bounded updates, extending beyond
the scope of previous works that only focused on Stochastic Gradient Descent
(SGD). Our approach introduces two main novelties: 1) we reformulate the mutual
information as the uncertainty of updates, providing a new perspective, and 2)
instead of using the chaining rule of mutual information, we employ a variance
decomposition technique to decompose information across iterations, allowing
for a simpler surrogate process. We analyze our generalization bound under
various settings and demonstrate improved bounds when the model dimension
increases at the same rate as the number of training data samples. To bridge
the gap between theory and practice, we also examine the previously observed
scaling behavior in large language models. Ultimately, our work takes a
further step toward developing practical generalization theories.
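The "bounded updates" condition in this abstract means each step from w_t to w_{t+1} has norm at most some constant. One simple, hypothetical way to realise that condition in practice is to clip each SGD update to a ball of fixed radius; the sketch below is an illustration of the assumption, not a construction from the paper:

```python
import numpy as np

def clipped_sgd(grad_fn, w0, steps, lr=0.5, max_update=0.5):
    """SGD with bounded updates: each step is rescaled so that
    ||w_{t+1} - w_t|| <= max_update, matching the bounded-updates setting."""
    w = np.asarray(w0, dtype=float).copy()
    for t in range(steps):
        u = -lr * grad_fn(w, t)
        norm = np.linalg.norm(u)
        if norm > max_update:
            u *= max_update / norm   # clip the whole update vector
        w = w + u
    return w

# Toy quadratic objective f(w) = 0.5 * ||w - target||^2, gradient w - target.
target = np.array([3.0, 4.0])
w_final = clipped_sgd(lambda w, t: w - target, np.zeros(2), steps=200)
```

Far from the optimum the iterates move at the fixed speed `max_update`; near it, the clipping is inactive and plain SGD takes over.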
Topological generalization bounds for discrete-time stochastic optimization algorithms
We present a novel set of rigorous and computationally efficient topology-based complexity notions that exhibit a strong correlation with the generalization gap in modern deep neural networks (DNNs). DNNs show remarkable generalization properties, yet the source of these capabilities remains elusive, defying established statistical learning theory. Recent studies have revealed that properties of training trajectories can be indicative of generalization. Building on this insight, state-of-the-art methods have leveraged the topology of these trajectories, particularly their fractal dimension, to quantify generalization. Most existing works compute this quantity by assuming continuous- or infinite-time training dynamics, complicating the development of practical estimators capable of accurately predicting generalization without access to test data. In this paper, we respect the discrete-time nature of training trajectories and investigate the underlying topological quantities that are amenable to topological data analysis tools. This leads to a new family of reliable topological complexity measures that provably bound the generalization error, eliminating the need for restrictive geometric assumptions. These measures are computationally friendly, enabling us to propose simple yet effective algorithms for computing generalization indices. Moreover, our flexible framework can be extended to different domains, tasks, and architectures. Our experimental results demonstrate that our new complexity measures correlate highly with generalization error in industry-standard architectures such as transformers and deep graph networks. Our approach consistently outperforms existing topological bounds across a wide range of datasets, models, and optimizers, highlighting the practical relevance and effectiveness of our complexity measures.
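One of the simplest discrete-time topological summaries used in this line of work is an alpha-weighted sum of 0-dimensional persistence lifetimes of the set of training iterates. For a finite point cloud of distinct points, those lifetimes coincide with the Euclidean minimum-spanning-tree edge lengths, so the quantity can be computed with SciPy alone. This is an illustrative sketch of that generic summary, not the paper's exact estimator:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def alpha_weighted_lifetimes(trajectory, alpha=1.0):
    """E_alpha summary of a discrete optimisation trajectory.

    For a finite cloud of distinct points, the 0-dimensional persistence
    lifetimes of the Vietoris-Rips filtration equal the edge lengths of a
    Euclidean minimum spanning tree, so E_alpha is the sum of those edge
    lengths raised to the power alpha. (Assumes all iterates are distinct:
    zero distances would be dropped by the sparse MST routine.)
    """
    D = squareform(pdist(trajectory))   # pairwise distances between iterates
    mst = minimum_spanning_tree(D)      # sparse matrix whose data are MST edges
    return float(np.sum(mst.data ** alpha))

# Toy trajectory: 100 random-walk "iterates" in 2-D.
rng = np.random.default_rng(0)
traj = np.cumsum(0.1 * rng.normal(size=(100, 2)), axis=0)
e1 = alpha_weighted_lifetimes(traj, alpha=1.0)
```

On three collinear points at 0, 1, and 3, the MST edges have lengths 1 and 2, so E_1 = 3, which is a convenient sanity check for the implementation.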
Bayesian optimisation for automated machine learning
In this thesis, we develop a rich family of efficient and performant Bayesian optimisation (BO) methods to tackle various AutoML tasks. We first introduce a fast information-theoretic BO method, FITBO, that overcomes the computational bottleneck of information-theoretic acquisition functions while maintaining their competitiveness on the noisy optimisation problems frequently encountered in AutoML. We then improve on the idea of local penalisation and develop an asynchronous batch BO solution, PLAyBOOK, to enable more efficient use of parallel computing resources when evaluation runtime varies across configurations. In view of the fact that many practical AutoML problems involve a mixture of multiple continuous and multiple categorical variables, we propose a new framework, named Continuous and Categorical BO (CoCaBO), to handle such mixed-type input spaces. CoCaBO merges the strengths of multi-armed bandits on categorical inputs with those of BO on continuous spaces, and uses a tailored kernel to permit information sharing across different categorical variables. We also extend CoCaBO by harnessing the concept of local trust regions to achieve competitive performance on high-dimensional optimisation problems with mixed input types.
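All of the methods in this paragraph build on the same outer loop: fit a probabilistic surrogate to the evaluations gathered so far, then pick the next configuration by maximising an acquisition function. A minimal, hypothetical GP-based version of that loop (plain expected improvement on a continuous 1-D space, using scikit-learn; not FITBO, PLAyBOOK, or CoCaBO themselves) might look like:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayesian_optimise(f, bounds, n_init=5, n_iter=15, seed=0):
    """Minimal GP-based BO loop (minimisation): fit a GP surrogate to the
    observations, then pick the next point by expected improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))       # random initial design
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, size=(256, 1))   # random candidate set
        mu, sd = gp.predict(cand, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sd, 1e-12)
        ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)  # expected improvement
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y)], y.min()

x_best, y_best = bayesian_optimise(lambda x: (x - 2.0) ** 2, bounds=(0.0, 5.0))
```

The thesis's contributions live inside this loop: the acquisition function (FITBO), how batches of points are selected asynchronously (PLAyBOOK), and how the surrogate handles mixed categorical/continuous inputs (CoCaBO).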
Beyond hyper-parameter tuning, we also investigate the novel use of BO in two important AutoML applications: black-box adversarial attack and neural architecture search. For the former (adversarial attack), we introduce the first BO-based attacks on image and graph classifiers; by actively querying the unknown victim classifier, our BO attacks can successfully find adversarial perturbations with many fewer attempts than competing baselines. They can thus serve as efficient tools for assessing the robustness of models suggested by AutoML. For the latter (neural architecture search), we leverage the Weisfeiler-Lehman graph kernel to empower our BO search strategy, NAS-BOWL, to naturally handle the directed acyclic graph representation of architectures. Besides achieving superior query efficiency, our NAS-BOWL also returns interpretable sub-features that help explain the architecture performance, thus marking the first step towards interpretable neural architecture search. Finally, we examine the most computation-intensive step in the AutoML pipeline: generalisation performance evaluation for a new configuration. We propose a cheap yet reliable test performance estimator based on a simple measure of training speed. It consistently outperforms various existing estimators on a wide range of architecture search spaces and can be easily incorporated into different search strategies, including BO, to improve their cost efficiency.
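The training-speed estimator mentioned at the end can be illustrated with a deliberately simple, hypothetical variant: score each configuration by (minus) the sum of its early training losses, so that configurations which drive the loss down faster are ranked higher. The configuration names and loss values below are invented for illustration:

```python
import numpy as np

def training_speed_score(loss_curve):
    """Hypothetical 'sum of early training losses' estimator: a lower summed
    loss means faster training, which is taken as a proxy for better final
    test performance, so the score is minus the sum."""
    return -float(np.sum(loss_curve))

# Rank three made-up configurations by their first-5-epoch training losses.
curves = {
    "cfg_a": [2.0, 1.5, 1.1, 0.9, 0.8],
    "cfg_b": [2.0, 1.9, 1.8, 1.7, 1.6],
    "cfg_c": [2.0, 1.2, 0.7, 0.5, 0.4],
}
ranking = sorted(curves, key=lambda c: training_speed_score(curves[c]),
                 reverse=True)   # fastest-training configuration first
```

Because the estimator only needs the training losses already produced during a few early epochs, it is essentially free compared with training each configuration to convergence and evaluating on held-out data.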