LIPIcs, Volume 251, ITCS 2023, Complete Volume
Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term
Deep Neural Networks (DNNs) generalization is known to be closely related to
the flatness of minima, leading to the development of Sharpness-Aware
Minimization (SAM) for seeking flatter minima and better generalization. In
this paper, we revisit the loss of SAM and propose a more general method,
called WSAM, by incorporating sharpness as a regularization term. We prove its
generalization bound through the combination of PAC and Bayes-PAC techniques,
and evaluate its performance on various public datasets. The results
demonstrate that WSAM achieves improved generalization, or is at least highly
competitive, compared to the vanilla optimizer, SAM and its variants. The code
is available at
https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.
Comment: 10 pages. Accepted as a conference paper at KDD '2
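The abstract gives no formulas; as a rough illustration of the general idea (sharpness measured via SAM's usual first-order inner maximization, then added as an explicitly weighted regularization term), a minimal numpy sketch might look like the following. The toy loss and all names are illustrative, not taken from the paper.

```python
import numpy as np

def loss(w):
    # Toy non-convex loss whose minima differ in flatness.
    return np.sin(3 * w[0]) + 0.1 * w[0] ** 2

def grad(w, eps=1e-5):
    # Finite-difference gradient, to keep the sketch dependency-free.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

def sharpness(w, rho=0.05):
    # SAM's first-order approximation of max_{||e|| <= rho} L(w+e) - L(w):
    # perturb along the normalized gradient direction.
    g = grad(w)
    e = rho * g / (np.linalg.norm(g) + 1e-12)
    return loss(w + e) - loss(w)

def wsam_objective(w, gamma=0.5, rho=0.05):
    # Sharpness enters as an explicit, weighted regularization term.
    return loss(w) + gamma * sharpness(w, rho)
```

With gamma = 0 this reduces to the vanilla loss; larger gamma penalizes sharp minima more strongly.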
Sequential Gibbs Posteriors with Applications to Principal Component Analysis
Gibbs posteriors are proportional to a prior distribution multiplied by an
exponentiated loss function, with a key tuning parameter weighting information
in the loss relative to the prior and providing a control of posterior
uncertainty. Gibbs posteriors provide a principled framework for
likelihood-free Bayesian inference, but in many situations, including only a
single tuning parameter inevitably leads to poor uncertainty quantification. In
particular, regardless of the value of the parameter, credible regions have
coverage far from the nominal frequentist level even in large samples. We propose a
sequential extension to Gibbs posteriors to address this problem. We prove the
proposed sequential posterior exhibits concentration and a Bernstein-von Mises
theorem, which holds under easy to verify conditions in Euclidean space and on
manifolds. As a byproduct, we obtain the first Bernstein-von Mises theorem for
traditional likelihood-based Bayesian posteriors on manifolds. All methods are
illustrated with an application to principal component analysis.
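The defining construction, a prior multiplied by an exponentiated loss with a weight controlling posterior concentration, can be sketched on a grid for a toy location parameter. The setup below (squared-error loss, Gaussian prior) is purely illustrative and is not the paper's PCA application.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)

theta = np.linspace(-5, 10, 2001)          # parameter grid
log_prior = -0.5 * (theta / 10.0) ** 2     # wide Gaussian prior, up to a constant

def gibbs_posterior(w):
    # Cumulative squared-error loss at each grid point.
    loss = ((data[:, None] - theta[None, :]) ** 2).sum(axis=0)
    log_post = log_prior - w * loss        # prior x exp(-w * loss), in logs
    log_post -= log_post.max()             # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Larger w concentrates the posterior more tightly around the loss minimizer.
narrow = gibbs_posterior(w=1.0)
wide = gibbs_posterior(w=0.01)
```

The single weight w trades off loss against prior, which is exactly the tuning parameter the abstract argues cannot, by itself, deliver calibrated uncertainty.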
MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting
We propose novel statistics which maximise the power of a two-sample test
based on the Maximum Mean Discrepancy (MMD), by adapting over the set of
kernels used in defining it. For finite sets, this reduces to combining
(normalised) MMD values under each of these kernels via a weighted soft
maximum. Exponential concentration bounds are proved for our proposed
statistics under the null and alternative. We further show how these kernels
can be chosen in a data-dependent but permutation-independent way, in a
well-calibrated test, avoiding data splitting. This technique applies more
broadly to general permutation-based MMD testing, and includes the use of deep
kernels with features learnt using unsupervised models such as auto-encoders.
We highlight the applicability of our MMD-FUSE test on both synthetic
low-dimensional and real-world high-dimensional data, and compare its
performance in terms of power against current state-of-the-art kernel tests.
Comment: 38 pages, 8 figures, 1 table
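For a finite kernel set, the combination described above is a weighted soft maximum (log-sum-exp) of per-kernel MMD estimates. A minimal numpy sketch under assumed details (Gaussian kernels, a biased MMD estimate, an arbitrary temperature; none of these specifics are from the paper) is:

```python
import numpy as np

def mmd2_biased(x, y, bandwidth):
    # Biased squared-MMD estimate with a Gaussian kernel.
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def soft_max_combine(stats, weights, temperature=1.0):
    # Weighted soft maximum (log-sum-exp) of per-kernel statistics;
    # weights are assumed to sum to one.
    z = temperature * np.asarray(stats) + np.log(np.asarray(weights))
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / temperature

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(1.0, 1.0, 200)   # mean-shifted: the null is false here

bandwidths = [0.5, 1.0, 2.0]
stats = [mmd2_biased(x, y, bw) for bw in bandwidths]
weights = [1 / len(bandwidths)] * len(bandwidths)
fused = soft_max_combine(stats, weights, temperature=10.0)
```

As the temperature grows, the fused statistic approaches the maximum over kernels; at low temperature it approaches a weighted average.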
Reinforcement learning in large state action spaces
Reinforcement learning (RL) is a promising framework for training intelligent agents which learn to optimize long term utility by directly interacting with the environment. Creating RL methods which scale to large state-action spaces is a critical problem towards ensuring real world deployment of RL systems. However, several challenges limit the applicability of RL to large scale settings. These include difficulties with exploration, low sample efficiency, computational intractability, task constraints like decentralization and lack of guarantees about important properties like performance, generalization and robustness in potentially unseen scenarios.
This thesis is motivated towards bridging the aforementioned gap. We propose several principled algorithms and frameworks for studying and addressing the above challenges in RL. The proposed methods cover a wide range of RL settings (single and multi-agent systems (MAS) with all the variations in the latter, prediction and control, model-based and model-free methods, value-based and policy-based methods). In this work we propose the first results on several different problems: e.g. tensorization of the Bellman equation, which allows exponential sample efficiency gains (Chapter 4), provable suboptimality arising from structural constraints in MAS (Chapter 3), combinatorial generalization results in cooperative MAS (Chapter 5), generalization results on observation shifts (Chapter 7), and learning deterministic policies in a probabilistic RL framework (Chapter 6). Our algorithms exhibit provably enhanced performance and sample efficiency along with better scalability. Additionally, we also shed light on generalization aspects of the agents under different frameworks. These properties have been driven by the use of several advanced tools (e.g. statistical machine learning, state abstraction, variational inference, tensor theory).
In summary, the contributions in this thesis significantly advance progress towards making RL agents ready for large scale, real world applications.
Federated Learning You May Communicate Less Often!
We investigate the generalization error of statistical learning models in a
Federated Learning (FL) setting. Specifically, we study the evolution of the
generalization error with the number of communication rounds between the
clients and the parameter server, i.e., the effect on the generalization error
of how often the local models as computed by the clients are aggregated at the
parameter server. We establish PAC-Bayes and rate-distortion theoretic bounds
on the generalization error that account explicitly for the effect of the
number of rounds, say R, in addition to the number K of participating devices
and the size n of the individual datasets. The bounds, which apply in their
generality for a large class of loss functions and learning algorithms, appear
to be the first of their kind for the FL setting. Furthermore, we apply our
bounds to FL-type Support Vector Machines (FSVM) and derive more explicit
bounds on the generalization error in this case. In particular, we show that
the generalization error of FSVM increases with R, suggesting that more
frequent communication with the parameter server diminishes the generalization
power of such learning algorithms. Combined with the fact that the empirical
risk generally decreases for larger values of R, this indicates that R might
be a parameter to optimize in order to minimize the population risk of FL
algorithms. Moreover, specialized to the case R = 1 (sometimes referred to as
"one-shot" FL or distributed learning), our bounds suggest that the
generalization error of the FL setting decreases faster than that of
centralized learning, thereby generalizing recent findings in this direction
to arbitrary loss functions and algorithms. The results of this paper are also
validated through experiments.
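To make the role of the number of communication rounds R concrete, here is a toy FedAvg-style loop for one-dimensional least squares. It only illustrates what "aggregating local models every round" means and makes no attempt to reproduce the paper's bounds; all names and constants are illustrative.

```python
import numpy as np

def fedavg(client_data, rounds, local_steps=10, lr=0.1):
    # Minimal FedAvg for 1-d least squares: each client holds (x, y) pairs,
    # runs local gradient descent on the slope w, and the parameter server
    # averages the local models once per communication round.
    w = 0.0
    for _ in range(rounds):
        local = []
        for x, y in client_data:
            wi = w
            for _ in range(local_steps):
                wi -= lr * 2 * np.mean(x * (wi * x - y))
            local.append(wi)
        w = np.mean(local)   # aggregation at the parameter server
    return w

rng = np.random.default_rng(2)
K, n = 5, 40
clients = []
for _ in range(K):
    x = rng.normal(size=n)
    y = 3.0 * x + rng.normal(scale=0.5, size=n)
    clients.append((x, y))

w_hat = fedavg(clients, rounds=5)
```

The paper's question is how the generalization error of the returned model depends on the number of times this aggregation step is executed.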
Robust Out-of-Distribution Detection in Deep Classifiers
Over the past decade, deep learning has gone from a fringe discipline of computer science
to a major driver of innovation across a large number of industries. The deployment of such
rapidly developing technology in safety-critical applications necessitates the careful study and
mitigation of potential failure modes. Indeed, many deep learning models are overconfident in
their predictions, are unable to flag out-of-distribution examples that are clearly unrelated to
the task they were trained on, and are vulnerable to adversarial attacks, where a small
change in the input leads to a large change in the model's prediction. In this dissertation, we
study the relation between these issues in deep learning based vision classifiers.
First, we benchmark various methods that have been proposed to enable deep learning methods
to detect out-of-distribution examples and we show that a classifier's predictive confidence
is well-suited for this task, if the classifier has had access to a large and diverse out-distribution
at train time. We theoretically investigate how different out-of-distribution detection methods
are related and show that several seemingly different approaches are actually modeling the
same core quantities.
In the second part we study the adversarial robustness of a classifier's confidence on
out-of-distribution data. Concretely, we show that several previous techniques for adversarial
robustness can be combined to create a model that inherits each method's strengths while
significantly reducing their respective drawbacks. In addition, we demonstrate that the
enforcement of adversarially robust low confidence on out-of-distribution data enhances the
inherent interpretability of the model by imbuing the classifier with certain generative
properties that can be used to query the model for counterfactual explanations of its decisions.
In the third part of this dissertation we study the problem of issuing mathematically
provable certificates for the adversarial robustness of a model's confidence on
out-of-distribution data. We develop two different approaches to this problem and show that
they have complementary strengths and weaknesses. The first method is easy to train, puts no
restrictions on the architecture that our classifier can use, and provably ensures that the
classifier will have low confidence on data very far away. However, it only provides
guarantees for very specific types of adversarial perturbations and only for data that is very
easy to distinguish from the in-distribution. The second approach works for more commonly
studied sets of adversarial perturbations and on much more challenging out-distribution data,
but puts heavy restrictions on the architecture that can be used and thus the achievable
accuracy. It also does not guarantee low confidence on asymptotically far away data. In the
final chapter of this dissertation we show how ideas from both of these techniques can be
combined in a way that preserves all of their strengths while inheriting none of their
weaknesses. Thus, this thesis outlines how to develop high-performing classifiers that
provably know when they do not know.
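The first part's observation, that a classifier's predictive confidence can itself serve as an out-of-distribution score, is commonly instantiated as the maximum-softmax-probability baseline, sketched below. This is a generic illustration, not the dissertation's exact method, and the threshold is arbitrary.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: high on inputs the classifier is
    # confident about, lower on inputs it is unsure about.
    return softmax(logits).max(axis=-1)

def flag_ood(logits, threshold=0.5):
    # Flag an example as out-of-distribution when confidence is low.
    return msp_score(logits) < threshold

confident = np.array([[8.0, 0.5, 0.2]])   # peaked logits
uncertain = np.array([[0.4, 0.3, 0.5]])   # near-uniform logits
```

The dissertation's point is that this score only works well when the classifier was exposed to a large, diverse out-distribution during training; adversarial robustness of the score is a separate question treated in the later parts.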
Supervised Learning in Time-dependent Environments with Performance Guarantees
In practical scenarios, it is common to learn from a sequence of related problems (tasks).
Such tasks are usually time-dependent in the sense that consecutive tasks are often
significantly more similar than those far apart in the sequence. Time-dependency is common
in multiple applications such as load forecasting, spam mail filtering, and face emotion
recognition. For instance, in the problem of load forecasting, the consumption patterns in
consecutive time periods are significantly more similar since human habits and weather
factors change gradually over time. Learning from a sequence of tasks holds promise to
enable accurate performance even with few samples per task by leveraging information from
different tasks. However, harnessing the benefits of learning from a sequence of tasks is
challenging since tasks are characterized by different underlying distributions.
Most existing techniques are designed for situations where the tasks' similarities
do not depend on their order in the sequence. Existing techniques designed for
time-dependent tasks adapt to changes between consecutive tasks accounting for a scalar
rate of change by using a carefully chosen parameter such as a learning rate or a weight
factor. However, the tasks' changes are commonly multidimensional, i.e., the
time-dependency often varies across different statistical characteristics describing the
tasks. For instance, in the problem of load forecasting, the statistical characteristics
related to weather factors often change differently from those related to generation.
In this dissertation, we establish methodologies for supervised learning from a sequence
of time-dependent tasks that effectively exploit information from all tasks,
provide multidimensional adaptation to tasks' changes, and provide computable tight
performance guarantees. We develop methods for supervised learning settings where
tasks arrive over time, including techniques for supervised classification under concept
drift (SCD) and techniques for continual learning (CL). In addition, we present techniques
for load forecasting that can adapt to time changes in consumption patterns
and assess intrinsic uncertainties in load demand. The numerical results show that the
proposed methodologies can significantly improve the performance of existing methods
on multiple benchmark datasets. This dissertation makes theoretical contributions
leading to efficient algorithms for multiple machine learning scenarios that provide
computable performance guarantees and performance superior to state-of-the-art techniques.
Model-based causal feature selection for general response types
Discovering causal relationships from observational data is a fundamental yet
challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a
method for causal feature selection which requires data from heterogeneous
settings and exploits that causal models are invariant. ICP has been extended
to general additive noise models and to nonparametric settings using
conditional independence tests. However, the latter often suffer from low power
(or poor type I error control) and additive noise models are not suitable for
applications in which the response is not measured on a continuous scale, but
reflects categories or counts. Here, we develop transformation-model (TRAM)
based ICP, allowing for continuous, categorical, count-type, and
uninformatively censored responses (these model classes, generally, do not
allow for identifiability when there is no exogenous heterogeneity). As an
invariance test, we propose TRAM-GCM based on the expected conditional
covariance between environments and score residuals with uniform asymptotic
level guarantees. For the special case of linear shift TRAMs, we also consider
TRAM-Wald, which tests invariance based on the Wald statistic. We provide an
open-source R package 'tramicp' and evaluate our approach on simulated data and
in a case study investigating causal features of survival in critically ill
patients.
Comment: Code available at https://github.com/LucasKook/tramicp.gi
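The proposed invariance test builds on the generalized covariance measure (GCM) idea: under invariance, the expected conditional covariance between environments and score residuals vanishes, so a normalized mean of residual products is asymptotically standard normal. A toy numpy sketch with oracle residuals follows; it is illustrative only, since the actual TRAM-GCM works with model-based score residuals from fitted transformation models.

```python
import numpy as np

def gcm_stat(resid_y, resid_e):
    # GCM-type statistic: sqrt(n) * mean(R_y * R_e) / sd(R_y * R_e),
    # approximately standard normal under conditional independence.
    prod = resid_y * resid_e
    n = len(prod)
    return np.sqrt(n) * prod.mean() / prod.std(ddof=1)

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
e = rng.integers(0, 2, size=n).astype(float)   # environment indicator

# Invariant case: Y depends on X only, so residuals decorrelate from E.
y_inv = 2.0 * x + rng.normal(size=n)
resid_y = y_inv - 2.0 * x            # oracle residuals, for the sketch only
resid_e = e - e.mean()               # E does not depend on X here
stat_inv = gcm_stat(resid_y, resid_e)

# Non-invariant case: the environment shifts Y directly.
y_shift = 2.0 * x + 3.0 * e + rng.normal(size=n)
stat_shift = gcm_stat(y_shift - 2.0 * x, resid_e)
```

Comparing the statistic to normal quantiles yields the uniform asymptotic level guarantee claimed in the abstract; a large value indicates a violated invariance, i.e., that the candidate set is not the causal parent set.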
Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations
In real-world reinforcement learning (RL) systems, various forms of impaired
observability can complicate matters. These situations arise when an agent is
unable to observe the most recent state of the system due to latency or lossy
channels, yet the agent must still make real-time decisions. This paper
introduces a theoretical investigation into efficient RL in control systems
where agents must act with delayed and missing state observations. We establish
near-optimal regret bounds for RL in both the delayed and missing observation settings.
Despite impaired observability posing significant challenges to the policy
class and planning, our results demonstrate that learning remains efficient,
with the regret bound optimally depending on the state-action size of the
original system. Additionally, we provide a characterization of the performance
of the optimal policy under impaired observability, comparing it to the optimal
value obtained with full observability.
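A standard way to act under a d-step observation delay, likely related to the setting studied here, is to augment the last observed state with the actions taken since it was generated; the policy then operates on this information state. A minimal sketch follows, where the class name, policy interface, and buffer handling are assumptions for illustration, not the paper's algorithm.

```python
from collections import deque

class DelayedObsAgent:
    # With a d-step observation delay, the information state is the most
    # recently observed system state plus the d pending actions taken since.
    # The deque's maxlen drops the oldest pending action once its effect is
    # reflected in the newly arriving delayed observation.
    def __init__(self, delay, policy):
        self.delay = delay
        self.policy = policy
        self.pending = deque(maxlen=delay)

    def act(self, delayed_obs):
        aug_state = (delayed_obs, tuple(self.pending))
        action = self.policy(aug_state)
        self.pending.append(action)
        return action
```

The cost of this augmentation is a policy class over a larger state space, which is exactly the planning difficulty the abstract says does not, in the end, prevent efficient learning.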