
    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Get PDF

    Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term

    Full text link
    The generalization of Deep Neural Networks (DNNs) is known to be closely related to the flatness of minima, leading to the development of Sharpness-Aware Minimization (SAM) for seeking flatter minima and better generalization. In this paper, we revisit the loss of SAM and propose a more general method, called WSAM, by incorporating sharpness as a regularization term. We prove its generalization bound through the combination of PAC and Bayes-PAC techniques, and evaluate its performance on various public datasets. The results demonstrate that WSAM achieves improved generalization, or is at least highly competitive, compared to the vanilla optimizer, SAM and its variants. The code is available at https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers. Comment: 10 pages. Accepted as a conference paper at KDD '23.
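    To make the regularization view concrete, here is a minimal numerical sketch in which sharpness is the loss gap along SAM's first-order ascent direction and is added to the objective with a weight. The toy loss, the finite-difference gradient, and the names rho and gamma are assumptions of this sketch, not WSAM's published algorithm.

```python
import numpy as np

def loss(w):
    # Toy non-convex loss standing in for the training loss L(w).
    return 0.5 * np.sum(w ** 2) + 0.1 * np.sum(np.sin(5.0 * w))

def grad(w, h=1e-6):
    # Central-difference gradient; adequate for a toy illustration.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (loss(w + e) - loss(w - e)) / (2.0 * h)
    return g

def sharpness_regularized_objective(w, rho=0.05, gamma=0.5):
    # SAM-style ascent direction within a ball of radius rho.
    g = grad(w)
    eps_star = rho * g / (np.linalg.norm(g) + 1e-12)
    # Sharpness enters as a weighted additive regularization term.
    return loss(w) + gamma * (loss(w + eps_star) - loss(w))

print(sharpness_regularized_objective(np.array([1.0, -0.5])))
```

    Larger gamma penalizes sharp minima more heavily; gamma = 0 recovers the plain loss.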

    Sequential Gibbs Posteriors with Applications to Principal Component Analysis

    Full text link
    Gibbs posteriors are proportional to a prior distribution multiplied by an exponentiated loss function, with a key tuning parameter weighting information in the loss relative to the prior and providing control of posterior uncertainty. Gibbs posteriors provide a principled framework for likelihood-free Bayesian inference, but in many situations, including a single tuning parameter inevitably leads to poor uncertainty quantification. In particular, regardless of the value of the parameter, credible regions have coverage far from the nominal frequentist level even in large samples. We propose a sequential extension to Gibbs posteriors to address this problem. We prove that the proposed sequential posterior exhibits concentration and a Bernstein-von Mises theorem, which holds under easy-to-verify conditions in Euclidean space and on manifolds. As a byproduct, we obtain the first Bernstein-von Mises theorem for traditional likelihood-based Bayesian posteriors on manifolds. All methods are illustrated with an application to principal component analysis.
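    For intuition, a Gibbs posterior has density proportional to prior(theta) * exp(-omega * loss(theta; data)), where omega trades the loss against the prior. The grid-based sketch below is a hypothetical one-dimensional illustration with a standard-normal prior and squared-error loss; it is not the paper's sequential construction.

```python
import numpy as np

def gibbs_posterior_grid(theta_grid, data, omega):
    # Unnormalised log-density: log prior(theta) - omega * loss(theta; data).
    log_prior = -0.5 * theta_grid ** 2                     # standard-normal prior
    loss = np.array([np.mean((data - t) ** 2) for t in theta_grid])
    log_post = log_prior - omega * loss
    post = np.exp(log_post - log_post.max())               # stabilise, then normalise
    return post / np.trapz(post, theta_grid)

# Larger omega concentrates the posterior around the loss minimiser,
# which is how the tuning parameter controls posterior uncertainty.
grid = np.linspace(-3.0, 3.0, 601)
data = np.random.default_rng(0).normal(1.0, 1.0, size=50)
density = gibbs_posterior_grid(grid, data, omega=5.0)
```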

    MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting

    Full text link
    We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD), by adapting over the set of kernels used in defining it. For finite sets, this reduces to combining (normalised) MMD values under each of these kernels via a weighted soft maximum. Exponential concentration bounds are proved for our proposed statistics under the null and alternative. We further show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting. This technique applies more broadly to general permutation-based MMD testing, and includes the use of deep kernels with features learnt using unsupervised models such as auto-encoders. We highlight the applicability of our MMD-FUSE test on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests. Comment: 38 pages, 8 figures, 1 table.
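    The "weighted soft maximum" combination can be sketched as a log-sum-exp over per-kernel MMD estimates. The sketch below uses a biased V-statistic estimate of the squared MMD with Gaussian kernels and omits the normalisation and permutation calibration used in the paper; all names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    # k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)), computed pairwise.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth):
    # Biased V-statistic estimate of the squared MMD under one kernel.
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

def fuse_statistic(x, y, bandwidths, weights, lam=1.0):
    # Weighted soft maximum (log-sum-exp) over the finite kernel set.
    vals = np.array([mmd2(x, y, b) for b in bandwidths])
    return np.log(np.sum(weights * np.exp(lam * vals))) / lam

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (100, 2))
y = rng.normal(0.5, 1.0, (100, 2))
stat = fuse_statistic(x, y, bandwidths=[0.5, 1.0, 2.0], weights=np.ones(3) / 3)
```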

    Reinforcement learning in large state action spaces

    Get PDF
    Reinforcement learning (RL) is a promising framework for training intelligent agents which learn to optimize long-term utility by directly interacting with the environment. Creating RL methods which scale to large state-action spaces is a critical problem towards ensuring real-world deployment of RL systems. However, several challenges limit the applicability of RL to large-scale settings. These include difficulties with exploration, low sample efficiency, computational intractability, task constraints like decentralization, and a lack of guarantees about important properties like performance, generalization, and robustness in potentially unseen scenarios. This thesis is motivated towards bridging these gaps. We propose several principled algorithms and frameworks for studying and addressing the above challenges in RL. The proposed methods cover a wide range of RL settings (single- and multi-agent systems (MAS) with all the variations in the latter, prediction and control, model-based and model-free methods, value-based and policy-based methods). In this work we propose the first results on several different problems, e.g., tensorization of the Bellman equation, which allows exponential sample-efficiency gains (Chapter 4); provable suboptimality arising from structural constraints in MAS (Chapter 3); combinatorial generalization results in cooperative MAS (Chapter 5); generalization results on observation shifts (Chapter 7); and learning deterministic policies in a probabilistic RL framework (Chapter 6). Our algorithms exhibit provably enhanced performance and sample efficiency along with better scalability. Additionally, we shed light on generalization aspects of the agents under different frameworks. These properties have been driven by the use of several advanced tools (e.g., statistical machine learning, state abstraction, variational inference, and tensor theory). In summary, the contributions in this thesis significantly advance progress towards making RL agents ready for large-scale, real-world applications.
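    As generic background for the Bellman equation mentioned above, the sketch below runs classical tabular value iteration on a toy MDP; it shows only the standard backup operator, not the thesis's tensorized variant or any of its multi-agent methods.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    # P: transition tensor of shape (S, A, S); R: rewards of shape (S, A).
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new
```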

    Federated Learning You May Communicate Less Often!

    Full text link
    We investigate the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, we study the evolution of the generalization error with the number of communication rounds between the clients and the parameter server, i.e., the effect on the generalization error of how often the local models as computed by the clients are aggregated at the parameter server. We establish PAC-Bayes and rate-distortion theoretic bounds on the generalization error that account explicitly for the effect of the number of rounds, say $R \in \mathbb{N}$, in addition to the number of participating devices $K$ and the individual dataset size $n$. The bounds, which apply in their generality for a large class of loss functions and learning algorithms, appear to be the first of their kind for the FL setting. Furthermore, we apply our bounds to FL-type Support Vector Machines (FSVM) and derive (more) explicit bounds on the generalization error in this case. In particular, we show that the generalization error of FSVM increases with $R$, suggesting that more frequent communication with the parameter server diminishes the generalization power of such learning algorithms. Combined with the fact that the empirical risk generally decreases for larger values of $R$, this indicates that $R$ might be a parameter to optimize in order to minimize the population risk of FL algorithms. Moreover, specialized to the case $R=1$ (sometimes referred to as "one-shot" FL or distributed learning), our bounds suggest that the generalization error of the FL setting decreases faster than that of centralized learning by a factor of $\mathcal{O}(\sqrt{\log(K)/K})$, thereby generalizing recent findings in this direction to arbitrary loss functions and algorithms. The results of this paper are also validated on some experiments.
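    To make the roles of $R$, $K$, and $n$ concrete, here is a minimal FedAvg-style loop: K clients each hold n samples, run local updates, and the server averages their models once per round, R times in total. The least-squares objective and all names are assumptions of this sketch, not the paper's FSVM setting.

```python
import numpy as np

def local_update(w, X, y, steps=10, lr=0.1):
    # A client's local training: plain gradient steps on least squares.
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def fedavg(clients, dim, R):
    # R communication rounds over K = len(clients) devices.
    w = np.zeros(dim)
    for _ in range(R):
        w = np.mean([local_update(w, X, y) for X, y in clients], axis=0)
    return w

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)
clients = []
for _ in range(5):                      # K = 5 clients, n = 40 samples each
    X = rng.normal(size=(40, 3))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=40)))
w_hat = fedavg(clients, dim=3, R=20)
```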

    Robust Out-of-Distribution Detection in Deep Classifiers

    Get PDF
    Over the past decade, deep learning has gone from a fringe discipline of computer science to a major driver of innovation across a large number of industries. The deployment of such rapidly developing technology in safety-critical applications necessitates the careful study and mitigation of potential failure modes. Indeed, many deep learning models are overconfident in their predictions, are unable to flag out-of-distribution examples that are clearly unrelated to the task they were trained on, and are vulnerable to adversarial perturbations, where a small change in the input leads to a large change in the model’s prediction. In this dissertation, we study the relation between these issues in deep learning based vision classifiers. First, we benchmark various methods that have been proposed to enable deep learning methods to detect out-of-distribution examples, and we show that a classifier’s predictive confidence is well-suited for this task if the classifier has had access to a large and diverse out-distribution at train time. We theoretically investigate how different out-of-distribution detection methods are related and show that several seemingly different approaches are actually modeling the same core quantities. In the second part, we study the adversarial robustness of a classifier’s confidence on out-of-distribution data. Concretely, we show that several previous techniques for adversarial robustness can be combined to create a model that inherits each method’s strengths while significantly reducing their respective drawbacks. In addition, we demonstrate that the enforcement of adversarially robust low confidence on out-of-distribution data enhances the inherent interpretability of the model by imbuing the classifier with certain generative properties that can be used to query the model for counterfactual explanations for its decisions. In the third part of this dissertation, we study the problem of issuing mathematically provable certificates for the adversarial robustness of a model’s confidence on out-of-distribution data. We develop two different approaches to this problem and show that they have complementary strengths and weaknesses. The first method is easy to train, puts no restrictions on the architecture that our classifier can use, and provably ensures that the classifier will have low confidence on data very far away. However, it only provides guarantees for very specific types of adversarial perturbations and only for data that is very easy to distinguish from the in-distribution. The second approach works for more commonly studied sets of adversarial perturbations and on much more challenging out-distribution data, but puts heavy restrictions on the architecture that can be used and thus the achievable accuracy. It also does not guarantee low confidence on asymptotically far away data. In the final chapter of this dissertation, we show how ideas from both of these techniques can be combined in a way that preserves all of their strengths while inheriting none of their weaknesses. Thus, this thesis outlines how to develop high-performing classifiers that provably know when they do not know.
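    The confidence-based detection discussed above is often instantiated as the standard maximum-softmax-probability baseline, sketched below; the threshold and names are illustrative, not the dissertation's exact certified methods.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: low values suggest an out-of-distribution input.
    return softmax(logits).max(axis=1)

def flag_ood(logits, threshold=0.5):
    # Flag inputs whose predictive confidence falls below the threshold.
    return msp_score(logits) < threshold
```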

    Supervised Learning in Time-dependent Environments with Performance Guarantees

    Get PDF
    In practical scenarios, it is common to learn from a sequence of related problems (tasks). Such tasks are usually time-dependent in the sense that consecutive tasks are often significantly more similar. Time-dependency is common in multiple applications such as load forecasting, spam mail filtering, and face emotion recognition. For instance, in the problem of load forecasting, the consumption patterns in consecutive time periods are significantly more similar since human habits and weather factors change gradually over time. Learning from a sequence of tasks holds promise to enable accurate performance even with few samples per task by leveraging information from different tasks. However, harnessing the benefits of learning from a sequence of tasks is challenging since tasks are characterized by different underlying distributions. Most existing techniques are designed for situations where the tasks’ similarities do not depend on their order in the sequence. Existing techniques designed for time-dependent tasks adapt to changes between consecutive tasks accounting for a scalar rate of change by using a carefully chosen parameter such as a learning rate or a weight factor. However, the tasks’ changes are commonly multidimensional, i.e., the time-dependency often varies across different statistical characteristics describing the tasks. For instance, in the problem of load forecasting, the statistical characteristics related to weather factors often change differently from those related to generation. In this dissertation, we establish methodologies for supervised learning from a sequence of time-dependent tasks that effectively exploit information from all tasks, provide multidimensional adaptation to tasks’ changes, and provide computable tight performance guarantees. We develop methods for supervised learning settings where tasks arrive over time, including techniques for supervised classification under concept drift (SCD) and techniques for continual learning (CL). In addition, we present techniques for load forecasting that can adapt to time changes in consumption patterns and assess intrinsic uncertainties in load demand. The numerical results show that the proposed methodologies can significantly improve on the performance of existing methods across multiple benchmark datasets. This dissertation makes theoretical contributions leading to efficient algorithms for multiple machine learning scenarios that provide computable performance guarantees and superior performance to state-of-the-art techniques.
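    A minimal sketch of the scalar-rate-of-change baseline described above: a single forgetting factor down-weights past tasks when fitting a linear model over the task sequence. The dissertation's contribution is multidimensional adaptation, which this deliberately simple sketch does not attempt; names and the least-squares choice are assumptions.

```python
import numpy as np

def sequential_weighted_fit(tasks, forgetting=0.9, ridge=1e-8):
    # tasks: iterable of (X, y) pairs arriving over time.
    A, b, w = None, None, None
    for X, y in tasks:
        if A is None:
            A, b = X.T @ X, X.T @ y
        else:
            # One scalar controls how quickly old tasks are forgotten.
            A = forgetting * A + X.T @ X
            b = forgetting * b + X.T @ y
        w = np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)
    return w   # fit after the most recent task
```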

    Model-based causal feature selection for general response types

    Full text link
    Discovering causal relationships from observational data is a fundamental yet challenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a method for causal feature selection which requires data from heterogeneous settings and exploits that causal models are invariant. ICP has been extended to general additive noise models and to nonparametric settings using conditional independence tests. However, the latter often suffer from low power (or poor type I error control), and additive noise models are not suitable for applications in which the response is not measured on a continuous scale, but reflects categories or counts. Here, we develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses (these model classes, generally, do not allow for identifiability when there is no exogenous heterogeneity). As an invariance test, we propose TRAM-GCM, based on the expected conditional covariance between environments and score residuals, with uniform asymptotic level guarantees. For the special case of linear shift TRAMs, we also consider TRAM-Wald, which tests invariance based on the Wald statistic. We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients. Comment: Code available at https://github.com/LucasKook/tramicp.git
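    For intuition about the invariance test, a generic scalar GCM-style statistic checks whether the product of environment residuals and score residuals has mean zero. The sketch below is a simplified univariate version; the actual TRAM-GCM in 'tramicp' handles model-based score residuals and multivariate environments, so treat this as an assumption-laden sketch.

```python
import numpy as np
from scipy import stats

def gcm_test(res_env, res_score):
    # Under invariance, the residual products should have mean zero;
    # the normalised mean is asymptotically standard normal.
    prod = res_env * res_score
    n = len(prod)
    stat = np.sqrt(n) * prod.mean() / prod.std(ddof=1)
    pval = 2.0 * stats.norm.sf(abs(stat))
    return stat, pval
```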

    Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations

    Full text link
    In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{\mathrm{poly}(H)\, SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.
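    One way to picture the delayed-observation setting: wrap an environment so the agent always acts on the state from `delay` steps ago. The reset/step interface assumed below is hypothetical and only illustrates the information structure, not the paper's algorithms or regret analysis.

```python
from collections import deque

class DelayedObservationEnv:
    """Expose the state from `delay` steps ago while the true state evolves."""

    def __init__(self, env, delay):
        self.env = env                        # assumed to offer reset()/step()
        self.buffer = deque(maxlen=delay + 1)

    def reset(self):
        self.buffer.clear()
        self.buffer.append(self.env.reset())
        return self.buffer[0]

    def step(self, action):
        state, reward, done = self.env.step(action)
        self.buffer.append(state)
        # Oldest buffered state = most recent observation available to the agent.
        return self.buffer[0], reward, done
```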