642 research outputs found
An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.Comment: Entrop
Parameter Selection and Pre-Conditioning for a Graph Form Solver
In a recent paper, Parikh and Boyd describe a method for solving a convex
optimization problem, where each iteration involves evaluating a proximal
operator and projection onto a subspace. In this paper we address the critical
practical issues of how to select the proximal parameter in each iteration, and
how to scale the original problem variables, so as the achieve reliable
practical performance. The resulting method has been implemented as an
open-source software package called POGS (Proximal Graph Solver), that targets
multi-core and GPU-based systems, and has been tested on a wide variety of
practical problems. Numerical results show that POGS can solve very large
problems (with, say, more than a billion coefficients in the data), to modest
accuracy in a few tens of seconds. As just one example, a radiation treatment
planning problem with around 100 million coefficients in the data can be solved
in a few seconds, as compared to around one hour with an interior-point method.Comment: 28 pages, 1 figure, 1 open source implementatio
A Bandit Approach to Maximum Inner Product Search
There has been substantial research on sub-linear time approximate algorithms
for Maximum Inner Product Search (MIPS). To achieve fast query time,
state-of-the-art techniques require significant preprocessing, which can be a
burden when the number of subsequent queries is not sufficiently large to
amortize the cost. Furthermore, existing methods do not have the ability to
directly control the suboptimality of their approximate results with
theoretical guarantees. In this paper, we propose the first approximate
algorithm for MIPS that does not require any preprocessing, and allows users to
control and bound the suboptimality of the results. We cast MIPS as a Best Arm
Identification problem, and introduce a new bandit setting that can fully
exploit the special structure of MIPS. Our approach outperforms
state-of-the-art methods on both synthetic and real-world datasets.Comment: AAAI 201
The Hierarchical Discrete Pursuit Learning Automaton: A Novel Scheme With Fast Convergence and Epsilon-Optimality
Author's accepted manuscript© 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting /republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Since the early 1960s, the paradigm of learning automata (LA) has experienced abundant interest. Arguably, it has also served as the foundation for the phenomenon and field of reinforcement learning (RL). Over the decades, new concepts and fundamental principles have been introduced to increase the LA’s speed and accuracy. These include using probability updating functions, discretizing the probability space, and using the “Pursuit” concept. Very recently, the concept of incorporating “structure” into the ordering of the LA’s actions has improved both the speed and accuracy of the corresponding hierarchical machines, when the number of actions is large. This has led to the ϵ -optimal hierarchical continuous pursuit LA (HCPA). This article pioneers the inclusion of all the above-mentioned phenomena into a new single LA, leading to the novel hierarchical discretized pursuit LA (HDPA). Indeed, although the previously proposed HCPA is powerful, its speed has an impediment when any action probability is close to unity, because the updates of the components of the probability vector are correspondingly smaller when any action probability becomes closer to unity. We propose here, the novel HDPA, where we infuse the phenomenon of discretization into the action probability vector’s updating functionality, and which is invoked recursively at every stage of the machine’s hierarchical structure. This discretized functionality does not possess the same impediment, because discretization prohibits it. We demonstrate the HDPA’s robustness and validity by formally proving the ϵ -optimality by utilizing the moderation property. We also invoke the submartingale characteristic at every level, to prove that the action probability of the optimal action converges to unity as time goes to infinity. Apart from the new machine being ϵ -optimal, the numerical results demonstrate that the number of iterations required for convergence is significantly reduce...acceptedVersio
Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems.
International audienceIn the framework of fully cooperative multi-agent systems, independent (non-communicative) agents that learn by reinforcement must overcome several difficulties to manage to coordinate. This paper identifies several challenges responsible for the non-coordination of independent agents: Pareto-selection, nonstationarity, stochasticity, alter-exploration and shadowed equilibria. A selection of multi-agent domains is classified according to those challenges: matrix games, Boutilier's coordination game, predators pursuit domains and a special multi-state game. Moreover the performance of a range of algorithms for independent reinforcement learners is evaluated empirically. Those algorithms are Q-learning variants: decentralized Q-learning, distributed Q-learning, hysteretic Q-learning, recursive FMQ and WoLF PHC. An overview of the learning algorithms' strengths and weaknesses against each challenge concludes the paper and can serve as a basis for choosing the appropriate algorithm for a new domain. Furthermore, the distilled challenges may assist in the design of new learning algorithms that overcome these problems and achieve higher performance in multi-agent applications
Implicit regularization in AI meets generalized hardness of approximation in optimization -- Sharp results for diagonal linear networks
Understanding the implicit regularization imposed by neural network
architectures and gradient based optimization methods is a key challenge in
deep learning and AI. In this work we provide sharp results for the implicit
regularization imposed by the gradient flow of Diagonal Linear Networks (DLNs)
in the over-parameterized regression setting and, potentially surprisingly,
link this to the phenomenon of phase transitions in generalized hardness of
approximation (GHA). GHA generalizes the phenomenon of hardness of
approximation from computer science to, among others, continuous and robust
optimization. It is well-known that the -norm of the gradient flow of
DLNs with tiny initialization converges to the objective function of basis
pursuit. We improve upon these results by showing that the gradient flow of
DLNs with tiny initialization approximates minimizers of the basis pursuit
optimization problem (as opposed to just the objective function), and we obtain
new and sharp convergence bounds w.r.t.\ the initialization size. Non-sharpness
of our results would imply that the GHA phenomenon would not occur for the
basis pursuit optimization problem -- which is a contradiction -- thus implying
sharpness. Moreover, we characterize minimizer of the
basis pursuit problem is chosen by the gradient flow whenever the minimizer is
not unique. Interestingly, this depends on the depth of the DLN
- …