
    An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

    In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
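    The following is a minimal sketch of the general idea rather than the paper's exact criterion: a soft-max bandit policy whose single exploration parameter is varied over episodes in a simulated-annealing-like fashion. All names, the schedule, and the reward model are illustrative assumptions.

```python
import numpy as np

# Sketch only: soft-max exploration over empirical means with an annealed
# exploration parameter beta (higher beta = more exploitation). This is an
# illustration of the annealed-parameter idea, not the paper's algorithm.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])        # assumed Bernoulli arms
n_arms, n_episodes = len(true_means), 5000

counts = np.zeros(n_arms)
means = np.zeros(n_arms)

for t in range(1, n_episodes + 1):
    beta = np.log(t + 1)                      # "cooling": shift from exploration to exploitation
    prefs = beta * means
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()                      # soft-max over empirical means
    arm = rng.choice(n_arms, p=probs)
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update

print("pull counts:", counts, "estimated means:", means.round(2))
```

    With this kind of schedule the pull counts concentrate on the best arm as the episode count grows, which is the qualitative behavior the logarithmic-regret result describes.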

    Control of Sample Complexity and Regret in Bandits using Fractional Moments

    One key facet of learning through reinforcements is the dilemma between exploration to find profitable actions and exploitation to act optimally according to the observations already made. We analyze this explore/exploit situation on bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret.
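    A minimal sketch of the fractional-moment idea, not the authors' algorithm: each arm is indexed by a plug-in estimate of a fractional moment E[R^q] of its rewards, with the single parameter q steering the index. The exploration rule, reward model, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch only: greedy selection on an estimated fractional moment E[R^q],
# with decaying forced exploration. Rewards are assumed non-negative.

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.6, 0.65])
n_arms, horizon, q = len(true_means), 3000, 0.5

rewards = [[] for _ in range(n_arms)]

def fractional_index(samples, q):
    if not samples:
        return np.inf                         # force at least one pull per arm
    return np.mean(np.power(samples, q))      # plug-in estimate of E[R^q]

for t in range(horizon):
    if rng.random() < 1.0 / (t + 1):          # decaying forced exploration
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax([fractional_index(s, q) for s in rewards]))
    reward = rng.uniform(0.0, 2.0 * true_means[arm])   # uniform rewards with the stated means
    rewards[arm].append(reward)

print("pulls per arm:", [len(s) for s in rewards])
```

    For non-negative rewards the fractional moment preserves the ranking of the arms here, so the greedy index still concentrates pulls on the best arm while q shapes how conservatively it does so.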

    Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters

    The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information. Dynamically changing (non-stationary) bandit problems are particularly challenging because each change of the reward distributions may progressively degrade the performance of any fixed strategy. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel solution scheme for bandit problems with non-stationary normally distributed rewards. The scheme is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling Kalman filters, and on random sampling from these posteriors. Furthermore, it is able to track the better actions, thus supporting non-stationary bandit problems. Extensive experiments demonstrate that our scheme outperforms recently proposed bandit playing algorithms, not only in non-stationary environments, but also in stationary environments. Furthermore, our scheme is robust to inexact parameter settings. We thus believe that our methodology opens avenues for obtaining improved novel solutions.
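    A minimal sketch of the scheme described above, with made-up noise parameters: one Kalman filter per arm tracks a drifting Gaussian reward mean, and each round the next arm is chosen by Thompson-style random sampling from the filters' posteriors.

```python
import numpy as np

# Sketch only: sibling Kalman filters with posterior sampling for a
# non-stationary Gaussian bandit. Noise levels and drift are assumptions.

rng = np.random.default_rng(2)
n_arms, horizon = 3, 2000
obs_var, trans_var = 1.0, 0.01                 # observation and transition noise (assumed)

mu = np.zeros(n_arms)                           # posterior means, one filter per arm
var = np.full(n_arms, 100.0)                    # wide initial posterior variances
true_means = np.array([0.0, 0.5, 1.0])

for t in range(horizon):
    true_means += rng.normal(0.0, 0.05, n_arms) # slowly drifting environment
    var += trans_var                            # predict step: every posterior widens each round
    samples = rng.normal(mu, np.sqrt(var))      # random sampling from the sibling posteriors
    arm = int(np.argmax(samples))
    reward = rng.normal(true_means[arm], np.sqrt(obs_var))
    gain = var[arm] / (var[arm] + obs_var)      # Kalman update for the pulled arm only
    mu[arm] += gain * (reward - mu[arm])
    var[arm] *= (1.0 - gain)

print("posterior means:", mu.round(2), "true means:", true_means.round(2))
```

    The constant widening of every posterior is what lets the scheme keep tracking arms whose reward distributions change, instead of locking onto an arm that was best early on.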

    Why VAR Fails: Long Memory and Extreme Events in Financial Markets

    The Value-at-Risk (VaR) measure is based on only the second moment of a rates-of-return distribution. It is an insufficient risk performance measure, since it ignores both the higher moments of the pricing distributions, like skewness and kurtosis, and all the fractional moments resulting from the long-term dependencies (long memory) of dynamic market pricing. Not coincidentally, the VaR methodology also devotes insufficient attention to the truly extreme financial events, i.e., those events that are catastrophic and that cluster because of this long memory. Since the usual stationarity and i.i.d. assumptions of classical asset returns theory are not satisfied in reality, more attention should be paid to the measurement of the degree of dependence to determine the true risks to which any investment portfolio is exposed: the return distributions are time-varying, and skewness and kurtosis occur and change over time. Conventional mean-variance diversification does not apply when the tails of the return distributions are too fat, i.e., when many more than normal extreme events occur. Regrettably, also, Extreme Value Theory is empirically not valid, because it is based on the uncorroborated i.i.d. assumption.
    Keywords: Long memory, Value at Risk, Extreme Value Theory, Portfolio Management, Degrees of Persistence
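    An illustrative computation, not taken from the paper: two return series with the same standard deviation, one Gaussian and one fat-tailed, receive the same normal-theory VaR even though their realized extreme losses differ. The distributions and the 99% confidence level are arbitrary assumptions.

```python
import numpy as np

# Sketch only: second-moment (normal-theory) VaR versus an empirical tail
# quantile, for a Gaussian and a fat-tailed (Student-t) return series.

rng = np.random.default_rng(3)
n = 100_000
gaussian = rng.normal(0.0, 0.01, n)
fat_tailed = rng.standard_t(df=3, size=n)
fat_tailed *= 0.01 / fat_tailed.std()           # rescale to the same standard deviation

for name, r in [("gaussian", gaussian), ("fat-tailed", fat_tailed)]:
    var_normal = -(r.mean() - 2.326 * r.std())  # 99% VaR under a normality assumption
    var_empirical = -np.quantile(r, 0.01)       # 99% VaR from the empirical distribution
    print(f"{name:11s} normal-VaR {var_normal:.4f}  empirical-VaR {var_empirical:.4f}")
```

    The normal-theory figures coincide by construction, while the empirical tail loss of the fat-tailed series is noticeably larger, which is the kind of understatement of risk the abstract criticizes.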

    Low Power Dynamic Scheduling for Computing Systems

    This paper considers energy-aware control for a computing system with two states: "active" and "idle." In the active state, the controller chooses to perform a single task using one of multiple task processing modes. The controller then saves energy by choosing an amount of time for the system to be idle. These decisions affect processing time, energy expenditure, and an abstract attribute vector that can be used to model other criteria of interest (such as processing quality or distortion). The goal is to optimize time-average system performance. Applications of this model include a smart phone that makes energy-efficient computation and transmission decisions, a computer that processes tasks subject to rate, quality, and power constraints, and a smart grid energy manager that allocates resources in reaction to a time-varying energy price. The solution methodology of this paper uses the theory of optimization for renewal systems developed in our previous work. This paper is written in tutorial form and develops the main concepts of the theory using several detailed examples. It also highlights the relationship between online dynamic optimization and linear fractional programming. Finally, it provides exercises to help the reader learn the main concepts and apply them to their own optimization problems. This paper is an arXiv technical report, and is a preliminary version of material that will appear as a book chapter in an upcoming book on green communications and networking.
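    The sketch below is not the paper's online algorithm; it only illustrates the linear-fractional structure mentioned above. Minimizing time-average power over renewal frames is a ratio of expected energy to expected frame length, which a Dinkelbach-style iteration solves over a made-up set of processing modes.

```python
# Sketch only: pick one of several (energy, frame time) processing modes to
# minimize average power = expected energy / expected frame length.
# Mode costs are illustrative numbers, not from the paper.

modes = [(5.0, 2.0), (3.0, 3.0), (2.5, 5.0)]     # (energy in J, frame time in s)

theta = 0.0                                      # current guess of the optimal average power
for _ in range(20):
    # subproblem of the fractional program: minimize energy - theta * time
    best = min(modes, key=lambda m: m[0] - theta * m[1])
    theta = best[0] / best[1]                    # update the ratio with the chosen mode

print("chosen mode (energy, time):", best, "average power:", round(theta, 3))
```

    In the paper's setting the same ratio structure is optimized online, frame by frame, rather than over a fixed table of modes, but the subproblem of comparing "cost minus theta times time" per decision is the common thread.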