An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits
In this paper, we propose an information-theoretic exploration strategy for
stochastic, discrete multi-armed bandits that achieves optimal regret. Our
strategy is based on the value of information criterion. This criterion
measures the trade-off between policy information and obtainable rewards. High
amounts of policy information are associated with exploration-dominant searches
of the space and yield high rewards. Low amounts of policy information favor
the exploitation of existing knowledge. Information, in this criterion, is
quantified by a parameter that can be varied during search. We demonstrate that
a simulated-annealing-like update of this parameter, with a sufficiently fast
cooling schedule, leads to an optimal regret that is logarithmic with respect
to the number of episodes.
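The simulated-annealing-like parameter update described above can be sketched with a standard Boltzmann (softmax) bandit whose temperature is cooled over episodes. This is only a minimal illustration of the annealing idea, not the paper's value-of-information criterion; the Bernoulli arms, the logarithmic cooling schedule, and the seed are all illustrative assumptions.

```python
import math
import random

def annealed_softmax_bandit(true_means, episodes=5000, seed=0):
    """Boltzmann (softmax) exploration whose temperature is cooled over
    episodes, loosely mirroring the annealed information parameter
    described above. Arms are assumed Bernoulli for simplicity."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n
    total_reward = 0.0
    for t in range(1, episodes + 1):
        tau = 1.0 / math.log(t + 1)  # illustrative cooling schedule
        prefs = [q / tau for q in estimates]
        m = max(prefs)  # subtract max for numerical stability
        weights = [math.exp(p - m) for p in prefs]
        z = sum(weights)
        # sample an arm from the softmax distribution
        r, acc, arm = rng.random() * z, 0.0, n - 1
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                arm = i
                break
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, total_reward

est, total = annealed_softmax_bandit([0.2, 0.5, 0.8])
```

As the temperature falls, the policy shifts from exploration-dominant (near-uniform) pulls to exploitation of the empirically best arm.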
Control of Sample Complexity and Regret in Bandits using Fractional Moments
One key facet of learning through reinforcements is the dilemma between exploration to find profitable actions and exploitation to act optimally according to the observations already made. We analyze this explore/exploit situation on bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret.
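The paper's exact estimator family is not reproduced in this abstract, but the role of a single fractional exponent can be illustrated with a toy arm-selection rule that ranks arms by a fractional moment of their observed rewards. Everything below is an illustrative stand-in under that assumption, not the authors' algorithm.

```python
def fractional_moment_choice(reward_history, q=0.5):
    """Pick an arm by the q-th fractional moment of its observed rewards.

    q is the single control knob alluded to above: q = 1 recovers the
    ordinary empirical mean, while smaller q damps the influence of
    occasional large rewards. (Illustrative stand-in only.)"""
    scores = []
    for rewards in reward_history:
        if not rewards:
            scores.append(float("inf"))  # force initial exploration
        else:
            mean_q = sum(r ** q for r in rewards) / len(rewards)
            scores.append(mean_q ** (1.0 / q))  # rescale back to reward units
    return max(range(len(scores)), key=scores.__getitem__)

best = fractional_moment_choice([[1.0, 0.9], [0.2, 0.3], []])
# the empty history wins: unexplored arms are tried first
```

Sweeping q then trades off how aggressively the rule commits to the current best-looking arm, which is the kind of single-parameter control the abstract describes.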
Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters
The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information. Dynamically changing (non-stationary) bandit problems are particularly challenging because each change of the reward distributions may progressively degrade the performance of any fixed strategy. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel solution scheme for bandit problems with non-stationary normally distributed rewards. The scheme is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling Kalman filters and on random sampling from these posteriors. Furthermore, it is able to track the better actions, thus supporting non-stationary bandit problems. Extensive experiments demonstrate that our scheme outperforms recently proposed bandit playing algorithms, not only in non-stationary environments but in stationary environments as well. Furthermore, our scheme is robust to inexact parameter settings. We thus believe that our methodology opens avenues for obtaining improved novel solutions.
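The core loop, one scalar Kalman filter per arm with posterior sampling, can be sketched as follows. This is a minimal reconstruction from the abstract's description, assuming known observation and drift variances and a random-walk reward mean; the paper's exact hyperparameter updates may differ.

```python
import random

def sibling_kalman_bandit(pull, n_arms, steps, obs_var=1.0, drift_var=0.01, seed=0):
    """One scalar Kalman filter per arm tracks a drifting normal reward mean.
    Each step, sample from every arm's posterior and pull the arm with the
    largest sample (Thompson-style). obs_var and drift_var are assumed known."""
    rng = random.Random(seed)
    mu = [0.0] * n_arms      # posterior means
    var = [100.0] * n_arms   # wide initial posterior variances
    choices = []
    for _ in range(steps):
        samples = [rng.gauss(mu[i], var[i] ** 0.5) for i in range(n_arms)]
        a = samples.index(max(samples))
        r = pull(a)
        # predict: every hidden mean performs a random walk, so all
        # posterior variances diffuse each step
        for i in range(n_arms):
            var[i] += drift_var
        # update the pulled arm with the standard scalar Kalman correction
        k = var[a] / (var[a] + obs_var)
        mu[a] += k * (r - mu[a])
        var[a] *= (1.0 - k)
        choices.append(a)
    return mu, choices

mu, choices = sibling_kalman_bandit(
    pull=lambda a: 2.0 if a == 1 else 0.0, n_arms=2, steps=300)
```

Because unpulled arms' posterior variances keep growing, neglected arms are eventually resampled, which is what lets the scheme track non-stationary reward means.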
Why VAR Fails: Long Memory and Extreme Events in Financial Markets
The Value-at-Risk (VaR) measure is based on only the second moment of a rates-of-return distribution. It is an insufficient risk performance measure, since it ignores both the higher moments of the pricing distributions, like skewness and kurtosis, and all the fractional moments resulting from the long-term dependencies (long memory) of dynamic market pricing. Not coincidentally, the VaR methodology also devotes insufficient attention to the truly extreme financial events, i.e., those events that are catastrophic and that cluster because of this long memory. Since the usual stationarity and i.i.d. assumptions of classical asset returns theory are not satisfied in reality, more attention should be paid to the measurement of the degree of dependence to determine the true risks to which any investment portfolio is exposed: the return distributions are time-varying, and skewness and kurtosis occur and change over time. Conventional mean-variance diversification does not apply when the tails of the return distributions are too fat, i.e., when many more than normal extreme events occur. Regrettably, Extreme Value Theory is also empirically not valid, because it is based on the uncorroborated i.i.d. assumption.
Keywords: long memory, Value at Risk, Extreme Value Theory, portfolio management, degrees of persistence
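The variance-only criticism can be made concrete with a toy comparison of a normal-quantile VaR against the empirical loss quantile of a heavy-tailed sample. The simulated Student-t returns below are an assumption for illustration; no data from the paper is used.

```python
import math
import random

def parametric_var(returns):
    """Normal (second-moment-only) 99% VaR: the measure criticized above."""
    n = len(returns)
    mu = sum(returns) / n
    sd = math.sqrt(sum((r - mu) ** 2 for r in returns) / (n - 1))
    z = 2.326  # approximate one-sided 99% normal quantile
    return -(mu - z * sd)

def historical_var(returns, alpha=0.99):
    """Empirical 99% VaR: the loss read directly off the sorted sample."""
    losses = sorted(-r for r in returns)
    return losses[int(alpha * len(losses)) - 1]

rng = random.Random(1)

def student_t(df=4):
    # heavy-tailed draw: standard normal / sqrt(chi-square / df)
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

returns = [0.01 * student_t() for _ in range(10000)]
```

On fat-tailed data the empirical 99% loss quantile typically exceeds the normal-based estimate, which is precisely the tail-risk underestimation the abstract attributes to VaR.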
Low Power Dynamic Scheduling for Computing Systems
This paper considers energy-aware control for a computing system with two
states: "active" and "idle." In the active state, the controller chooses to
perform a single task using one of multiple task processing modes. The
controller then saves energy by choosing an amount of time for the system to be
idle. These decisions affect processing time, energy expenditure, and an
abstract attribute vector that can be used to model other criteria of interest
(such as processing quality or distortion). The goal is to optimize time
average system performance. Applications of this model include a smart phone
that makes energy-efficient computation and transmission decisions, a computer
that processes tasks subject to rate, quality, and power constraints, and a
smart grid energy manager that allocates resources in reaction to a time
varying energy price. The solution methodology of this paper uses the theory of
optimization for renewal systems developed in our previous work. This paper is
written in tutorial form and develops the main concepts of the theory using
several detailed examples. It also highlights the relationship between online
dynamic optimization and linear fractional programming. Finally, it provides
exercises to help the reader learn the main concepts and apply them to their
own optimizations. This paper is an arxiv technical report, and is a
preliminary version of material that will appear as a book chapter in an
upcoming book on green communications and networking.
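The relationship to linear fractional programming mentioned above can be illustrated with Dinkelbach's classical method for minimizing a ratio of averages. The (energy, duration) options below are hypothetical per-renewal-frame choices, not taken from the paper, and this sketch handles only a finite decision set.

```python
def dinkelbach_min_ratio(options, iters=50):
    """Minimize cost/time over a finite set of (cost, time) options via
    Dinkelbach's method for fractional programs: at the current ratio
    guess theta, pick the option minimizing cost - theta * time, then
    set theta to that option's ratio; repeat until theta stops moving."""
    theta = options[0][0] / options[0][1]
    for _ in range(iters):
        cost, time = min(options, key=lambda o: o[0] - theta * o[1])
        new_theta = cost / time
        if abs(new_theta - theta) < 1e-12:
            break
        theta = new_theta
    return theta

# hypothetical per-frame choices: (energy in joules, frame duration in seconds)
avg_power = dinkelbach_min_ratio([(10.0, 2.0), (6.0, 3.0), (9.0, 9.0)])
# the (9.0, 9.0) option yields the smallest time-average power, 1.0
```

The per-iteration subproblem is linear in the decision, which is the structural link between time-average renewal optimization and linear fractional programming that the tutorial highlights.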