
    An Analysis of the Value of Information when Exploring Stochastic, Discrete Multi-Armed Bandits

    In this paper, we propose an information-theoretic exploration strategy for stochastic, discrete multi-armed bandits that achieves optimal regret. Our strategy is based on the value of information criterion. This criterion measures the trade-off between policy information and obtainable rewards. High amounts of policy information are associated with exploration-dominant searches of the space and yield high rewards. Low amounts of policy information favor the exploitation of existing knowledge. Information, in this criterion, is quantified by a parameter that can be varied during search. We demonstrate that a simulated-annealing-like update of this parameter, with a sufficiently fast cooling schedule, leads to an optimal regret that is logarithmic with respect to the number of episodes.
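    The following is a minimal sketch of the general idea rather than the paper's exact criterion: a soft-max bandit policy whose single exploration parameter is varied over episodes in a simulated-annealing-like fashion. All names, the schedule, and the reward model are illustrative assumptions.

```python
import numpy as np

# Sketch only: soft-max exploration over empirical means with an annealed
# exploration parameter beta (higher beta = more exploitation). This is an
# illustration of the annealed-parameter idea, not the paper's algorithm.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])        # assumed Bernoulli arms
n_arms, n_episodes = len(true_means), 5000

counts = np.zeros(n_arms)
means = np.zeros(n_arms)

for t in range(1, n_episodes + 1):
    beta = np.log(t + 1)                      # "cooling": shift from exploration to exploitation
    prefs = beta * means
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()                      # soft-max over empirical means
    arm = rng.choice(n_arms, p=probs)
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update

print("pull counts:", counts, "estimated means:", means.round(2))
```

    With this kind of schedule the pull counts concentrate on the best arm as the episode count grows, which is the qualitative behavior the logarithmic-regret result describes.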

    Control of Sample Complexity and Regret in Bandits using Fractional Moments

    One key facet of learning through reinforcements is the dilemma between exploration to find profitable actions and exploitation to act optimally according to the observations already made. We analyze this explore/exploit situation on bandit problems in stateless environments. We propose a family of learning algorithms for bandit problems based on fractional expectations of the rewards acquired. The algorithms can be controlled, through a single parameter, to behave optimally with respect to either sample complexity or regret. The family is theoretically shown to contain algorithms that converge on an ε-optimal arm and achieve O(n) sample complexity, a theoretical minimum. The family is also shown to include algorithms that achieve the optimal logarithmic regret.
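    A minimal sketch of the fractional-moment idea, not the authors' algorithm: each arm is indexed by a plug-in estimate of a fractional moment E[R^q] of its rewards, with the single parameter q steering the index. The exploration rule, reward model, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch only: greedy selection on an estimated fractional moment E[R^q],
# with decaying forced exploration. Rewards are assumed non-negative.

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.6, 0.65])
n_arms, horizon, q = len(true_means), 3000, 0.5

rewards = [[] for _ in range(n_arms)]

def fractional_index(samples, q):
    if not samples:
        return np.inf                         # force at least one pull per arm
    return np.mean(np.power(samples, q))      # plug-in estimate of E[R^q]

for t in range(horizon):
    if rng.random() < 1.0 / (t + 1):          # decaying forced exploration
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax([fractional_index(s, q) for s in rewards]))
    reward = rng.uniform(0.0, 2.0 * true_means[arm])   # uniform rewards with the stated means
    rewards[arm].append(reward)

print("pulls per arm:", [len(s) for s in rewards])
```

    For non-negative rewards the fractional moment preserves the ranking of the arms here, so the greedy index still concentrates pulls on the best arm while q shapes how conservatively it does so.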

    Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters

    The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information. Dynamically changing (non-stationary) bandit problems are particularly challenging because each change of the reward distributions may progressively degrade the performance of any fixed strategy. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel solution scheme for bandit problems with non-stationary normally distributed rewards. The scheme is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling Kalman filters, and on random sampling from these posteriors. Furthermore, it is able to track the better actions, thus supporting non-stationary bandit problems. Extensive experiments demonstrate that our scheme outperforms recently proposed bandit playing algorithms, not only in non-stationary environments, but also in stationary environments. Furthermore, our scheme is robust to inexact parameter settings. We thus believe that our methodology opens avenues for obtaining improved novel solutions.
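    A minimal sketch of the scheme described above, with made-up noise parameters: one Kalman filter per arm tracks a drifting Gaussian reward mean, and each round the next arm is chosen by Thompson-style random sampling from the filters' posteriors.

```python
import numpy as np

# Sketch only: sibling Kalman filters with posterior sampling for a
# non-stationary Gaussian bandit. Noise levels and drift are assumptions.

rng = np.random.default_rng(2)
n_arms, horizon = 3, 2000
obs_var, trans_var = 1.0, 0.01                 # observation and transition noise (assumed)

mu = np.zeros(n_arms)                           # posterior means, one filter per arm
var = np.full(n_arms, 100.0)                    # wide initial posterior variances
true_means = np.array([0.0, 0.5, 1.0])

for t in range(horizon):
    true_means += rng.normal(0.0, 0.05, n_arms) # slowly drifting environment
    var += trans_var                            # predict step: every posterior widens each round
    samples = rng.normal(mu, np.sqrt(var))      # random sampling from the sibling posteriors
    arm = int(np.argmax(samples))
    reward = rng.normal(true_means[arm], np.sqrt(obs_var))
    gain = var[arm] / (var[arm] + obs_var)      # Kalman update for the pulled arm only
    mu[arm] += gain * (reward - mu[arm])
    var[arm] *= (1.0 - gain)

print("posterior means:", mu.round(2), "true means:", true_means.round(2))
```

    The constant widening of every posterior is what lets the scheme keep tracking arms whose reward distributions change, instead of locking onto an arm that was best early on.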

    Why VAR Fails: Long Memory and Extreme Events in Financial Markets

    The Value-at-Risk (VaR) measure is based on only the second moment of a rates-of-return distribution. It is an insufficient risk performance measure, since it ignores both the higher moments of the pricing distributions, like skewness and kurtosis, and all the fractional moments resulting from the long-term dependencies (long memory) of dynamic market pricing. Not coincidentally, the VaR methodology also devotes insufficient attention to the truly extreme financial events, i.e., those events that are catastrophic and that cluster because of this long memory. Since the usual stationarity and i.i.d. assumptions of classical asset returns theory are not satisfied in reality, more attention should be paid to the measurement of the degree of dependence to determine the true risks to which any investment portfolio is exposed: the return distributions are time-varying, and skewness and kurtosis occur and change over time. Conventional mean-variance diversification does not apply when the tails of the return distributions are too fat, i.e., when many more than normal extreme events occur. Regrettably, also, Extreme Value Theory is empirically not valid, because it is based on the uncorroborated i.i.d. assumption.
    Keywords: Long memory, Value at Risk, Extreme Value Theory, Portfolio Management, Degrees of Persistence
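    An illustrative computation, not taken from the paper: two return series with the same standard deviation, one Gaussian and one fat-tailed, receive the same normal-theory VaR even though their realized extreme losses differ. The distributions and the 99% confidence level are arbitrary assumptions.

```python
import numpy as np

# Sketch only: second-moment (normal-theory) VaR versus an empirical tail
# quantile, for a Gaussian and a fat-tailed (Student-t) return series.

rng = np.random.default_rng(3)
n = 100_000
gaussian = rng.normal(0.0, 0.01, n)
fat_tailed = rng.standard_t(df=3, size=n)
fat_tailed *= 0.01 / fat_tailed.std()           # rescale to the same standard deviation

for name, r in [("gaussian", gaussian), ("fat-tailed", fat_tailed)]:
    var_normal = -(r.mean() - 2.326 * r.std())  # 99% VaR under a normality assumption
    var_empirical = -np.quantile(r, 0.01)       # 99% VaR from the empirical distribution
    print(f"{name:11s} normal-VaR {var_normal:.4f}  empirical-VaR {var_empirical:.4f}")
```

    The normal-theory figures coincide by construction, while the empirical tail loss of the fat-tailed series is noticeably larger, which is the kind of understatement of risk the abstract criticizes.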

    Low Power Dynamic Scheduling for Computing Systems

    This paper considers energy-aware control for a computing system with two states: "active" and "idle." In the active state, the controller chooses to perform a single task using one of multiple task processing modes. The controller then saves energy by choosing an amount of time for the system to be idle. These decisions affect processing time, energy expenditure, and an abstract attribute vector that can be used to model other criteria of interest (such as processing quality or distortion). The goal is to optimize time-average system performance. Applications of this model include a smart phone that makes energy-efficient computation and transmission decisions, a computer that processes tasks subject to rate, quality, and power constraints, and a smart grid energy manager that allocates resources in reaction to a time-varying energy price. The solution methodology of this paper uses the theory of optimization for renewal systems developed in our previous work. This paper is written in tutorial form and develops the main concepts of the theory using several detailed examples. It also highlights the relationship between online dynamic optimization and linear fractional programming. Finally, it provides exercises to help the reader learn the main concepts and apply them to their own optimization problems. This paper is an arXiv technical report, and is a preliminary version of material that will appear as a book chapter in an upcoming book on green communications and networking.
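    The sketch below is not the paper's online algorithm; it only illustrates the linear-fractional structure mentioned above. Minimizing time-average power over renewal frames is a ratio of expected energy to expected frame length, which a Dinkelbach-style iteration solves over a made-up set of processing modes.

```python
# Sketch only: pick one of several (energy, frame time) processing modes to
# minimize average power = expected energy / expected frame length.
# Mode costs are illustrative numbers, not from the paper.

modes = [(5.0, 2.0), (3.0, 3.0), (2.5, 5.0)]     # (energy in J, frame time in s)

theta = 0.0                                      # current guess of the optimal average power
for _ in range(20):
    # subproblem of the fractional program: minimize energy - theta * time
    best = min(modes, key=lambda m: m[0] - theta * m[1])
    theta = best[0] / best[1]                    # update the ratio with the chosen mode

print("chosen mode (energy, time):", best, "average power:", round(theta, 3))
```

    In the paper's setting the same ratio structure is optimized online, frame by frame, rather than over a fixed table of modes, but the subproblem of comparing "cost minus theta times time" per decision is the common thread.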