An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support
Consider the problem of a controller sampling sequentially from a finite
number of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$
and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from
population $i$ the $k$-th time it is sampled. It is assumed that for each
fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. uniform random
variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i$ and $b_i$)
unknown to the controller. The objective is to have a policy for
deciding, based on available data, from which of the $N$ populations to
sample at any time $n$ so as to maximize the expected sum of outcomes
of $n$ samples or, equivalently, to minimize the regret due to lack of
information about the parameters $\{a_i\}$ and $\{b_i\}$. In this paper, we
present a simple inflated sample mean (ISM) type policy that is asymptotically
optimal in the sense that its regret achieves the asymptotic lower bound of
Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are
given.
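As a concrete illustration of an index policy of this flavor, the sketch below inflates each arm's sample mean by a bonus that shrinks with that arm's sample count. The logarithmic bonus and the constant `c` are illustrative assumptions, not the exact inflation derived in the paper.

```python
import math
import random

def ism_uniform_bandit(arms, horizon, c=1.0, seed=0):
    """Play an inflated-sample-mean index over uniform arms.

    `arms` lists the (a_i, b_i) support endpoints, hidden from the
    policy itself; `c` scales an illustrative logarithmic bonus."""
    rng = random.Random(seed)
    n_arms = len(arms)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def pull(i):
        a, b = arms[i]
        x = rng.uniform(a, b)
        counts[i] += 1
        sums[i] += x
        return x

    total = 0.0
    for i in range(n_arms):                  # sample each arm once
        total += pull(i)
    for n in range(n_arms, horizon):
        # index = empirical mean + inflation shrinking in the count
        best = max(
            range(n_arms),
            key=lambda i: sums[i] / counts[i]
            + c * math.sqrt(math.log(n + 1) / counts[i]),
        )
        total += pull(best)
    return total, counts

total, counts = ism_uniform_bandit([(0.0, 1.0), (0.0, 2.0)], horizon=2000)
```

With these supports the second arm has the larger mean, so the index policy ends up sampling it far more often while the bonus still guarantees every arm is revisited.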
Asymptotically Optimal Sequential Experimentation Under Generalized Ranking
We consider the classical problem of a controller activating (or
sampling) sequentially from a finite number, $N$, of populations,
specified by unknown distributions. Over some time horizon, at each time $n$, the controller wishes to select a population to sample, with the
goal of sampling from a population that optimizes some "score" function of its
distribution, e.g., maximizing the expected sum of outcomes or minimizing
variability. We define a class of \textit{Uniformly Fast (UF)} sampling
policies and show, under mild regularity conditions, that there is an
asymptotic lower bound for the expected total number of sub-optimal population
activations. Then, we provide sufficient conditions under which a UCB policy is
UF and asymptotically optimal, since it attains this lower bound. Explicit
solutions are provided for a number of examples of interest, including general
score functionals on unconstrained Pareto distributions (of potentially
infinite mean) and uniform distributions of unknown support. Additional
results on bandits of Normal distributions are also provided.
Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret
The purpose of this paper is to provide further understanding of the
structure of the sequential allocation ("stochastic multi-armed bandit", or
MAB) problem by establishing probability-one finite horizon bounds and
convergence rates for the sample (or "pseudo") regret associated with two
simple classes of allocation policies.
For any slowly increasing function $g$, subject to mild regularity
constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated
Sample Mean) that achieve a measure of regret of order $g(n)$ almost surely
as $n \to \infty$, bounded from above and below. Additionally, almost sure upper
and lower bounds on the remainder term are established. In the constructions
herein, the function $g$ effectively controls the "exploration" of the
classical "exploration/exploitation" tradeoff.
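One way the forcing construction can be realized is sketched below: an exploration pull is forced whenever some arm's sample count lags behind $g(n)/N$, and the empirical best arm is played otherwise. The choice $g = \log$ and the Bernoulli rewards are illustrative assumptions, not the paper's exact construction.

```python
import math
import random

def forcing_policy(means, horizon, g=math.log, seed=0):
    """g-Forcing sketch: force an exploration pull whenever some arm's
    sample count lags behind g(n)/N; otherwise play the empirical best
    arm.  Bernoulli(means[i]) rewards are an illustrative choice."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for n in range(1, horizon + 1):
        least = min(range(n_arms), key=lambda i: counts[i])
        if counts[least] == 0 or counts[least] < g(n) / n_arms:
            arm = least                      # forced exploration step
        else:                                # greedy exploitation step
            arm = max(range(n_arms), key=lambda i: sums[i] / counts[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = forcing_policy([0.2, 0.8], horizon=1000)
```

Because $g$ grows slowly, the number of forced pulls per arm is only of order $g(n)/N$; everything else is exploitation, which is what keeps the regret of order $g(n)$.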
Inventory Control Involving Unknown Demand of Discrete Nonperishable Items - Analysis of a Newsvendor-based Policy
Inventory control with unknown demand distribution is considered, with
emphasis placed on the case involving discrete nonperishable items. We focus on
an adaptive policy which in every period uses, as much as possible, the optimal
newsvendor ordering quantity for the empirical distribution learned up to that
period. The policy is assessed using the regret criterion, which measures the
price paid for ambiguity in the demand distribution over $n$ periods. When there
are guarantees on the latter's separation from the critical newsvendor
parameter, a constant upper bound on the regret can be found.
Without any prior information on the demand distribution, we show that the
regret does not grow faster than the rate $n^{1/2+\epsilon}$ for any
$\epsilon > 0$. In view of a known lower bound, this is almost the best one could
hope for. Simulation studies involving this along with other policies are also
conducted.
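The adaptive rule itself is simple to state: in each period, order the critical-ratio quantile of the empirical demand distribution. A minimal sketch, assuming unit price, unit cost, and zero salvage value (so the critical ratio is $(p - c)/p$):

```python
import math

def empirical_newsvendor_order(past_demands, price, cost):
    """Order the newsvendor quantile of the empirical demand
    distribution: the smallest q with F_hat(q) >= (price - cost)/price,
    the critical ratio under unit price/cost and zero salvage."""
    critical_ratio = (price - cost) / price
    d = sorted(past_demands)
    # smallest k with empirical CDF (k + 1)/n >= critical_ratio
    k = max(math.ceil(critical_ratio * len(d)) - 1, 0)
    return d[k]

q = empirical_newsvendor_order(list(range(1, 9)), price=8, cost=2)
# critical ratio 0.75 over demands 1..8 -> order the 6th order statistic
```

As more demand observations accumulate, the empirical quantile converges to the true newsvendor quantity, which is what drives the regret analysis in the abstract.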
Dynamic Pricing in a Dual Market Environment
This paper is concerned with the determination of pricing strategies for a
firm that in each period of a finite horizon receives replenishment quantities
of a single product which it sells in two markets, e.g., a long-distance market
and an on-site market. The key difference between the two markets is that the
long-distance market provides for a one period delay in demand fulfillment. In
contrast, on-site orders must be filled immediately as the customer is at the
physical on-site location. We model the demands in consecutive periods as
independent random variables whose distributions depend on the item's price
in accordance with two general stochastic demand functions: additive or
multiplicative.
The firm uses a single pool of inventory to fulfill demands from both
markets. We investigate properties of the structure of the dynamic pricing
strategy that maximizes the total expected discounted profit over the finite
time horizon, under fixed or controlled replenishment conditions. Further, we
provide conditions under which one market may be the preferred sales outlet
over the other.
Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Consider the problem of sampling sequentially from a finite number of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$ and
$k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the
$k$-th time it is sampled. It is assumed that for each fixed $i$,
$\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables,
with unknown mean $\mu_i$ and unknown variance $\sigma^2_i$.
The objective is to have a policy for deciding from which of the $N$
populations to sample at any time $n$ so as to maximize the
expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret
due to lack of information about the parameters $\{\mu_i\}$ and $\{\sigma^2_i\}$. In this
paper, we present a simple inflated sample mean (ISM) index policy that is
asymptotically optimal in the sense of Theorem 4 below. This resolves a
standing open problem from Burnetas and Katehakis (1996). Additionally, finite
horizon regret bounds are given.
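To illustrate the shape of such an index, the sketch below inflates the empirical mean by a bonus built from the empirical variance, the current time $n$, and the arm's sample count $t$. This particular bonus is an assumption made for illustration; the exact index and its constants are given in the paper.

```python
import math

def ism_index(mean, var, t, n):
    """Illustrative inflated-sample-mean index for a normal arm with
    unknown mean and variance: empirical mean plus a bonus that grows
    with the time n and shrinks with the arm's sample count t.  The
    exact inflation used in the paper may differ from this sketch."""
    if t < 3:
        return float("inf")      # force a minimum number of samples
    return mean + math.sqrt(var * (n ** (2.0 / (t - 2)) - 1.0))
```

The policy then simply pulls the arm with the largest index; under-sampled arms receive a large bonus, and the bonus vanishes as an arm's count grows.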
Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations
This paper introduces the first asymptotically optimal strategy for a
multi-armed bandit (MAB) model under side constraints. The side constraints model
situations in which bandit activations are limited by the availability of
certain resources that are replenished at a constant rate. The main result
involves the derivation of an asymptotic lower bound for the regret of feasible
uniformly fast policies and the construction of policies that achieve this
lower bound, under pertinent conditions. Further, we provide the explicit form
of such policies for the case in which the unknown distributions are Normal
with unknown means and known variances, for the case of Normal distributions
with unknown means and unknown variances, and for the case of arbitrary discrete
distributions with finite support.
Cash-Flow Based Dynamic Inventory Management
Small-to-medium size enterprises (SMEs), including many startup firms, need
to manage interrelated flows of cash and inventories of goods. In this paper,
we model a firm that can finance its inventory (ordered or manufactured) with
loans in order to meet random demand which in general may not be time
stationary. The firm earns interest on its cash on hand and pays interest on
its debt. The objective is to maximize the expected value of the firm's
working capital at the end of a finite planning horizon. Our study shows that
the optimal ordering policy is characterized by a pair of threshold variables
for each period, as a function of the initial state of the period. Further, upper
and lower bounds for the threshold values are developed using two
simple-to-compute ordering policies. Based on these bounds, we provide an
efficient algorithm to compute the two threshold values. Since the underlying
state space is two-dimensional, which leads to high computational complexity of
the optimization algorithm, we also derive upper bounds for the optimal value
function by reducing the optimization problem to one dimension. Subsequently,
it is shown that policies of similar structure are optimal when the loan and
deposit interest rates are piecewise linear functions, when there is a maximal
loan limit, and when unsatisfied demand is backordered. Finally, further
managerial insights are provided with numerical studies.
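To make the cash-flow dynamics concrete, the sketch below evaluates a simple base-stock ordering rule by Monte Carlo in a model of this kind: orders are financed from cash first and by a loan when cash runs out, cash earns deposit interest, and debt pays loan interest. The base-stock rule and all parameters are illustrative stand-ins, not the optimal two-threshold policy characterized in the paper.

```python
import random

def terminal_wealth(base_stock, horizon, demand, price, cost,
                    r_deposit, r_loan, seed=0, n_runs=2000):
    """Monte Carlo estimate of expected terminal cash under a
    base-stock ordering rule financed by loans when cash is short."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        inv, cash = 0, 0.0
        for _ in range(horizon):
            q = max(base_stock - inv, 0)      # order up to base stock
            cash -= cost * q                  # loan implied if cash < 0
            inv += q
            d = rng.choice(demand)            # i.i.d. discrete demand
            sold = min(inv, d)                # unmet demand lost here
            inv -= sold
            cash += price * sold
            rate = r_deposit if cash >= 0 else r_loan
            cash *= 1.0 + rate                # interest on cash or debt
        total += cash + cost * inv            # value leftover stock at cost
    return total / n_runs

baseline = terminal_wealth(0, 5, [1, 2, 3], price=10, cost=4,
                           r_deposit=0.01, r_loan=0.05)
wealth = terminal_wealth(2, 5, [1, 2, 3], price=10, cost=4,
                         r_deposit=0.01, r_loan=0.05)
```

Ordering nothing yields zero terminal wealth, while financing a modest stock with loans is profitable whenever the margin outweighs the loan interest, which is the trade-off the optimal thresholds balance.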
A Comparative Analysis of the Successive Lumping and the Lattice Path Counting Algorithms
This article provides a comparison of the successive lumping (SL) methodology
with the popular lattice path counting algorithm for obtaining rate matrices of
queueing models satisfying the quasi-birth-and-death (QBD) structure. The two
methodologies are compared both in terms of applicability requirements and
numerical complexity by analyzing their performance on the same classical
queueing models.
The main findings are: i) when both methods are applicable, SL-based
algorithms outperform the lattice path counting algorithm (LPCA); ii) there are
important classes of problems (e.g., models with (level) non-homogeneous rates
or with finite state spaces) for which the SL methodology is applicable and for
which the LPCA cannot be used; iii) another main advantage of successive
lumping algorithms over LPCAs is that the former include a method to compute
the steady state distribution using the computed rate matrix.
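For reference, the rate matrix $R$ that both methods target satisfies the QBD matrix-quadratic equation $A_0 + R A_1 + R^2 A_2 = 0$, where $A_0$, $A_1$, $A_2$ hold the up, local, and down transition rates. The sketch below solves it by the classical fixed-point iteration, which is neither the SL nor the LPCA algorithm, just the textbook baseline they are compared against:

```python
import numpy as np

def qbd_rate_matrix(A0, A1, A2, tol=1e-12, max_iter=10_000):
    """Find the minimal nonnegative solution R of
    A0 + R A1 + R^2 A2 = 0 via R <- -(A0 + R^2 A2) A1^{-1}."""
    A1_inv = np.linalg.inv(A1)
    R = np.zeros_like(A0)
    for _ in range(max_iter):
        R_next = -(A0 + R @ R @ A2) @ A1_inv
        if np.max(np.abs(R_next - R)) < tol:
            return R_next
        R = R_next
    raise RuntimeError("iteration did not converge")

# scalar sanity check: an M/M/1 queue with arrival rate 1 and service
# rate 2 is a QBD with 1x1 blocks, where R equals rho = 1/2
A0 = np.array([[1.0]])    # up-transitions (arrivals)
A1 = np.array([[-3.0]])   # local: -(lambda + mu)
A2 = np.array([[2.0]])    # down-transitions (services)
R = qbd_rate_matrix(A0, A1, A2)
```

Once $R$ is available, the level-$k$ stationary probabilities follow as $\pi_k = \pi_0 R^k$, which is the step SL additionally packages with its rate-matrix computation.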
On the Solution to a Countable System of Equations Arising in Stochastic Processes
In this paper we develop a method to compute the solution to a countable
(finite or infinite) set of equations that occurs in many different fields
including Markov processes that model queueing systems, birth-and-death
processes and inventory systems. The method provides a fast and exact
computation of the inverse of the matrix of the coefficients of the system. In
contrast, alternative inversion techniques perform much more slowly and work only
for finite-size matrices. Furthermore, we provide a procedure to construct the
eigenvalues of the matrix under consideration.
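In the birth-and-death special case mentioned above, the countable system reduces to the balance equations $\lambda_k \pi_k = \mu_{k+1} \pi_{k+1}$, which a forward recursion solves directly. A small truncated sketch (the M/M/1 rates below are an illustrative choice, not the general method of the paper):

```python
def birth_death_stationary(birth, death, levels):
    """Solve the truncated birth-death balance equations
    lambda_k * pi_k = mu_{k+1} * pi_{k+1} by forward recursion,
    then normalize.  `birth(k)` and `death(k)` return the rates."""
    pi = [1.0]
    for k in range(levels - 1):
        pi.append(pi[-1] * birth(k) / death(k + 1))
    total = sum(pi)
    return [p / total for p in pi]

# M/M/1 with lambda = 1, mu = 2: pi_k -> (1 - rho) rho^k, rho = 1/2
pi = birth_death_stationary(lambda k: 1.0, lambda k: 2.0, levels=30)
```

Each unnormalized $\pi_{k+1}$ is obtained from $\pi_k$ in one multiplication, mirroring how an exact solution can avoid a general-purpose matrix inversion entirely.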