An Asymptotically Optimal Policy for Uniform Bandits of Unknown Support
Consider the problem of a controller sampling sequentially from a finite
number of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$
and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from
population $i$ the $k$-th time it is sampled. It is assumed that for each
fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. uniform random
variables over some interval $[a_i, b_i]$, with the support (i.e., $a_i$ and $b_i$)
unknown to the controller. The objective is to have a policy for
deciding, based on available data, from which of the $N$ populations to
sample at any time $n$ so as to maximize the expected sum of outcomes
of $n$ samples or, equivalently, to minimize the regret due to lack of
information about the parameters $\{a_i\}$ and $\{b_i\}$. In this paper, we
present a simple inflated sample mean (ISM) type policy that is asymptotically
optimal in the sense that its regret achieves the asymptotic lower bound of
Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are
given.
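As a concrete illustration of an index policy of this flavor, the sketch below inflates each arm's sample mean by a bonus that shrinks with that arm's sample count. The logarithmic bonus and the constant `c` are illustrative assumptions, not the exact inflation derived in the paper.

```python
import math
import random

def ism_uniform_bandit(arms, horizon, c=1.0, seed=0):
    """Play an inflated-sample-mean index over uniform arms.

    `arms` lists the (a_i, b_i) support endpoints, hidden from the
    policy itself; `c` scales an illustrative logarithmic bonus."""
    rng = random.Random(seed)
    n_arms = len(arms)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def pull(i):
        a, b = arms[i]
        x = rng.uniform(a, b)
        counts[i] += 1
        sums[i] += x
        return x

    total = 0.0
    for i in range(n_arms):                  # sample each arm once
        total += pull(i)
    for n in range(n_arms, horizon):
        # index = empirical mean + inflation shrinking in the count
        best = max(
            range(n_arms),
            key=lambda i: sums[i] / counts[i]
            + c * math.sqrt(math.log(n + 1) / counts[i]),
        )
        total += pull(best)
    return total, counts

total, counts = ism_uniform_bandit([(0.0, 1.0), (0.0, 2.0)], horizon=2000)
```

With these supports the second arm has the larger mean, so the index policy ends up sampling it far more often while the bonus still guarantees every arm is revisited.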
Asymptotically Optimal Sequential Experimentation Under Generalized Ranking
We consider the classical problem of a controller activating (or
sampling) sequentially from a finite number, $N$, of populations,
specified by unknown distributions. Over some time horizon, at each time $n$, the controller wishes to select a population to sample, with the
goal of sampling from a population that optimizes some "score" function of its
distribution, e.g., maximizing the expected sum of outcomes or minimizing
variability. We define a class of \textit{Uniformly Fast (UF)} sampling
policies and show, under mild regularity conditions, that there is an
asymptotic lower bound for the expected total number of sub-optimal population
activations. Then, we provide sufficient conditions under which a UCB policy is
UF and asymptotically optimal, since it attains this lower bound. Explicit
solutions are provided for a number of examples of interest, including general
score functionals on unconstrained Pareto distributions (of potentially
infinite mean) and uniform distributions of unknown support. Additional
results on bandits of Normal distributions are also provided.
Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret
The purpose of this paper is to provide further understanding of the
structure of the sequential allocation ("stochastic multi-armed bandit", or
MAB) problem by establishing probability-one finite horizon bounds and
convergence rates for the sample (or "pseudo") regret associated with two
simple classes of allocation policies.
For any slowly increasing function $g$, subject to mild regularity
constraints, we construct two policies (the $g$-Forcing and the $g$-Inflated
Sample Mean) that achieve a measure of regret of order $g(n)$ almost surely
as $n \to \infty$, bounded from above and below. Additionally, almost sure upper
and lower bounds on the remainder term are established. In the constructions
herein, the function $g$ effectively controls the "exploration" of the
classical "exploration/exploitation" tradeoff.
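One way the forcing construction can be realized is sketched below: an exploration pull is forced whenever some arm's sample count lags behind $g(n)/N$, and the empirical best arm is played otherwise. The choice $g = \log$ and the Bernoulli rewards are illustrative assumptions, not the paper's exact construction.

```python
import math
import random

def forcing_policy(means, horizon, g=math.log, seed=0):
    """g-Forcing sketch: force an exploration pull whenever some arm's
    sample count lags behind g(n)/N; otherwise play the empirical best
    arm.  Bernoulli(means[i]) rewards are an illustrative choice."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for n in range(1, horizon + 1):
        least = min(range(n_arms), key=lambda i: counts[i])
        if counts[least] == 0 or counts[least] < g(n) / n_arms:
            arm = least                      # forced exploration step
        else:                                # greedy exploitation step
            arm = max(range(n_arms), key=lambda i: sums[i] / counts[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = forcing_policy([0.2, 0.8], horizon=1000)
```

Because $g$ grows slowly, the number of forced pulls per arm is only of order $g(n)/N$; everything else is exploitation, which is what keeps the regret of order $g(n)$.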
Inventory Control Involving Unknown Demand of Discrete Nonperishable Items - Analysis of a Newsvendor-based Policy
Inventory control with unknown demand distribution is considered, with
emphasis placed on the case involving discrete nonperishable items. We focus on
an adaptive policy which in every period uses, as much as possible, the optimal
newsvendor ordering quantity for the empirical distribution learned up to that
period. The policy is assessed using the regret criterion, which measures the
price paid for ambiguity in the demand distribution over $n$ periods. When there
are guarantees on the latter's separation from the critical newsvendor
parameter, a constant upper bound on the regret can be found.
Without any prior information on the demand distribution, we show that the
regret does not grow faster than the rate $n^{1/2+\epsilon}$ for any
$\epsilon > 0$. In view of a known lower bound, this is almost the best one could
hope for. Simulation studies involving this along with other policies are also
conducted.
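The adaptive rule itself is simple to state: in each period, order the critical-ratio quantile of the empirical demand distribution. A minimal sketch, assuming unit price, unit cost, and zero salvage value (so the critical ratio is $(p - c)/p$):

```python
import math

def empirical_newsvendor_order(past_demands, price, cost):
    """Order the newsvendor quantile of the empirical demand
    distribution: the smallest q with F_hat(q) >= (price - cost)/price,
    the critical ratio under unit price/cost and zero salvage."""
    critical_ratio = (price - cost) / price
    d = sorted(past_demands)
    # smallest k with empirical CDF (k + 1)/n >= critical_ratio
    k = max(math.ceil(critical_ratio * len(d)) - 1, 0)
    return d[k]

q = empirical_newsvendor_order(list(range(1, 9)), price=8, cost=2)
# critical ratio 0.75 over demands 1..8 -> order the 6th order statistic
```

As more demand observations accumulate, the empirical quantile converges to the true newsvendor quantity, which is what drives the regret analysis in the abstract.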
Dynamic Pricing in a Dual Market Environment
This paper is concerned with the determination of pricing strategies for a
firm that in each period of a finite horizon receives replenishment quantities
of a single product which it sells in two markets, e.g., a long-distance market
and an on-site market. The key difference between the two markets is that the
long-distance market provides for a one period delay in demand fulfillment. In
contrast, on-site orders must be filled immediately as the customer is at the
physical on-site location. We model the demands in consecutive periods as
independent random variables whose distributions depend on the item's price
in accordance with two general stochastic demand functions: additive or
multiplicative.
The firm uses a single pool of inventory to fulfill demands from both
markets. We investigate properties of the structure of the dynamic pricing
strategy that maximizes the total expected discounted profit over the finite
time horizon, under fixed or controlled replenishment conditions. Further, we
provide conditions under which one market may be the preferred sales outlet
over the other.
Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem
Consider the problem of sampling sequentially from a finite number of populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$ and
$k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the
$k$-th time it is sampled. It is assumed that for each fixed $i$,
$\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables,
with unknown mean $\mu_i$ and unknown variance $\sigma^2_i$.
The objective is to have a policy for deciding from which of the $N$
populations to sample at any time $n$ so as to maximize the
expected sum of outcomes of $n$ samples or, equivalently, to minimize the regret
due to lack of information about the parameters $\{\mu_i\}$ and $\{\sigma^2_i\}$. In this
paper, we present a simple inflated sample mean (ISM) index policy that is
asymptotically optimal in the sense of Theorem 4 below. This resolves a
standing open problem from Burnetas and Katehakis (1996). Additionally, finite
horizon regret bounds are given.
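To illustrate the shape of such an index, the sketch below inflates the empirical mean by a bonus built from the empirical variance, the current time $n$, and the arm's sample count $t$. This particular bonus is an assumption made for illustration; the exact index and its constants are given in the paper.

```python
import math

def ism_index(mean, var, t, n):
    """Illustrative inflated-sample-mean index for a normal arm with
    unknown mean and variance: empirical mean plus a bonus that grows
    with the time n and shrinks with the arm's sample count t.  The
    exact inflation used in the paper may differ from this sketch."""
    if t < 3:
        return float("inf")      # force a minimum number of samples
    return mean + math.sqrt(var * (n ** (2.0 / (t - 2)) - 1.0))
```

The policy then simply pulls the arm with the largest index; under-sampled arms receive a large bonus, and the bonus vanishes as an arm's count grows.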
Optimal Data Driven Resource Allocation under Multi-Armed Bandit Observations
This paper introduces the first asymptotically optimal strategy for a
multi-armed bandit (MAB) model under side constraints. The side constraints model
situations in which bandit activations are limited by the availability of
certain resources that are replenished at a constant rate. The main result
involves the derivation of an asymptotic lower bound for the regret of feasible
uniformly fast policies and the construction of policies that achieve this
lower bound, under pertinent conditions. Further, we provide the explicit form
of such policies for the case in which the unknown distributions are Normal
with unknown means and known variances, for the case of Normal distributions
with unknown means and unknown variances, and for the case of arbitrary discrete
distributions with finite support.
Cash-Flow Based Dynamic Inventory Management
Small-to-medium size enterprises (SMEs), including many startup firms, need
to manage interrelated flows of cash and inventories of goods. In this paper,
we model a firm that can finance its inventory (ordered or manufactured) with
loans in order to meet random demand which in general may not be time
stationary. The firm earns interest on its cash on hand and pays interest on
its debt. The objective is to maximize the expected value of the firm's
working capital at the end of a finite planning horizon. Our study shows that
the optimal ordering policy is characterized by a pair of threshold variables
for each period, as a function of the initial state of the period. Further, upper
and lower bounds for the threshold values are developed using two
simple-to-compute ordering policies. Based on these bounds, we provide an
efficient algorithm to compute the two threshold values. Since the underlying
state space is two-dimensional, which leads to high computational complexity of
the optimization algorithm, we also derive upper bounds for the optimal value
function by reducing the optimization problem to one dimension. Subsequently,
it is shown that policies of similar structure are optimal when the loan and
deposit interest rates are piecewise linear functions, when there is a maximal
loan limit, and when unsatisfied demand is backordered. Finally, further
managerial insights are provided with numerical studies.
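To make the cash-flow dynamics concrete, the sketch below evaluates a simple base-stock ordering rule by Monte Carlo in a model of this kind: orders are financed from cash first and by a loan when cash runs out, cash earns deposit interest, and debt pays loan interest. The base-stock rule and all parameters are illustrative stand-ins, not the optimal two-threshold policy characterized in the paper.

```python
import random

def terminal_wealth(base_stock, horizon, demand, price, cost,
                    r_deposit, r_loan, seed=0, n_runs=2000):
    """Monte Carlo estimate of expected terminal cash under a
    base-stock ordering rule financed by loans when cash is short."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        inv, cash = 0, 0.0
        for _ in range(horizon):
            q = max(base_stock - inv, 0)      # order up to base stock
            cash -= cost * q                  # loan implied if cash < 0
            inv += q
            d = rng.choice(demand)            # i.i.d. discrete demand
            sold = min(inv, d)                # unmet demand lost here
            inv -= sold
            cash += price * sold
            rate = r_deposit if cash >= 0 else r_loan
            cash *= 1.0 + rate                # interest on cash or debt
        total += cash + cost * inv            # value leftover stock at cost
    return total / n_runs

baseline = terminal_wealth(0, 5, [1, 2, 3], price=10, cost=4,
                           r_deposit=0.01, r_loan=0.05)
wealth = terminal_wealth(2, 5, [1, 2, 3], price=10, cost=4,
                         r_deposit=0.01, r_loan=0.05)
```

Ordering nothing yields zero terminal wealth, while financing a modest stock with loans is profitable whenever the margin outweighs the loan interest, which is the trade-off the optimal thresholds balance.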
A Comparative Analysis of the Successive Lumping and the Lattice Path Counting Algorithms
This article provides a comparison of the successive lumping (SL) methodology
with the popular lattice path counting algorithm for obtaining rate matrices of
queueing models satisfying the quasi-birth-and-death (QBD) structure. The two
methodologies are compared both in terms of applicability requirements and
numerical complexity by analyzing their performance on the same classical
queueing models.
The main findings are: i) when both methods are applicable, SL-based
algorithms outperform the lattice path counting algorithm (LPCA); ii) there are
important classes of problems (e.g., models with (level) non-homogeneous rates
or with finite state spaces) for which the SL methodology is applicable and for
which the LPCA cannot be used; iii) another main advantage of successive
lumping algorithms over LPCAs is that the former include a method to compute
the steady state distribution using the computed rate matrix.
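For reference, the rate matrix $R$ that both methods target satisfies the QBD matrix-quadratic equation $A_0 + R A_1 + R^2 A_2 = 0$, where $A_0$, $A_1$, $A_2$ hold the up, local, and down transition rates. The sketch below solves it by the classical fixed-point iteration, which is neither the SL nor the LPCA algorithm, just the textbook baseline they are compared against:

```python
import numpy as np

def qbd_rate_matrix(A0, A1, A2, tol=1e-12, max_iter=10_000):
    """Find the minimal nonnegative solution R of
    A0 + R A1 + R^2 A2 = 0 via R <- -(A0 + R^2 A2) A1^{-1}."""
    A1_inv = np.linalg.inv(A1)
    R = np.zeros_like(A0)
    for _ in range(max_iter):
        R_next = -(A0 + R @ R @ A2) @ A1_inv
        if np.max(np.abs(R_next - R)) < tol:
            return R_next
        R = R_next
    raise RuntimeError("iteration did not converge")

# scalar sanity check: an M/M/1 queue with arrival rate 1 and service
# rate 2 is a QBD with 1x1 blocks, where R equals rho = 1/2
A0 = np.array([[1.0]])    # up-transitions (arrivals)
A1 = np.array([[-3.0]])   # local: -(lambda + mu)
A2 = np.array([[2.0]])    # down-transitions (services)
R = qbd_rate_matrix(A0, A1, A2)
```

Once $R$ is available, the level-$k$ stationary probabilities follow as $\pi_k = \pi_0 R^k$, which is the step SL additionally packages with its rate-matrix computation.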
On the Solution to a Countable System of Equations Arising in Stochastic Processes
In this paper we develop a method to compute the solution to a countable
(finite or infinite) set of equations that occurs in many different fields
including Markov processes that model queueing systems, birth-and-death
processes and inventory systems. The method provides a fast and exact
computation of the inverse of the matrix of the coefficients of the system. In
contrast, alternative inversion techniques perform much more slowly and work only
for finite-size matrices. Furthermore, we provide a procedure to construct the
eigenvalues of the matrix under consideration.
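In the birth-and-death special case mentioned above, the countable system reduces to the balance equations $\lambda_k \pi_k = \mu_{k+1} \pi_{k+1}$, which a forward recursion solves directly. A small truncated sketch (the M/M/1 rates below are an illustrative choice, not the general method of the paper):

```python
def birth_death_stationary(birth, death, levels):
    """Solve the truncated birth-death balance equations
    lambda_k * pi_k = mu_{k+1} * pi_{k+1} by forward recursion,
    then normalize.  `birth(k)` and `death(k)` return the rates."""
    pi = [1.0]
    for k in range(levels - 1):
        pi.append(pi[-1] * birth(k) / death(k + 1))
    total = sum(pi)
    return [p / total for p in pi]

# M/M/1 with lambda = 1, mu = 2: pi_k -> (1 - rho) rho^k, rho = 1/2
pi = birth_death_stationary(lambda k: 1.0, lambda k: 2.0, levels=30)
```

Each unnormalized $\pi_{k+1}$ is obtained from $\pi_k$ in one multiplication, mirroring how an exact solution can avoid a general-purpose matrix inversion entirely.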