A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit
In this paper, we consider the contextual variant of the MNL-Bandit problem.
More specifically, we consider a dynamic set optimization problem, where a
decision-maker offers a subset (assortment) of products to a consumer and
observes their response in every round. Consumers purchase products to maximize
their utility. We assume that a set of attributes describes the products, and
the mean utility of a product is linear in the values of these attributes. We
model consumer choice behavior using the widely used Multinomial Logit (MNL)
model and consider the decision-maker's problem of dynamically learning the
model parameters while optimizing cumulative revenue over the selling horizon
$T$. Though this problem has attracted considerable attention in recent times,
many existing methods involve solving an intractable non-convex optimization
problem, and their theoretical performance guarantees depend on a
problem-dependent parameter that can be prohibitively large. In particular,
existing algorithms have regret bounds that scale with a problem-dependent
constant $\kappa$, which can have an exponential dependency on the number of
attributes. In this paper, we propose an optimistic algorithm whose regret
bound significantly improves this dependence, outperforming existing methods.
Further, we propose a convex relaxation of the optimization step, which allows
for tractable decision-making while retaining the favourable regret guarantee.
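To make the choice model concrete, here is a minimal sketch (our own illustration, not the paper's implementation; the attribute matrix `X`, parameter `theta`, and `prices` are hypothetical names) of MNL purchase probabilities under linear mean utilities, together with the expected revenue of an offered assortment:

```python
import numpy as np

def mnl_choice_probs(X, theta):
    """Purchase probabilities under the MNL model with linear utilities.

    X     : (n, d) attribute matrix, one row per offered product
    theta : (d,) taste parameter; mean utility of product i is X[i] @ theta
    Returns probabilities for each product plus the no-purchase option (last entry).
    """
    v = np.exp(X @ theta)        # attraction values
    denom = 1.0 + v.sum()        # the "1" is the outside (no-purchase) option
    return np.append(v / denom, 1.0 / denom)

def expected_revenue(X, theta, prices):
    """Expected revenue of offering the assortment described by X."""
    probs = mnl_choice_probs(X, theta)
    return float(probs[:-1] @ prices)
```

The decision-maker's round-by-round problem in the abstract is then: given an estimate (or optimistic estimate) of `theta`, choose the assortment maximizing `expected_revenue`.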
PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits
We address the problem of regret minimization in logistic contextual bandits,
where a learner decides among sequential actions or arms given their respective
contexts to maximize binary rewards. Using a fast inference procedure with
Polya-Gamma distributed augmentation variables, we propose an improved version
of Thompson Sampling, a Bayesian formulation of contextual bandits with
near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling
(PG-TS), achieves state-of-the-art performance on simulated and real data.
PG-TS explores the action space efficiently and exploits high-reward arms,
quickly converging to solutions of low regret. Its explicit estimation of the
posterior distribution of the context feature covariance leads to substantial
empirical gains over approximate approaches. PG-TS is the first approach to
demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose
an efficient Gibbs sampler for approximating the analytically unsolvable
integral of logistic contextual bandits.
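The Gibbs step underlying this construction can be sketched as follows (a minimal illustration of Polya-Gamma-augmented Bayesian logistic regression in the style of Polson, Scott, and Windle, not the PG-TS code itself; the truncation level, prior variance, and function names are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pg(b, c, trunc=100):
    """Draw from PG(b, c) via its truncated infinite-sum-of-gammas representation."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

def gibbs_logistic(X, y, n_iter=200, prior_var=10.0):
    """Gibbs sampler for logistic regression with a N(0, prior_var * I) prior."""
    n, d = X.shape
    theta = np.zeros(d)
    B_inv = np.eye(d) / prior_var
    kappa = y - 0.5
    samples = []
    for _ in range(n_iter):
        # 1) augment: omega_i ~ PG(1, x_i . theta)
        omega = np.array([sample_pg(1.0, x @ theta) for x in X])
        # 2) theta | omega, y is exactly Gaussian under the augmentation
        V = np.linalg.inv(X.T * omega @ X + B_inv)
        m = V @ (X.T @ kappa)
        theta = rng.multivariate_normal(m, V)
        samples.append(theta)
    return np.array(samples)
```

In a Thompson-sampling loop, one posterior draw of `theta` per round would be used to pick the arm maximizing the sampled expected reward.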
Dynamic Assortment Optimization with Changing Contextual Information
In this paper, we study the dynamic assortment optimization problem under a
finite selling season of length $T$. At each time period, the seller offers an
arriving customer an assortment of substitutable products under a cardinality
constraint, and the customer makes the purchase among offered products
according to a discrete choice model. Most existing work associates each
product with a real-valued fixed mean utility and assumes a multinomial logit
choice (MNL) model. In many practical applications, feature/contextual
information of products is readily available. In this paper, we incorporate the
feature information by assuming a linear relationship between the mean utility
and the feature. In addition, we allow the feature information of products to
change over time so that the underlying choice model can also be
non-stationary. To solve the dynamic assortment optimization problem under
this changing contextual MNL model, we need to simultaneously learn the
underlying unknown coefficient and make decisions on the assortment. To this
end, we develop an upper confidence bound (UCB) based policy and establish a
regret bound on the order of $\tilde{O}(d\sqrt{T})$, where $d$ is the
dimension of the feature and $\tilde{O}$ suppresses logarithmic dependence. We
further establish a lower bound of $\Omega(d\sqrt{T}/K)$, where $K$ is the
cardinality constraint of an offered assortment, which is usually small. When
$K$ is a constant, our policy is optimal up to logarithmic factors. In the exploitation
phase of the UCB algorithm, we need to solve a combinatorial optimization for
assortment optimization based on the learned information. We further develop an
approximation algorithm and an efficient greedy heuristic. The effectiveness of
the proposed policy is further demonstrated by our numerical studies.
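To make the exploitation step concrete, here is a minimal sketch of a greedy heuristic for cardinality-constrained assortment optimization under MNL (our own illustration, not the paper's algorithm; `r` holds product prices and `v` the estimated attraction values):

```python
def mnl_revenue(S, r, v):
    """Expected revenue of assortment S under MNL with attractions v and prices r."""
    num = sum(r[i] * v[i] for i in S)
    den = 1.0 + sum(v[i] for i in S)  # the "1" is the no-purchase option
    return num / den

def greedy_assortment(r, v, K):
    """Greedily add the product that most increases expected revenue,
    stopping at the cardinality cap K or when no product improves revenue."""
    S, best = set(), 0.0
    while len(S) < K:
        gains = {i: mnl_revenue(S | {i}, r, v)
                 for i in range(len(r)) if i not in S}
        i_star = max(gains, key=gains.get)
        if gains[i_star] <= best:
            break
        S.add(i_star)
        best = gains[i_star]
    return S, best
```

The greedy pass costs $O(nK)$ revenue evaluations per round, which is what makes it attractive inside a UCB loop compared with exact combinatorial search.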
Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion
The logistic bandit is a ubiquitous framework for modeling users' choices,
e.g., click vs. no click in an advertisement recommender system. We observe
that prior works overlook or neglect the dependency on $S$, a norm bound on
the unknown parameter vector $\theta_\star$, which is particularly problematic
when $S$ is large. In this work, we improve the dependency on $S$ via a novel
approach called {\it
regret-to-confidence set conversion (R2CS)}, which allows us to construct a
convex confidence set based on only the \textit{existence} of an online
learning algorithm with a regret guarantee. Using R2CS, we obtain a strict
improvement in the regret bound w.r.t. $S$ in logistic bandits while retaining
computational feasibility and the dependence on other factors such as $d$ and
$T$. We apply our new confidence set to the regret analyses of logistic
bandits with a new martingale concentration step that circumvents an
additional multiplicative factor. We then extend this analysis to multinomial logistic bandits and obtain
similar improvements in the regret, showing the efficacy of R2CS. While we
applied R2CS to the (multinomial) logistic model, R2CS is a generic approach
for developing confidence sets that can be used for various models, which can
be of independent interest.
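One schematic way to read the regret-to-confidence-set idea (a generic sketch in our own notation, under our reading of the abstract, not necessarily the paper's exact construction): if an online learner producing iterates $\theta_s$ against losses $\ell_s$ enjoys a regret guarantee, the set of parameters whose cumulative loss is not much worse than the learner's trajectory is a valid confidence set, with the regret bound controlling its size:

```latex
% Regret guarantee of the online learner over the losses \ell_s:
\sum_{s=1}^{t} \ell_s(\theta_s) \;-\; \inf_{\theta} \sum_{s=1}^{t} \ell_s(\theta)
  \;\le\; B_t .
% Candidate confidence set, with \beta_t collecting B_t plus a
% martingale concentration slack:
\mathcal{C}_t \;=\; \Big\{ \theta \;:\;
  \sum_{s=1}^{t} \ell_s(\theta) \;-\; \sum_{s=1}^{t} \ell_s(\theta_s)
  \;\le\; \beta_t \Big\},
\qquad \theta_\star \in \mathcal{C}_t \ \text{with high probability.}
```

The appeal of such a construction is that only the *existence* of a low-regret online learner is needed, and $\mathcal{C}_t$ is convex whenever the losses $\ell_s$ are convex in $\theta$.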
Revenue Maximization and Learning in Products Ranking
We consider the revenue maximization problem for an online retailer who plans
to display a set of products differing in their prices and qualities and rank
them in order. The consumers have random attention spans and view the products
sequentially before purchasing a ``satisficing'' product or leaving the
platform empty-handed when the attention span gets exhausted. Our framework
extends the cascade model in two directions: the consumers have random
attention spans instead of fixed ones and the firm maximizes revenues instead
of clicking probabilities. We show a nested structure of the optimal product
ranking as a function of the attention span when the attention span is fixed,
and design an approximation algorithm accordingly for the random attention
spans. When the conditional purchase probabilities are not known and may depend
on consumer and product features, we devise an online learning algorithm that
achieves sublinear regret relative to the approximation
algorithm, despite the censoring of information: the attention span of a
customer who purchases an item is not observable. Numerical experiments
demonstrate the outstanding performance of the approximation and online
learning algorithms.
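The model as described admits a simple closed form for the expected revenue of a fixed ranking, which the following minimal sketch computes (our own illustration under the stated model; all variable names are ours). A position is viewed only if the customer is still attentive there and has not purchased earlier, so the revenue of position $j$ is $r_j \, p_j \, \Pr(M \ge j) \prod_{k<j}(1-p_k)$:

```python
def expected_ranking_revenue(prices, probs, tail):
    """Expected revenue of showing products in the given order.

    prices : price r_j of the product placed at position j (0-indexed)
    probs  : conditional purchase probability p_j if position j is viewed
    tail   : tail[j] = P(attention span M >= j + 1), i.e. the customer
             is still attentive at position j
    """
    revenue, no_buy_so_far = 0.0, 1.0
    for r, p, q in zip(prices, probs, tail):
        revenue += r * p * q * no_buy_so_far  # reach j attentive and unbought
        no_buy_so_far *= 1.0 - p              # survives position j without buying
    return revenue
```

Scoring candidate rankings with this function is the building block both for the approximation algorithm (which searches over orderings) and for evaluating policies in simulation.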