Sequential Matrix Completion
We propose a novel algorithm for sequential matrix completion in a
recommender system setting, where the (i, j)-th entry of the matrix corresponds
to user i's rating of product j. The objective of the algorithm is to
provide a sequential policy for user-product pair recommendation which will
yield the highest possible ratings after a finite time horizon. The algorithm
uses a Gamma process factor model with two posterior-focused bandit policies,
Thompson Sampling and Information-Directed Sampling. While Thompson Sampling
shows competitive performance in simulations, state-of-the-art performance is
obtained from Information-Directed Sampling, which makes its recommendations
based on a ratio between the expected reward and a measure of information
gain. To our knowledge, this is the first implementation of Information
Directed Sampling on large real datasets.
This approach contributes to a recent line of research on bandit approaches
to collaborative filtering including Kawale et al. (2015), Li et al. (2010),
Bresler et al. (2014), Li et al. (2016), Deshpande & Montanari (2012), and Zhao
et al. (2013). The setting of this paper, as has been noted in Kawale et al.
(2015) and Zhao et al. (2013), presents significant challenges to bounding
regret after finite horizons. We discuss these challenges in relation to
simpler models for bandits with side information, such as linear or Gaussian
process bandits, and hope the experiments presented here motivate further
research toward theoretical guarantees.
Comment: 10 pages, 6 figures
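The information ratio that drives Information-Directed Sampling can be sketched on a toy Bernoulli bandit. The following is a minimal sketch of the variance-based variant, in which the posterior variance of each arm's mean serves as the information-gain proxy; the Beta-Bernoulli posterior and all constants are illustrative stand-ins, not the paper's Gamma process factor model:

```python
import numpy as np

rng = np.random.default_rng(0)

def ids_select(alpha, beta, n_samples=1000):
    """Variance-based IDS for a Bernoulli bandit with Beta posteriors.

    Chooses the arm minimising delta(a)^2 / v(a), where delta(a) is the
    posterior expected regret of arm a and v(a) is the posterior variance
    of its mean reward (a common proxy for information gain)."""
    theta = rng.beta(alpha, beta, size=(n_samples, len(alpha)))  # posterior draws
    best = theta.max(axis=1)
    delta = (best[:, None] - theta).mean(axis=0)   # expected regret per arm
    v = theta.var(axis=0)                          # info-gain proxy
    return int(np.argmin(delta**2 / (v + 1e-12)))

# usage: run on a toy 3-armed Bernoulli bandit
true_p = np.array([0.3, 0.5, 0.7])
alpha, beta = np.ones(3), np.ones(3)
for _ in range(500):
    a = ids_select(alpha, beta)
    r = rng.random() < true_p[a]
    alpha[a] += r
    beta[a] += 1 - r
best_arm = int(np.argmax(alpha / (alpha + beta)))  # arm with highest posterior mean
```

The squared-regret-over-information ratio trades off exploitation against learning: an arm with moderate expected regret but high posterior uncertainty can beat the greedy choice.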
Alternating Linear Bandits for Online Matrix-Factorization Recommendation
We consider the problem of collaborative filtering in the online setting,
where items are recommended to users over time. At each time step,
the user (selected by the environment) consumes an item (selected by the agent)
and provides a rating of the selected item. In this paper, we propose a novel
algorithm for online matrix factorization recommendation that combines linear
bandits and alternating least squares. In this formulation, the bandit feedback
is equal to the difference between the ratings of the best and selected items.
We evaluate the performance of the proposed algorithm over time using both
cumulative regret and average cumulative NDCG. Simulation results over three
synthetic datasets as well as three real-world datasets for online
collaborative filtering indicate the superior performance of the proposed
algorithm over two state-of-the-art online algorithms.
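The two ingredients the abstract combines can be sketched as follows, assuming a ridge-regularized alternating-least-squares refit and a standard LinUCB index over item factors; the paper's actual update rules and feedback definition may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def als_step(R, mask, U, V, lam=0.1):
    """One alternating-least-squares pass over the observed entries of R
    (mask[i, j] is True where user i has rated item j)."""
    k = U.shape[1]
    for i in range(U.shape[0]):                      # refit user factors
        obs = mask[i]
        if obs.any():
            A = V[obs].T @ V[obs] + lam * np.eye(k)
            U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
    for j in range(V.shape[0]):                      # refit item factors
        obs = mask[:, j]
        if obs.any():
            A = U[obs].T @ U[obs] + lam * np.eye(k)
            V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U, V

def linucb_pick(u_i, V, A_inv, alpha=0.5):
    """LinUCB step: treat the item factors V as feature vectors for user
    i's parameter estimate u_i and pick the item with the highest UCB."""
    widths = np.sqrt(np.einsum('ij,jk,ik->i', V, A_inv, V))
    return int(np.argmax(V @ u_i + alpha * widths))

# usage: recover a noiseless rank-2 rating matrix from partial observations
n_users, n_items, k = 20, 15, 2
U_true = rng.normal(size=(n_users, k))
V_true = rng.normal(size=(n_items, k))
R = U_true @ V_true.T
mask = rng.random((n_users, n_items)) < 0.6          # 60% of entries observed
U = rng.normal(size=(n_users, k))
V = rng.normal(size=(n_items, k))
for _ in range(30):
    U, V = als_step(R, mask, U, V)
rmse = np.sqrt(np.mean((U @ V.T - R)[mask] ** 2))
pick = linucb_pick(U[0], V, np.eye(k))               # recommend an item to user 0
```

Alternating the two steps keeps each subproblem a simple ridge regression, which is what makes linear-bandit confidence sets applicable to the factorized model.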
Unreliable Multi-Armed Bandits: A Novel Approach to Recommendation Systems
We use a novel modification of Multi-Armed Bandits to create a new model for
recommendation systems. We model the recommendation system as a bandit seeking
to maximize reward by pulling arms with unknown rewards. The catch, however,
is that this bandit can only access these arms through an unreliable
intermediate that has some level of autonomy while choosing its arms. For
example, in a streaming website the user has a lot of autonomy while choosing
content they want to watch. The streaming sites can use targeted advertising as
a means to bias opinions of these users. Here the streaming site is the bandit
aiming to maximize reward and the user is the unreliable intermediate. We model
the intermediate as accessing states via a Markov chain. The bandit is allowed
to perturb this Markov chain. We prove fundamental theorems for this setting
after which we show a close-to-optimal Explore-Commit algorithm.
Comment: 4 pages, 4 figures. Aditya Narayan Ravi and Pranav Poduval have equal contribution
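The Explore-Commit idea mentioned at the end can be sketched in its plain bandit form; this omits the Markov-chain intermediate and the perturbations, which are the paper's actual contribution, and all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def explore_then_commit(pull, n_arms, horizon, m):
    """Pull each arm m times, then commit to the empirical best arm for
    the remainder of the horizon. Returns the committed arm and the total
    reward accrued."""
    means = np.zeros(n_arms)
    total = 0.0
    for a in range(n_arms):                      # exploration phase
        draws = [pull(a) for _ in range(m)]
        means[a] = np.mean(draws)
        total += sum(draws)
    best = int(np.argmax(means))                 # commit phase
    total += sum(pull(best) for _ in range(horizon - n_arms * m))
    return best, total

# usage: two Bernoulli arms with means 0.2 and 0.8
true_p = [0.2, 0.8]
best, total = explore_then_commit(
    lambda a: float(rng.random() < true_p[a]), n_arms=2, horizon=1000, m=50)
```

The choice of the exploration budget m governs the usual trade-off: too small and the commit step may lock onto a suboptimal arm, too large and exploration itself dominates the regret.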
Time-Sensitive Bandit Learning and Satisficing Thompson Sampling
The literature on bandit learning and regret analysis has focused on contexts
where the goal is to converge on an optimal action in a manner that limits
exploration costs. One shortcoming imposed by this orientation is that it does
not treat time preference in a coherent manner. Time preference plays an
important role when the optimal action is costly to learn relative to
near-optimal actions. This limitation has not only restricted the relevance of
theoretical results but has also influenced the design of algorithms. Indeed,
popular approaches such as Thompson sampling and UCB can fare poorly in such
situations. In this paper, we consider discounted rather than cumulative
regret, where a discount factor encodes time preference. We propose satisficing
Thompson sampling -- a variation of Thompson sampling -- and establish a strong
discounted regret bound for this new algorithm.
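One simple reading of the satisficing idea for a Bernoulli bandit: sample from the posterior as usual, but settle for any arm whose sampled mean is within epsilon of the sampled optimum instead of insisting on the exact argmax. This is an illustrative simplification, not the paper's exact algorithm; the tie-breaking rule (lowest index) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def sts_select(alpha, beta, epsilon=0.05):
    """Satisficing Thompson sampling sketch for a Bernoulli bandit.

    Draws one Beta posterior sample per arm, then plays the lowest-indexed
    arm whose sampled mean is within epsilon of the sampled optimum."""
    theta = rng.beta(alpha, beta)
    good = np.flatnonzero(theta >= theta.max() - epsilon)
    return int(good[0])
```

When learning the exact optimum is expensive, stopping at an epsilon-optimal arm can sharply reduce exploration cost, which is the time-preference point the abstract makes.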
Accurate Inference for Adaptive Linear Models
Estimators computed from adaptively collected data do not behave like their
non-adaptive brethren. Rather, the sequential dependence of the collection
policy can lead to severe distributional biases that persist even in the
infinite data limit. We develop a general method -- W-decorrelation --
for transforming the bias of adaptive linear regression estimators into
variance. The method uses only coarse-grained information about the data
collection policy and does not need access to propensity scores or exact
knowledge of the policy. We bound the finite-sample bias and variance of the
W-estimator and develop asymptotically correct confidence intervals
based on a novel martingale central limit theorem. We then demonstrate the
empirical benefits of the generic W-decorrelation procedure in two
different adaptive data settings: the multi-armed bandit and the autoregressive
time series.
Comment: Typos fixed for clarification
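The decorrelation idea can be sketched as correcting the adaptive estimate by W(y - X theta_hat), with W built row-by-row from the data sequence alone. The recursion and the role of lam below are a simplified reading for illustration, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(4)

def w_decorrelate(X, y, theta_hat, lam):
    """Schematic decorrelated estimator: theta_d = theta_hat + W (y - X theta_hat).

    W is constructed sequentially from the design rows only (no propensity
    scores needed); lam caps the step sizes and controls the bias/variance
    trade-off. Simplified sketch, not the paper's exact recursion."""
    n, p = X.shape
    W = np.zeros((p, n))
    M = np.eye(p)                       # tracks I - W X over the prefix seen so far
    for t in range(n):
        w = M @ X[t] / (lam + X[t] @ X[t])
        W[:, t] = w
        M -= np.outer(w, X[t])
    return theta_hat + W @ (y - X @ theta_hat)

# usage: on noiseless non-adaptive data the correction vanishes and the
# estimate matches the truth
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
theta_d = w_decorrelate(X, y, theta_hat, lam=1.0)
```

The point of the construction is that the residual correction absorbs the bias introduced by adaptive data collection into extra variance, for which a martingale CLT then yields valid confidence intervals.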
The Sample Complexity of Online One-Class Collaborative Filtering
We consider the online one-class collaborative filtering (CF) problem that
consists of recommending items to users over time in an online fashion based on
positive ratings only. This problem arises when users respond only occasionally
to a recommendation with a positive rating, and never with a negative one. We
study the impact of the probability of a user responding to a recommendation,
p_f, on the sample complexity, i.e., the number of ratings required to make
'good' recommendations, and ask whether receiving positive and negative
ratings, instead of positive ratings only, improves the sample complexity. Both
questions arise in the design of recommender systems. We introduce a simple
probabilistic user model, and analyze the performance of an online user-based
CF algorithm. We prove that after an initial cold start phase, where
recommendations are invested in exploring the user's preferences, this
algorithm makes---up to a fraction of the recommendations required for updating
the user's preferences---perfect recommendations. The number of ratings
required for the cold start phase is nearly proportional to 1/p_f, and that for
updating the user's preferences is essentially independent of p_f. As a
consequence, we find that receiving positive and negative ratings instead of
only positive ones improves the number of ratings required for initial
exploration by a factor of 1/p_f, which can be significant.
Comment: ICML 2017
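The 1/p_f scaling of the cold-start cost is easy to see in simulation. The model below, with a fixed response probability p_f and a fixed probability that a consumed item is liked, is a toy stand-in for the paper's user model:

```python
import numpy as np

rng = np.random.default_rng(5)

def recs_until_k_positives(p_f, k, like_prob=0.5):
    """Count recommendations needed to collect k positive ratings when a
    user responds to a recommendation with probability p_f and only
    positive responses are ever observed (one-class feedback)."""
    n = pos = 0
    while pos < k:
        n += 1
        if rng.random() < p_f and rng.random() < like_prob:
            pos += 1
    return n

# halving p_f roughly doubles the cold-start cost
mean_fast = np.mean([recs_until_k_positives(0.5, 20) for _ in range(200)])
mean_slow = np.mean([recs_until_k_positives(0.25, 20) for _ in range(200)])
```

Each positive rating arrives at rate p_f * like_prob, so the expected number of recommendations before k positives is k / (p_f * like_prob), i.e. proportional to 1/p_f, consistent with the abstract's cold-start claim.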
Risk-Averse Multi-Armed Bandit Problems under Mean-Variance Measure
The multi-armed bandit problems have been studied mainly under the measure of
expected total reward accrued over a horizon of length T. In this paper, we
address the issue of risk in multi-armed bandit problems and develop parallel
results under the measure of mean-variance, a commonly adopted risk measure in
economics and mathematical finance. We show that the model-specific regret and
the model-independent regret in terms of the mean-variance of the reward
process are lower bounded by Omega(log T) and Omega(T^{2/3}),
respectively. We then show that variations of the UCB policy and the DSEE
policy developed for the classic risk-neutral MAB achieve these lower bounds.
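A schematic version of a mean-variance index policy: score each arm by its empirical mean-variance MV = var - rho * mean (lower is better for a risk-averse player) minus a confidence width. The width used here is a generic UCB-style term for illustration, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(6)

def mv_lcb_pick(samples, rho, t):
    """Pick the arm minimising a lower confidence bound on its empirical
    mean-variance. `samples` is a list of reward arrays, one per arm; rho
    weights mean against variance; t is the current round."""
    scores = []
    for s in samples:
        n = len(s)
        mv = np.var(s) - rho * np.mean(s)           # empirical mean-variance
        width = np.sqrt(2 * np.log(t) / n)          # illustrative width
        scores.append(mv - width)
    return int(np.argmin(scores))

# usage: two arms with equal means but very different risk
low_risk = rng.normal(0.5, 0.01, 100)
high_risk = rng.normal(0.5, 1.0, 100)
choice = mv_lcb_pick([low_risk, high_risk], rho=1.0, t=200)
```

Under the mean-variance measure the two arms are no longer equivalent despite equal means: a risk-averse policy should prefer the low-variance arm.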
Implementable confidence sets in high dimensional regression
We consider the setting of linear regression in high dimension. We focus on
the problem of constructing adaptive and honest confidence sets for the sparse
parameter \theta, i.e. we want to construct a confidence set for \theta that
contains \theta with high probability, and that is as small as possible. The l_2
diameter of such a confidence set should depend on the sparsity S of \theta -
the larger S, the wider the confidence set. However, in practice, S is unknown.
This paper focuses on constructing a confidence set for \theta which contains
\theta with high probability, whose diameter is adaptive to the unknown
sparsity S, and which is implementable in practice.
Satisficing in Time-Sensitive Bandit Learning
Much of the recent literature on bandit learning focuses on algorithms that
aim to converge on an optimal action. One shortcoming is that this orientation
does not account for time sensitivity, which can play a crucial role when
learning an optimal action requires much more information than near-optimal
ones. Indeed, popular approaches such as upper-confidence-bound methods and
Thompson sampling can fare poorly in such situations. We consider instead
learning a satisficing action, which is near-optimal while requiring less
information, and propose satisficing Thompson sampling, an algorithm that
serves this purpose. We establish a general bound on expected discounted regret
and study the application of satisficing Thompson sampling to linear and
infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson
sampling. We also discuss the relation between the notion of satisficing and
the theory of rate distortion, which offers guidance on the selection of
satisficing actions.
Comment: This submission largely supersedes earlier work in arXiv:1704.0902
Distributed Online Learning in Social Recommender Systems
In this paper, we consider decentralized sequential decision making in
distributed online recommender systems, where items are recommended to users
based on their search query as well as their specific background including
history of bought items, gender and age, all of which comprise the context
information of the user. In contrast to a centralized recommender system, in
which a single seller has access to the complete inventory of items as well as
the complete record of sales and user information, in a decentralized system
each seller/learner only has access to the inventory and user information for
its own products, not those of other sellers, but can earn a commission by
selling an item of another seller. The sellers must therefore decide in a
distributed way which items to recommend to each incoming user (from their own
inventory or another seller's) in order to maximize revenue from their own
sales and commissions. We formulate this problem as a cooperative contextual
bandit problem and analytically bound each seller's performance against the
best recommendation strategy given the complete realization of user arrivals,
the inventory of items, and the context-dependent purchase probability of each
item. We verify our results via numerical examples on a distributed dataset
adapted from Amazon data. We evaluate the dependence
of the performance of a seller on the inventory of items the seller has, the
number of connections it has with the other sellers, and the commissions which
the seller gets by selling items of other sellers to its users.
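The seller's trade-off between selling its own items and referring users to a partner for a commission can be sketched with a simple non-contextual epsilon-greedy learner standing in for the paper's cooperative contextual bandit; the revenues, commission rate, and exploration rate below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

class Seller:
    """Schematic seller in a decentralized recommender: each action either
    recommends one of its own items or forwards the user to a partner
    seller, earning a commission on the partner's sale."""
    def __init__(self, n_own, n_partners, commission=0.3, eps=0.1):
        self.n = n_own + n_partners          # own items + one slot per partner
        self.commission = commission
        self.eps = eps
        self.counts = np.zeros(self.n)
        self.values = np.zeros(self.n)       # running mean reward per action
    def act(self):
        if rng.random() < self.eps:          # explore
            return int(rng.integers(self.n))
        return int(np.argmax(self.values))   # exploit
    def update(self, a, revenue, referred=False):
        reward = self.commission * revenue if referred else revenue
        self.counts[a] += 1
        self.values[a] += (reward - self.values[a]) / self.counts[a]

# usage: referring (0.3 commission on a 1.0 sale) beats the 0.2 own-item sale
s = Seller(n_own=1, n_partners=1)
for _ in range(2000):
    a = s.act()
    if a == 0:
        s.update(a, revenue=0.2)                  # own item sells for 0.2
    else:
        s.update(a, revenue=1.0, referred=True)   # partner sale, commission paid
```

Because the commission (0.3) exceeds the own-sale revenue (0.2), the learner should come to prefer referral, illustrating why cooperation can be revenue-maximizing for an individual seller.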