Ulsan National Institute of Science and Technology
Graduate School of Artificial Intelligence

Abstract

The multi-armed bandit is a well-formulated test bed for designing sequential decision-making algorithms that must resolve the exploration-exploitation dilemma. A bandit algorithm balances exploration of uncertain regions against exploitation of the observed history in order to accurately estimate the reward distribution of each arm. The contextual bandit additionally provides a context that carries rich information about the structure of the bandit environment and determines the reward function, making algorithms devised in this setting more applicable to real-world problems such as news recommendation. Focusing on the practicality of contextual bandit algorithms, we consider the diversity and non-stationarity of the bandit environment. We further assume that datasets accumulated from previous evaluations are available. To this end, we propose offline training of a reward prediction model via meta-learning so that the model can adapt to the changing environment. We adopt Neural Processes (NPs), probabilistic few-shot learners that provide uncertainty estimates along with their predictions. Building on the upper confidence bound (UCB) exploration strategy, we propose NP-UCB, an exploration strategy driven by the uncertainty estimates of a trained neural process. We evaluate the proposed algorithm with various neural process models on the wheel bandit and a news recommendation task. The results show that our method works well with a recent neural process model, Neural Bootstrapping Attentive Neural Processes (NEUBANP), which adapts to dynamically changing environments with the help of its reliable uncertainty estimates.
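As an illustrative sketch of the selection rule behind NP-UCB (the notation below is our own shorthand for the abstract's description, and the exploration coefficient \(\beta\) is an assumed hyperparameter, not a quantity fixed here): at round \(t\), given context \(x_t\) and the history \(\mathcal{D}_t\) of previously observed context-arm-reward triples, the algorithm selects

\[
a_t \;=\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \Big[\, \mu_\theta(x_t, a \mid \mathcal{D}_t) \;+\; \beta\, \sigma_\theta(x_t, a \mid \mathcal{D}_t) \,\Big],
\]

where \(\mu_\theta\) and \(\sigma_\theta\) denote the predictive mean and standard deviation of the meta-trained neural process conditioned on \(\mathcal{D}_t\). Arms whose rewards the model is uncertain about thus receive an exploration bonus, which is how the NP's uncertainty estimates drive UCB-style exploration.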