We study the application of the Thompson sampling (TS) methodology to the
stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the
standard TS algorithm for the general CMAB, and obtain the first
distribution-dependent regret bound of $O(m K_{\max} \log T / \Delta_{\min})$,
where $m$ is the number of arms, $K_{\max}$ is the size of the largest super
arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between
the expected reward of the optimal solution and any non-optimal solution. We
also show that one cannot directly replace the exact offline oracle with an
approximation oracle in the TS algorithm, even for the classical MAB problem. Then
we expand the analysis to two special cases: the linear reward case and the
matroid bandit case. When the reward function is linear, the regret of the TS
algorithm achieves a better bound $O(m \sqrt{K_{\max}} \log T / \Delta_{\min})$.
For matroid bandits, we can remove the independence assumption across arms and
achieve a regret upper bound that matches the lower bound for the matroid case.
Finally, we report experiments comparing the regret of TS with that of
existing algorithms such as CUCB and ESCB.
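
For concreteness, the following is a minimal sketch of Thompson sampling in a combinatorial semi-bandit, under assumptions that are not taken from the text above: base-arm outcomes are Bernoulli and independent, each arm has a Beta(1, 1) prior, the reward is linear (the sum of the chosen arms' outcomes), and the feasible super arms are all subsets of size k, so that an exact offline oracle is simply a top-k selection. The names top_k_oracle and combinatorial_ts are hypothetical and chosen only for illustration.

```python
import random


def top_k_oracle(theta, k):
    """Exact offline oracle for the illustrative top-k problem:
    return the indices of the k largest entries of theta."""
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]


def combinatorial_ts(means, k, horizon, seed=0):
    """Run Thompson sampling with Beta posteriors and semi-bandit feedback.

    means   -- true Bernoulli means of the base arms (unknown to the learner)
    k       -- size of every super arm (assumed constraint)
    horizon -- number of rounds T
    Returns the cumulative expected regret and the final Beta parameters.
    """
    rng = random.Random(seed)
    m = len(means)
    a = [1] * m  # 1 + number of observed ones per arm
    b = [1] * m  # 1 + number of observed zeros per arm
    best = sum(sorted(means, reverse=True)[:k])  # optimal expected reward
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per base arm from its current posterior.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # Hand the sampled means to the exact oracle.
        super_arm = top_k_oracle(theta, k)
        # Semi-bandit feedback: observe the outcome of each played base arm.
        for i in super_arm:
            x = 1 if rng.random() < means[i] else 0
            a[i] += x
            b[i] += 1 - x
        regret += best - sum(means[i] for i in super_arm)
    return regret, a, b


if __name__ == "__main__":
    regret, a, b = combinatorial_ts([0.9, 0.8, 0.3, 0.2], k=2, horizon=2000)
    print("cumulative expected regret:", round(regret, 2))
```

In this toy instance $m = 4$, $K_{\max} = 2$, and $\Delta_{\min} = 0.5$ (swapping the arm with mean 0.8 for the arm with mean 0.3). The sketch only illustrates the sample-then-call-the-oracle structure; it is not a reproduction of the experiments mentioned above.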