Applying reinforcement learning (RL) to combinatorial optimization problems
is attractive as it removes the need for expert knowledge or pre-solved
instances. However, it is unrealistic to expect an agent to solve these (often
NP-)hard problems in a single shot at inference due to their inherent
complexity. Thus, leading approaches often implement additional search
strategies, from stochastic sampling and beam-search to explicit fine-tuning.
In this paper, we argue for the benefits of learning a population of
complementary policies, which can be simultaneously rolled out at inference. To
this end, we introduce Poppy, a simple, theoretically grounded training
procedure for populations. Instead of relying on a predefined or hand-crafted
notion of diversity, Poppy induces an unsupervised specialization targeted
solely at maximizing the performance of the population. We show that Poppy
produces a set of complementary policies, and obtains state-of-the-art RL
results on three popular NP-hard problems: the traveling salesman (TSP), the
capacitated vehicle routing (CVRP), and the 0-1 knapsack (KP) problems. On TSP
specifically, Poppy outperforms the previous state-of-the-art, reducing the
optimality gap by a factor of 5 while cutting inference time by more than an
order of magnitude.