In this work, we study how the performance of a given translation direction
changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By
training over 200 multilingual models with various model sizes, data sizes, and
language directions, we find that the performance of a given translation
direction does not always improve as its weight in the multi-task optimization
objective increases. Consequently, scalarization leads to a multi-task
trade-off front that deviates from the traditional Pareto front when the
training corpus is data-imbalanced, which poses a great challenge to improving
the overall performance of all directions. Based on
our observations, we propose the Double Power Law to predict the unique
performance trade-off front in MNMT, which is robust across various languages,
levels of data adequacy, and numbers of tasks. Finally, we formulate the sampling ratio
selection problem in MNMT as an optimization problem based on the Double Power
Law. In our experiments, the resulting method outperforms temperature
searching and gradient manipulation methods while using only 1/5 to 1/2 of the total
training budget. We release the code at
https://github.com/pkunlp-icler/ParetoMNMT for reproduction.
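As a rough illustration of how such a formulation could be operationalized, the sketch below fits a per-direction curve of dev loss versus sampling ratio and then picks ratios on the simplex that minimize the predicted mean loss. This is not the released implementation: the two-term functional form, the helper names, and the toy pilot measurements are assumptions made only for illustration; the actual Double Power Law and selection procedure are given in the paper and repository.

```python
# Hypothetical sketch of sampling-ratio selection via a fitted performance law.
import numpy as np
from scipy.optimize import curve_fit, minimize


def double_power_law(ratio, a, alpha, b, beta, c):
    # Assumed two-term power-law form (illustrative, not the paper's formula):
    # loss falls as the direction's own ratio grows and rises as it is crowded out.
    return a * ratio ** (-alpha) + b * (1.0 - ratio) ** beta + c


def fit_direction(ratios, losses):
    # Fit the assumed law to a handful of pilot runs for one direction.
    p0 = [1.0, 0.5, 1.0, 1.0, 0.1]
    params, _ = curve_fit(double_power_law, ratios, losses, p0=p0,
                          bounds=(0.0, np.inf))
    return params


def select_ratios(fitted_params):
    # Choose sampling ratios on the probability simplex that minimize the
    # predicted mean dev loss across directions.
    n = len(fitted_params)

    def predicted_mean_loss(r):
        return np.mean([double_power_law(ri, *p)
                        for ri, p in zip(r, fitted_params)])

    constraints = ({"type": "eq", "fun": lambda r: r.sum() - 1.0},)
    bounds = [(1e-3, 1.0)] * n
    result = minimize(predicted_mean_loss, x0=np.full(n, 1.0 / n),
                      bounds=bounds, constraints=constraints)
    return result.x


# Toy pilot measurements (synthetic numbers, two directions) of dev loss at a
# few candidate sampling ratios for each direction.
ratios = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
losses_hi = np.array([3.2, 2.6, 2.4, 2.3, 2.25])  # high-resource direction
losses_lo = np.array([4.8, 3.9, 3.5, 3.3, 3.2])   # low-resource direction

params = [fit_direction(ratios, losses_hi), fit_direction(ratios, losses_lo)]
print(select_ratios(params))  # e.g. array([r_hi, r_lo]) summing to 1
```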