Achieving robust language technologies that can perform well across the
world's many languages is a central goal of multilingual NLP. In this work, we
take stock of and empirically analyse task performance disparities that exist
between multilingual task-oriented dialogue (ToD) systems. We first define new
quantitative measures of absolute and relative equivalence in system
performance, capturing disparities across languages and within individual
languages. Through a series of controlled experiments, we demonstrate that
performance disparities depend on a number of factors: the nature of the ToD
task at hand, the underlying pretrained language model, the target language,
and the amount of annotated ToD data. We empirically prove the existence of
adaptation and intrinsic biases in current ToD systems: e.g., ToD systems
trained for Arabic or Turkish using annotated ToD data fully parallel to
English ToD data still exhibit diminished ToD task performance. Beyond
providing a series of insights into the performance disparities of ToD systems
in different languages, our analyses offer practical tips on how to approach
ToD data collection and system development for new languages.
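The abstract names, but does not define, the measures of absolute and relative equivalence. Below is a minimal Python sketch of what such measures could look like; the function names (`absolute_gap`, `relative_equivalence`), the comparison against an English reference score, and the example numbers are our assumptions for illustration, not the paper's actual definitions.

```python
# Illustrative sketch only: assumes each measure compares a target language's
# task score to an English reference score on the same ToD task.

def absolute_gap(score_target: float, score_reference: float) -> float:
    """Absolute performance gap between a target language and the
    reference (e.g., English); 0.0 means full absolute equivalence."""
    return score_reference - score_target

def relative_equivalence(score_target: float, score_reference: float) -> float:
    """Fraction of the reference performance retained by the target
    language; 1.0 means full relative equivalence."""
    return score_target / score_reference

# Hypothetical numbers: an English model reaches 0.72 joint goal accuracy,
# while an Arabic model trained on fully parallel data reaches 0.58.
print(absolute_gap(0.58, 0.72))          # 0.14 absolute gap
print(relative_equivalence(0.58, 0.72))  # ~0.81 of English performance
```

Under this hypothetical formulation, a system trained on data fully parallel to English yet scoring below the English reference (a relative equivalence below 1.0) would exemplify the adaptation and intrinsic biases the abstract describes.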