A major bottleneck in training end-to-end task-oriented dialog systems is the
lack of data. To utilize limited training data more efficiently, we propose
Modular Supervision Network (MOSS), an encoder-decoder training framework that
can incorporate supervision from various intermediate dialog system modules,
including natural language understanding, dialog state tracking, dialog policy
learning, and natural language generation. With only 60% of the training data,
MOSS-all (i.e., MOSS with supervision from all four dialog modules) outperforms
state-of-the-art models on CamRest676. Moreover, introducing modular
supervision has even bigger benefits when the dialog task has a more complex
dialog state and action space. With only 40% of the training data, MOSS-all
outperforms the state-of-the-art model on LaptopNetwork, a complex laptop network
troubleshooting dataset that we introduce. LaptopNetwork
consists of conversations between real customers and customer service agents in
Chinese. In addition, the MOSS framework can accommodate dialogs that have
supervision from different dialog modules at both the framework level and the
model level. Therefore, MOSS is extremely flexible to update in a real-world deployment.