In this work, we investigate the task of textual response generation in a
multimodal task-oriented dialogue system. Our work is based on the recently
released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion
domain. We introduce a multimodal extension to the Hierarchical Recurrent
Encoder-Decoder (HRED) model and show that this extension outperforms strong
baselines in terms of text-based similarity metrics. We also showcase the
shortcomings of current vision and language models by performing an error
analysis on our system's output