Transformer encoder-decoder models have shown impressive performance in
dialogue modeling. However, as Transformers are inefficient at processing long
sequences, the dialogue history often needs to be truncated. To address this
problem, we propose a new memory-augmented Transformer that is compatible with
existing pre-trained encoder-decoder models and enables efficient preservation
of history information. It incorporates a separate memory module alongside the
pre-trained Transformer to exchange information effectively between the
memory states and the current input context. We evaluate our model on three
dialogue datasets and two language modeling datasets. Experimental results show
that our method achieves superior efficiency and performance compared to
other pre-trained Transformer baselines.
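
For intuition, the sketch below shows one way such a memory-context exchange could be wired up with standard cross-attention: memory slots first attend over the current turn's hidden states (write), the context then attends over the updated memory (read), and the memory is carried to the next turn in place of the full history. The class and parameter names (MemoryExchangeBlock, n_slots, and so on) are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class MemoryExchangeBlock(nn.Module):
    """Illustrative memory<->context exchange via cross-attention.

    Hypothetical component; the paper's actual module may differ.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_slots: int = 16):
        super().__init__()
        # Learnable initial memory slots, shared across the batch.
        self.init_memory = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        # Memory reads from the current input context (memory update / write).
        self.mem_read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Input context reads from the updated memory (context augmentation / read).
        self.ctx_read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_norm = nn.LayerNorm(d_model)
        self.ctx_norm = nn.LayerNorm(d_model)

    def forward(self, context: torch.Tensor, memory: Optional[torch.Tensor] = None):
        """context: (batch, seq_len, d_model); memory: (batch, n_slots, d_model) or None."""
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(context.size(0), -1, -1)
        # 1) Write: memory slots attend over the current turn's hidden states.
        mem_upd, _ = self.mem_read(memory, context, context)
        memory = self.mem_norm(memory + mem_upd)
        # 2) Read: context tokens attend over the updated memory slots.
        ctx_upd, _ = self.ctx_read(context, memory, memory)
        context = self.ctx_norm(context + ctx_upd)
        # The compact memory, not the full history, is passed to the next turn.
        return context, memory


if __name__ == "__main__":
    block = MemoryExchangeBlock()
    turn1 = torch.randn(2, 40, 512)   # hidden states of dialogue turn 1
    turn2 = torch.randn(2, 35, 512)   # hidden states of dialogue turn 2
    ctx, mem = block(turn1)           # initialize memory from the first turn
    ctx, mem = block(turn2, mem)      # reuse memory instead of the truncated history
    print(ctx.shape, mem.shape)       # torch.Size([2, 35, 512]) torch.Size([2, 16, 512])
```

Because the memory has a fixed number of slots, the per-turn cost stays constant in this sketch regardless of how many past turns have been absorbed, which is the efficiency argument the abstract makes against re-encoding the full history.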