3 research outputs found
Contrastive Learning for Many-to-many Multilingual Neural Machine Translation
Existing multilingual machine translation approaches mainly focus on
English-centric directions, while the non-English directions still lag behind.
In this work, we aim to build a many-to-many translation system with an
emphasis on the quality of non-English language directions. Our intuition is
based on the hypothesis that a universal cross-language representation leads to
better multilingual translation performance. To this end, we propose mRASP2, a
training method to obtain a single unified multilingual translation model.
mRASP2 is empowered by two techniques: a) a contrastive learning scheme to
close the gap among representations of different languages, and b) data
augmentation on both multiple parallel and monolingual data to further align
token representations. For English-centric directions, mRASP2 outperforms the
existing best unified model and achieves competitive or even better performance
than the pre-trained and fine-tuned model mBART on tens of WMT translation
directions. For non-English directions, mRASP2 achieves an average improvement
of 10+ BLEU over the multilingual Transformer baseline. Code, data, and trained
models are available at https://github.com/PANXiao1994/mRASP2.
Comment: accepted as a long paper at ACL 2021
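The contrastive scheme described above can be illustrated with a minimal sketch: pull pooled encoder representations of parallel sentences together and push apart the other sentences in the batch. This is a sketch of the general InfoNCE-style idea only, not mRASP2's implementation; the pooling, temperature value, and function names are assumptions.

```python
# Minimal sketch of a contrastive objective over multilingual sentence
# representations (illustrative; not mRASP2's actual code).
import torch
import torch.nn.functional as F

def contrastive_loss(src_repr: torch.Tensor,
                     tgt_repr: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """src_repr, tgt_repr: (batch, dim) pooled encoder outputs of
    source sentences and their parallel translations."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    # Pairwise similarities; the diagonal holds the true parallel pairs
    # (positives), everything else in the batch serves as negatives.
    logits = src @ tgt.t() / temperature
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)
```

In training, a loss like this would be added to the usual translation cross-entropy so that representations of different languages are drawn into a shared space.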
Weight Distillation: Transferring the Knowledge in Neural Network Parameters
Knowledge distillation has been proven to be effective in model acceleration
and compression. It allows a small network to learn to generalize in the same
way as a large network. Recent successes in pre-training suggest the
effectiveness of transferring model parameters. Inspired by this, we
investigate methods of model acceleration and compression in another line of
research. We propose Weight Distillation to transfer the knowledge in the large
network's parameters through a parameter generator. Our experiments on WMT16
En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks show that weight
distillation can train a small network that is 1.88~2.94x faster than the large
network but with competitive performance. With the same sized small network,
weight distillation can outperform knowledge distillation by 0.51~1.82 BLEU
points.
Comment: accepted by ACL 2021
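The parameter-generator idea can be sketched as follows: rather than matching the large network's outputs (as in standard knowledge distillation), learn a mapping from the teacher's weight matrices to the smaller student's. The two-sided linear projection below is an illustrative assumption, not the paper's exact generator architecture.

```python
# Hedged sketch of a weight-distillation parameter generator:
# W_student = A @ W_teacher @ B, with A and B learned jointly with
# the student on the translation loss (illustrative form only).
import torch
import torch.nn as nn

class ParamGenerator(nn.Module):
    def __init__(self, t_out: int, t_in: int, s_out: int, s_in: int):
        super().__init__()
        # Down-project the teacher matrix on both sides.
        self.A = nn.Parameter(torch.randn(s_out, t_out) / t_out ** 0.5)
        self.B = nn.Parameter(torch.randn(t_in, s_in) / t_in ** 0.5)

    def forward(self, teacher_weight: torch.Tensor) -> torch.Tensor:
        # (s_out, t_out) @ (t_out, t_in) @ (t_in, s_in) -> (s_out, s_in)
        return self.A @ teacher_weight @ self.B

# Usage: derive a 256x256 student layer from a 1024x1024 teacher layer.
gen = ParamGenerator(1024, 1024, 256, 256)
student_weight = gen(torch.randn(1024, 1024))  # shape: (256, 256)
```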
Data-Efficient Methods for Dialogue Systems
Conversational User Interfaces (CUIs) have become ubiquitous in everyday life,
in consumer-focused products like Siri and Alexa as well as in business-oriented
solutions. Deep learning underlies many recent breakthroughs in dialogue
systems but requires very large amounts of training data, often annotated by
experts. When trained on smaller datasets, these methods severely lack
robustness (e.g. to disfluencies and out-of-domain input) and often have
too little generalisation power. In this thesis, we address the above issues by
introducing a series of methods for training robust dialogue systems from
minimal data. Firstly, we study two orthogonal approaches to dialogue,
linguistically informed and machine learning-based, from the data-efficiency
perspective. We outline the steps to obtain data-efficient solutions with
either approach. We then introduce two data-efficient models for dialogue
response generation: the Dialogue Knowledge Transfer Network based on latent
variable dialogue representations, and the hybrid Generative-Retrieval
Transformer model (ranked first at the DSTC 8 Fast Domain Adaptation task).
Next, we address the problem of robustness given minimal data. As such, we propose
a multitask LSTM-based model for domain-general disfluency detection. For the
problem of out-of-domain input, we present Turn Dropout, a data augmentation
technique for anomaly detection only using in-domain data, and introduce
autoencoder-augmented models for efficient training with Turn Dropout. Finally,
we focus on social dialogue and introduce a neural model for response ranking
in social conversation used in Alana, the 3rd place winner in the Amazon Alexa
Prize 2017 and 2018. We employ a novel technique of predicting the dialogue
length as the main ranking objective and show that this approach improves upon
the ratings-based counterpart in terms of data efficiency while matching it in
performance.
Comment: PhD thesis submitted at Heriot-Watt University. Contains previously
published work (see the list in Section 1.4).
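As a rough illustration of the Turn Dropout augmentation mentioned in the abstract (perturbing dialogue histories using in-domain data only), here is a minimal sketch; the thesis's exact procedure may differ, and the drop probability and data format are assumptions.

```python
# Hedged sketch of turn-level dropout as a data augmentation:
# each turn in a dialogue history is independently dropped, producing
# perturbed in-domain examples (illustrative; not the thesis's code).
import random

def turn_dropout(dialogue, drop_prob=0.2):
    """Return a copy of `dialogue` (a list of turn strings) with each
    turn dropped with probability `drop_prob`, keeping at least one."""
    kept = [turn for turn in dialogue if random.random() >= drop_prob]
    return kept if kept else [random.choice(dialogue)]

# Usage: generate augmented training examples from in-domain data only.
history = ["Hi, I need a taxi.", "Where to?", "The airport, please."]
augmented = turn_dropout(history, drop_prob=0.3)
```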