Mask-Predict: Parallel Decoding of Conditional Masked Language Models
Most machine translation systems generate text autoregressively from left to
right. We, instead, use a masked language modeling objective to train a model
to predict any subset of the target words, conditioned on both the input text
and a partially masked target translation. This approach allows for efficient
iterative decoding, where we first predict all of the target words
non-autoregressively, and then repeatedly mask out and regenerate the subset of
words that the model is least confident about. By applying this strategy for a
constant number of iterations, our model improves state-of-the-art performance
levels for non-autoregressive and parallel decoding translation models by over
4 BLEU on average. It is also able to reach within about 1 BLEU point of a
typical left-to-right transformer model, while decoding significantly faster.
Comment: EMNLP 2019
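A minimal sketch of the iterative decoding loop described above (predict every target token in parallel, then repeatedly re-mask and regenerate the least-confident ones with a linearly decaying mask budget). The model interface and mask handling below are assumptions for illustration, not the authors' implementation:

    import torch

    def mask_predict(model, src, tgt_len, mask_id, iterations=10):
        """Iterative parallel decoding: predict all tokens, then repeatedly
        re-mask and regenerate the least-confident ones."""
        # Start from a fully masked target of the predicted length.
        tgt = torch.full((tgt_len,), mask_id, dtype=torch.long)
        probs = torch.zeros(tgt_len)

        for t in range(iterations):
            # Hypothetical interface: per-position logits over the vocabulary,
            # conditioned on the source and the partially masked target.
            logits = model(src, tgt)                      # (tgt_len, vocab)
            new_probs, new_tokens = logits.softmax(-1).max(-1)

            # Only overwrite positions that are currently masked.
            masked = tgt.eq(mask_id)
            tgt = torch.where(masked, new_tokens, tgt)
            probs = torch.where(masked, new_probs, probs)

            # Linearly decay the number of tokens to re-mask per iteration.
            n_mask = int(tgt_len * (1 - (t + 1) / iterations))
            if n_mask == 0:
                break
            # Re-mask the n_mask least-confident positions for the next pass.
            remask = probs.topk(n_mask, largest=False).indices
            tgt[remask] = mask_id
            probs[remask] = 0.0

        return tgt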
Dictionary-based Phrase-level Prompting of Large Language Models for Machine Translation
Large language models (LLMs) demonstrate remarkable machine translation (MT)
abilities via prompting, even though they were not explicitly trained for this
task. However, even given the incredible quantities of data they are trained
on, LLMs can struggle to translate inputs with rare words, which are common in
low resource or domain transfer scenarios. We show that LLM prompting can
provide an effective solution for rare words as well, by using prior knowledge
from bilingual dictionaries to provide control hints in the prompts. We propose
a novel method, DiPMT, that provides a set of possible translations for a
subset of the input words, thereby enabling fine-grained phrase-level prompted
control of the LLM. Extensive experiments show that DiPMT outperforms the
baseline in both low-resource and out-of-domain MT. We further provide a
qualitative analysis of the benefits and limitations of this approach,
including the overall level of controllability that is achieved.
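A rough illustration of the kind of dictionary-hinted prompt the abstract describes: candidate translations for rare source words are inserted as control hints before the sentence to translate. The prompt template, the toy dictionary, and the function name are assumptions for illustration; the paper's exact prompt format may differ:

    def build_dipmt_style_prompt(src_sentence, bilingual_dict,
                                 src_lang="Czech", tgt_lang="English"):
        """Build a translation prompt with dictionary hints for words
        that have candidate translations in the bilingual dictionary."""
        hints = []
        for word in src_sentence.split():
            key = word.lower().strip(".,!?")
            if key in bilingual_dict:
                options = " or ".join(bilingual_dict[key])
                hints.append(f'In this context, "{word}" means "{options}".')

        hint_block = "\n".join(hints)
        return (
            f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
            f"{hint_block}\n"
            f"{src_lang}: {src_sentence}\n"
            f"{tgt_lang}:"
        )

    # Example usage with a toy one-entry dictionary.
    toy_dict = {"jestrab": ["hawk", "goshawk"]}
    print(build_dipmt_style_prompt("Na obloze krouzi jestrab.", toy_dict))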
Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation
There has been recent success in pre-training on monolingual data and
fine-tuning on Machine Translation (MT), but it remains unclear how to best
leverage a pre-trained model for a given MT task. This paper investigates the
benefits and drawbacks of freezing parameters, and adding new ones, when
fine-tuning a pre-trained model on MT. We focus on 1) fine-tuning BART, a model
trained only on English monolingual data, and 2) fine-tuning mBART, a model
trained on monolingual data from 25 languages. For BART we get the best
performance by freezing most of the model parameters, and adding extra
positional embeddings. For mBART we match the performance of naive fine-tuning
for most language pairs, and outperform it for Nepali to English (0.5 BLEU) and
Czech to English (0.6 BLEU), all with a lower memory cost at training time.
When constraining ourselves to an out-of-domain training set for Vietnamese to
English, we outperform the fine-tuning baseline by 0.9 BLEU.
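As a rough sketch of the "freeze most parameters, add a few new ones" recipe, one might freeze an entire pre-trained encoder and train only newly added positional embeddings, which keeps optimizer state (and training-time memory) small. The choice of which modules to freeze and the toy model below are assumptions for illustration, not the paper's exact recipe:

    import torch
    import torch.nn as nn

    def add_trainable_positional_embeddings(model, max_len=1024, d_model=1024):
        """Freeze the pre-trained model and return new, trainable
        positional embeddings (a sketch, not the paper's exact recipe)."""
        # Freeze every pre-trained parameter.
        for p in model.parameters():
            p.requires_grad = False

        # The new positional embeddings are the only trainable parameters,
        # so optimizer state and memory cost at training time stay small.
        new_pos = nn.Embedding(max_len, d_model)
        nn.init.normal_(new_pos.weight, std=0.02)
        return new_pos

    # Example: freeze a toy transformer encoder, train only the new embeddings.
    toy = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=8), num_layers=2)
    new_pos = add_trainable_positional_embeddings(toy)
    optimizer = torch.optim.Adam(new_pos.parameters(), lr=1e-4)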
Discourse-Aware Soft Prompting for Text Generation
Current efficient fine-tuning methods (e.g., adapters, prefix-tuning) optimize
conditional text generation by training a small set of extra parameters of the
neural language model while freezing the rest for efficiency. Although they
show strong performance on some generation tasks, they do not generalize
across all generation tasks. We show that soft-prompt based
conditional text generation can be improved with simple and efficient methods
that simulate modeling the discourse structure of human written text. We
investigate two design choices: First, we apply hierarchical blocking on the
prefix parameters to simulate a higher-level discourse structure of human
written text. Second, we apply attention sparsity on the prefix parameters at
different layers of the network and learn sparse transformations on the
softmax function. We show that structured design of prefix parameters yields
more coherent, faithful, and relevant generations than baseline prefix-tuning
on all generation tasks.
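A minimal sketch of one way "hierarchical blocking" of prefix parameters could be realized: the prefix is split into blocks, and each target position is only allowed to attend to the prefix block matching its discourse segment. The shapes, the segmentation scheme, and the class below are assumptions for illustration, not the paper's design:

    import torch
    import torch.nn as nn

    class BlockedPrefix(nn.Module):
        """Prefix parameters split into discourse-level blocks: block b
        is intended to condition the b-th segment of the output."""

        def __init__(self, n_blocks=4, block_len=8, d_model=768):
            super().__init__()
            self.n_blocks = n_blocks
            self.block_len = block_len
            # One learned prefix block per (assumed) discourse unit.
            self.prefix = nn.Parameter(
                torch.randn(n_blocks, block_len, d_model) * 0.02)

        def attention_mask(self, segment_ids):
            """Boolean mask of shape (tgt_len, n_blocks * block_len):
            position i may only attend to the prefix block of its segment."""
            tgt_len = segment_ids.size(0)
            mask = torch.zeros(tgt_len, self.n_blocks * self.block_len,
                               dtype=torch.bool)
            for i, seg in enumerate(segment_ids.tolist()):
                start = seg * self.block_len
                mask[i, start:start + self.block_len] = True
            return mask

    # Example: 6 target tokens, first 3 in segment 0, last 3 in segment 1.
    prefix = BlockedPrefix()
    print(prefix.attention_mask(torch.tensor([0, 0, 0, 1, 1, 1])).int())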