Overview of the IWSLT 2017 Evaluation Campaign
The IWSLT 2017 evaluation campaign organised three tasks: the Multilingual task, which is about training machine translation systems that handle many-to-many language directions, including so-called zero-shot directions; the Dialogue task, which calls for the integration of context information in machine translation, in order to resolve anaphoric references that typically occur in human-human dialogue turns; and, finally, the Lecture task, which offers the challenge of automatically transcribing and translating real-life university lectures. Following the tradition of these reports, we describe all tasks in detail and present the results of all runs submitted by the participants.
Neural Machine Translation with Dynamic Graph Convolutional Decoder
Existing wisdom demonstrates the significance of syntactic knowledge for the
improvement of neural machine translation models. However, most previous works
merely focus on leveraging the source syntax in the well-known encoder-decoder
framework. In sharp contrast, this paper proposes an end-to-end translation
architecture from the (graph & sequence) structural inputs to the (graph &
sequence) outputs, where the target translation and its corresponding syntactic
graph are jointly modeled and generated. We propose a customized Dynamic
Spatial-Temporal Graph Convolutional Decoder (Dyn-STGCD), which is designed for
consuming source feature representations and their syntactic graph, and
auto-regressively generating the target syntactic graph and tokens
simultaneously. We conduct extensive experiments on five widely acknowledged
translation benchmarks, verifying that our proposal achieves consistent
improvements over baselines and other syntax-aware variants.
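The Dyn-STGCD architecture itself is not specified in this abstract, but its core building block, a graph convolution over token (node) features driven by a syntactic graph, can be sketched in a few lines. The adjacency matrix, features, and weights below are illustrative stand-ins, not the paper's actual parameterization:

```python
import numpy as np

# Minimal graph-convolution step: H' = ReLU(A_hat @ H @ W),
# where A_hat is the adjacency with self-loops, row-normalized.
def gcn_layer(adj, feats, weight):
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)
    a_norm = a_hat / deg[:, None]        # row-normalize by node degree
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Toy 3-token dependency graph (token 1 linked to tokens 0 and 2).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.eye(3)                        # one-hot node features
out = gcn_layer(adj, feats, np.ones((3, 2)))
```

A decoder like the one described would interleave such graph updates with autoregressive token generation; this sketch only shows the message-passing step.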
Reward Gaming in Conditional Text Generation
To align conditional text generation model outputs with desired behaviors,
there has been an increasing focus on training the model using reinforcement
learning (RL) with reward functions learned from human annotations. Under this
framework, we identify three common cases where high rewards are incorrectly
assigned to undesirable patterns: noise-induced spurious correlation, naturally
occurring spurious correlation, and covariate shift. We show that even though
learned metrics achieve high performance on the distribution of the data used
to train the reward function, the undesirable patterns may be amplified during
RL training of the text generation model. While there has been discussion about
reward gaming in the RL or safety community, in this discussion piece, we would
like to highlight reward gaming in the natural language generation (NLG)
community using concrete conditional text generation examples and discuss
potential fixes and areas for future work.
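The spurious-correlation failure mode described above can be illustrated with a deliberately broken toy reward: if quality happened to correlate with length in the reward model's training data, a policy maximizing that reward will pad its outputs. The reward function and candidate strings below are invented for illustration:

```python
# Toy "learned" reward that spuriously equates quality with length,
# a pattern that held on its training distribution but not in general.
def spurious_reward(text):
    return len(text.split())

candidates = [
    "a faithful short translation",
    "words words words words words words words",
]

# An RL-trained generator maximizing this reward converges on the
# degenerate padded output, not the faithful one.
best = max(candidates, key=spurious_reward)
```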
MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy
It has been widely observed that exact or approximate MAP (mode-seeking)
decoding from natural language generation (NLG) models consistently leads to
degenerate outputs (Stahlberg and Byrne, 2019, Holtzman et al., 2019). This has
generally been attributed to either a fundamental inadequacy of modes in models
or weaknesses in language modeling. In contrast, in this work we emphasize
that degenerate modes can even occur in the absence of any model error, due to
contamination of the training data. Specifically, we show that mixing even a
tiny amount of low-entropy noise with a population text distribution can cause
the data distribution's mode to become degenerate, implying that any models
trained on it will be as well. As the unconditional mode of NLG models will
often be degenerate, we therefore propose to apply MAP decoding to the model's
distribution conditional on avoiding specific degeneracies. Using exact search,
we empirically verify that the length-conditional modes of machine translation
models and language models are indeed more fluent and topical than their
unconditional modes. For the first time, we also share many examples of exact
modal sequences from these models, and from several variants of the LLaMA-7B
model. Notably, the modes of the LLaMA models are still degenerate, showing
that improvements in modeling have not fixed this issue. Because of the cost of
exact mode finding algorithms, we develop an approximate mode finding approach,
ACBS, which finds sequences that are both high-likelihood and high-quality. We
apply this approach to LLaMA-7B, a model which was not trained for instruction
following, and find that we are able to elicit reasonable outputs without any
finetuning.
Comment: 49 pages, 3 figures
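The idea of conditioning MAP decoding away from a degeneracy can be shown on a toy language model where the unconditional mode is the empty sequence, but the mode conditional on a fixed length is not. The unigram probabilities below are made up, and brute-force enumeration stands in for the paper's exact search or ACBS:

```python
import itertools

# Toy unigram LM with an end-of-sequence symbol:
# at every step, p(eos)=0.5, p("A")=0.3, p("B")=0.2.
p = {"eos": 0.5, "A": 0.3, "B": 0.2}

def seq_prob(seq):
    prob = 1.0
    for tok in seq:
        prob *= p[tok]
    return prob * p["eos"]        # every sequence terminates with eos

def mode(length=None, max_len=4):
    # length=None -> unconditional MAP; length=k -> length-conditional MAP
    lengths = [length] if length is not None else range(max_len + 1)
    best = None
    for n in lengths:
        for seq in itertools.product(["A", "B"], repeat=n):
            if best is None or seq_prob(seq) > seq_prob(best):
                best = seq
    return best
```

Here the unconditional mode is the empty (degenerate) sequence, while the length-2 conditional mode is ("A", "A") — the same qualitative gap the paper reports between unconditional and length-conditional modes of real models.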
Rethinking Word-Level Auto-Completion in Computer-Aided Translation
Word-Level Auto-Completion (WLAC) plays a crucial role in Computer-Assisted
Translation. It aims at providing word-level auto-completion suggestions for
human translators. While previous studies have primarily focused on designing
complex model architectures, this paper takes a different perspective by
rethinking the fundamental question: what kind of words are good
auto-completions? We introduce a measurable criterion to answer this question
and discover that existing WLAC models often fail to meet this criterion.
Building upon this observation, we propose an effective approach to enhance
WLAC performance by promoting adherence to the criterion. Notably, the proposed
approach is general and can be applied to various encoder-based architectures.
Through extensive experiments, we demonstrate that our approach outperforms the
top-performing system submitted to the WLAC shared tasks in WMT2022, while
utilizing significantly smaller model sizes.
Comment: EMNLP2023
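The WLAC setting itself is easy to sketch: given the prefix a translator has typed, suggest the highest-scoring vocabulary word consistent with it. The candidate scores below are hypothetical stand-ins for a model's output distribution, not the paper's method:

```python
def wlac_suggest(typed_prefix, candidates):
    """Return the highest-scoring candidate word that starts with the
    typed prefix, or None if no candidate matches."""
    matching = {w: s for w, s in candidates.items()
                if w.startswith(typed_prefix)}
    return max(matching, key=matching.get) if matching else None

scores = {"translate": 0.6, "train": 0.3, "apple": 0.9}
suggestion = wlac_suggest("tra", scores)
```

The paper's point is that the scoring function should be trained so the suggested word actually belongs in the final translation, a criterion this naive prefix filter does not by itself guarantee.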
Design and Evaluation of Machine Translation Systems Oriented Toward Real-World Applications
Tohoku University doctoral (Information Sciences) thesis
An Open Dataset and Model for Language Identification
Language identification (LID) is a fundamental step in many natural language
processing pipelines. However, current LID systems are far from perfect,
particularly on lower-resource languages. We present a LID model which achieves
a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201
languages, outperforming previous work. We achieve this by training on a
curated dataset of monolingual data, the reliability of which we ensure by
auditing a sample from each source and each language manually. We make both the
model and the dataset available to the research community. Finally, we carry
out detailed analysis into our model's performance, both in comparison to
existing open models and by language class.
Comment: To be published in ACL 2023
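The headline numbers (macro-average F1 across languages) are straightforward to compute from per-language confusion counts; the counts below are invented for illustration, not taken from the paper:

```python
def per_language_f1(tp, fp, fn):
    # Standard F1 from true positives, false positives, false negatives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical (tp, fp, fn) counts for three languages.
counts = {"eng": (90, 5, 10), "swa": (40, 8, 5), "yor": (20, 2, 15)}

# Macro-average: each language weighted equally, so low-resource
# languages count as much as high-resource ones.
macro_f1 = sum(per_language_f1(*c) for c in counts.values()) / len(counts)
```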
Accelerating Transformer Inference for Translation via Parallel Decoding
Autoregressive decoding limits the efficiency of transformers for Machine
Translation (MT). The community proposed specific network architectures and
learning-based methods to solve this issue, which are expensive and require
changes to the MT model, trading inference speed at the cost of the translation
quality. In this paper, we propose to address the problem from the point of
view of decoding algorithms, as a less explored but rather compelling
direction. We propose to reframe the standard greedy autoregressive decoding of
MT with a parallel formulation leveraging Jacobi and Gauss-Seidel fixed-point
iteration methods for fast inference. This formulation allows speeding up
existing models without training or modifications while retaining translation
quality. We present three parallel decoding algorithms and test them on
different languages and models, showing that parallelization yields a speedup
of up to 38% over standard autoregressive decoding, and nearly 2x when scaling
the method on parallel resources. Finally, we introduce a decoding dependency
graph visualizer (DDGviz) that lets us see how the model has learned the
conditional dependence between tokens and inspect the decoding procedure.
Comment: Accepted at ACL 2023 main conference
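The Jacobi-style parallel decoding described above can be sketched with a toy deterministic "model" standing in for a transformer's greedy next-token choice: initialize every position with a guess, refine all positions in parallel from the previous iterate, and stop at a fixed point, which coincides with greedy autoregressive decoding:

```python
# Toy stand-in for argmax over a model's next-token distribution:
# next token is (previous token + 1) mod 10.
def toy_next_token(prefix):
    return (prefix[-1] + 1) % 10

def greedy_decode(start, length):
    # Standard autoregressive decoding: one position per step.
    seq = [start]
    for _ in range(length):
        seq.append(toy_next_token(seq))
    return seq

def jacobi_decode(start, length, max_iters=50):
    # Jacobi iteration: update ALL positions in parallel from the
    # previous iterate, stop when the sequence no longer changes.
    seq = [start] + [0] * length
    for _ in range(max_iters):
        new = [start] + [toy_next_token(seq[:i])
                         for i in range(1, length + 1)]
        if new == seq:          # fixed point reached
            break
        seq = new
    return seq
```

With a real model each parallel update is one batched forward pass, so fewer iterations than sequence length translates into wall-clock speedup; Gauss-Seidel variants reuse already-updated positions within an iteration to converge faster.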
LENS: A Learnable Evaluation Metric for Text Simplification
Training learnable metrics using modern language models has recently emerged
as a promising method for the automatic evaluation of machine translation.
However, existing human evaluation datasets for text simplification have
limited annotations that are based on unitary or outdated models, making them
unsuitable for this approach. To address these issues, we introduce the
SimpEval corpus that contains: SimpEval_past, comprising 12K human ratings on
2.4K simplifications of 24 past systems, and SimpEval_2022, a challenging
simplification benchmark consisting of over 1K human ratings of 360
simplifications including GPT-3.5 generated text. Training on SimpEval, we
present LENS, a Learnable Evaluation Metric for Text Simplification. Extensive
empirical results show that LENS correlates much better with human judgment
than existing metrics, paving the way for future progress in the evaluation of
text simplification. We also introduce Rank and Rate, a human evaluation
framework that rates simplifications from several models in a list-wise manner
using an interactive interface, which ensures both consistency and accuracy in
the evaluation process and is used to create the SimpEval datasets.
Comment: Accepted at ACL 2023
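Metric quality in this line of work is measured by correlation with human judgments; a minimal version of that evaluation is a Pearson correlation between metric scores and human ratings. The numbers below are illustrative, not from SimpEval:

```python
import statistics

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from scratch.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [1.0, 2.0, 3.5, 4.0]   # hypothetical human ratings of 4 outputs
metric = [0.2, 0.4, 0.7, 0.9]  # hypothetical learned-metric scores
r = pearson(human, metric)
```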
Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation
Zero-shot translation (ZST), which is generally based on a multilingual
neural machine translation model, aims to translate between language pairs
unseen in the training data. The common practice for guiding the zero-shot
language mapping during inference is to deliberately insert the source and
target language IDs, e.g., dedicated ID tokens for English and for German.
Recent studies have shown that language IDs sometimes fail to navigate the ZST
task, causing the off-target problem (non-target-language words appear in the
generated translation) and therefore making it difficult to apply the current
multilingual translation model to a broad range of zero-shot language
scenarios. To understand when and why the navigation capabilities of language
IDs are weakened, we compare two extreme decoder input cases in the ZST
directions: Off-Target (OFF) and On-Target (ON) cases. By contrastively
visualizing the contextual word representations (CWRs) of these cases with
teacher forcing, we show that 1) the CWRs of different languages are
effectively distributed in separate regions when the sentence and ID are
matched (ON setting), and 2) if the sentence and ID are unmatched (OFF
setting), the CWRs of different languages are chaotically distributed. Our
analyses suggest that although they work well in ideal ON settings, language
IDs become fragile and lose their navigation ability when faced with off-target
tokens, which commonly exist during inference but are rare in training
scenarios. In response, we employ unlikelihood tuning on the negative (OFF)
samples to minimize their probability such that the language IDs can
discriminate between the on- and off-target tokens during training. Experiments
spanning 40 ZST directions show that our method reduces the off-target ratio by
48.0% on average, leading to a +9.1 BLEU improvement with only an extra +0.3%
tuning cost.
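The unlikelihood objective applied to the negative (OFF) samples has a compact form: for each off-target token, push down its probability by maximizing log(1 - p). A minimal sketch with a made-up next-token distribution:

```python
import math

def unlikelihood_loss(probs, negative_ids):
    # L_UL = - sum over negative tokens t of log(1 - p(t));
    # minimizing it drives the probability of off-target tokens toward 0.
    return -sum(math.log(1.0 - probs[t]) for t in negative_ids)

# Toy next-token distribution over a 3-token vocabulary;
# tokens 1 and 2 are treated as off-target negatives.
probs = {0: 0.7, 1: 0.2, 2: 0.1}
loss = unlikelihood_loss(probs, [1, 2])
```

In training, this term is added alongside the usual likelihood loss on on-target tokens, so the language IDs learn to separate on-target from off-target continuations.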