Extending Word-Level Quality Estimation for Post-Editing Assistance
We define a novel concept called extended word alignment in order to improve
post-editing assistance efficiency. Based on extended word alignment, we
further propose a novel task called refined word-level QE that outputs refined
tags and word-level correspondences. Compared to original word-level QE, the
new task can directly point out editing operations and thus improves
efficiency. To extract extended word alignment, we adopt a supervised method
based on mBERT. To solve refined word-level QE, we first predict original QE
tags by training a regression model for sequence tagging based on mBERT and
XLM-R. Then, we refine the original word tags with the extended word alignment.
In addition, we extract source-gap correspondences and obtain gap tags along the way.
Experiments on two language pairs show the feasibility of our method and
suggest directions for further improvement.
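
The refinement step can be pictured as a simple post-processing pass over the original tags. Below is a minimal sketch, not the authors' implementation: the alignment format (MT index mapped to a post-edit index or None) and the tag names REPLACE and DELETE are assumptions, and the paper's gap tags and source-gap correspondences are omitted.

    # Minimal sketch (not the authors' code) of refining coarse QE tags with an
    # extended word alignment; tag names and the alignment format are assumptions.
    from typing import Dict, List, Optional

    def refine_tags(
        mt_tokens: List[str],
        qe_tags: List[str],                   # original word-level tags: "OK" / "BAD"
        alignment: Dict[int, Optional[int]],  # MT index -> aligned post-edit index, or None
    ) -> List[str]:
        """Turn OK/BAD tags into tags that hint at a concrete editing operation."""
        refined = []
        for i, tag in enumerate(qe_tags):
            if tag == "OK":
                refined.append("OK")          # word is kept as-is
            elif alignment.get(i) is not None:
                refined.append("REPLACE")     # BAD but aligned: likely a substitution
            else:
                refined.append("DELETE")      # BAD and unaligned: likely a deletion
        return refined

    # Toy example with a hypothetical alignment; gap tags are not modelled here.
    mt = ["das", "Haus", "ist", "gross"]
    tags = ["OK", "BAD", "OK", "BAD"]
    align = {0: 0, 1: 1, 2: 2, 3: None}
    print(list(zip(mt, refine_tags(mt, tags, align))))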
Survey of Low-Resource Machine Translation
We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.
PaLM: Scaling Language Modeling with Pathways
Large language models have been shown to achieve remarkable performance
across a variety of natural language tasks using few-shot learning, which
drastically reduces the number of task-specific training examples needed to
adapt the model to a particular application. To further our understanding of
the impact of scale on few-shot learning, we trained a 540-billion parameter,
densely activated, Transformer language model, which we call the Pathways
Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML
system which enables highly efficient training across multiple TPU Pods. We
demonstrate continued benefits of scaling by achieving state-of-the-art
few-shot learning results on hundreds of language understanding and generation
benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough
performance, outperforming the finetuned state-of-the-art on a suite of
multi-step reasoning tasks, and outperforming average human performance on the
recently released BIG-bench benchmark. A significant number of BIG-bench tasks
showed discontinuous improvements from model scale, meaning that performance
steeply increased as we scaled to our largest model. PaLM also has strong
capabilities in multilingual tasks and source code generation, which we
demonstrate on a wide array of benchmarks. We additionally provide a
comprehensive analysis on bias and toxicity, and study the extent of training
data memorization with respect to model scale. Finally, we discuss the ethical
considerations related to large language models and discuss potential
mitigation strategies.
Structural pruning for speed in neural machine translation
Neural machine translation (NMT) strongly outperforms previous statistical techniques. With
the emergence of the transformer architecture, we consistently train and deploy deeper and
larger models, often with billions of parameters, as an ongoing effort to achieve even better
quality. On the other hand, there is also a constant pursuit for optimisation opportunities to
reduce inference runtime.
Parameter pruning is one of the staple optimisation techniques. Even though coefficient-wise
sparsity is the most popular choice for compression, it does not easily make a model run
faster: sparse matrix multiplication routines require custom approaches, usually depending on
low-level hardware implementations for maximum efficiency. In my thesis, I focus on structural
pruning in the field of NMT, which results in smaller but still dense architectures that do not
need any further modifications to work efficiently.
My research focuses on two main directions. The first explores the Lottery Ticket Hypothesis
(LTH), a well-known pruning algorithm, this time in a structural setup with a custom pruning
criterion. It involves partial training and pruning steps performed in a loop. Experiments with
LTH produced substantial speed-ups when applied to prune heads in the attention mechanism
of a transformer. While this method has proven successful, it prolongs training and makes an
already expensive training routine even more costly.
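
A structural Lottery Ticket loop of this kind can be illustrated with a toy model. The sketch below is not the thesis code: the model, the data, the norm-based head-scoring criterion and the rewinding schedule are placeholders chosen only to show the partial-train / prune / rewind cycle.

    # Toy structural-LTH loop: partially train, score whole attention heads,
    # prune the weakest head, rewind the survivors, and repeat.
    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, n_heads, head_dim = 64, 8, 8

    # Each head is its own projection so that a whole head is a natural pruning unit.
    heads = nn.ModuleList(nn.Linear(d_model, head_dim) for _ in range(n_heads))
    out_proj = nn.Linear(n_heads * head_dim, d_model)
    init_heads = copy.deepcopy(heads.state_dict())   # weights to rewind to
    init_out = copy.deepcopy(out_proj.state_dict())
    active = set(range(n_heads))                     # heads still in the "ticket"

    def forward(x):
        parts = []
        for i, h in enumerate(heads):
            y = h(x)
            if i not in active:
                y = torch.zeros_like(y)              # a pruned head contributes nothing
            parts.append(y)
        return out_proj(torch.cat(parts, dim=-1))

    opt = torch.optim.SGD(list(heads.parameters()) + list(out_proj.parameters()), lr=0.01)

    for round_ in range(3):                          # prune-and-rewind rounds
        for _ in range(50):                          # partial training on a toy task
            x = torch.randn(32, d_model)
            loss = ((forward(x) - x) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Score surviving heads by the L2 norm of their projection weights
        # (a stand-in for the custom criterion), prune the weakest, then rewind.
        scores = {i: heads[i].weight.norm().item() for i in active}
        active.remove(min(scores, key=scores.get))
        heads.load_state_dict(init_heads)
        out_proj.load_state_dict(init_out)
        print(f"round {round_}: kept heads {sorted(active)}")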
From that point, I concentrate exclusively on research incorporating pruning into training via
regularisation. I experiment with a standard group lasso, which zeroes out parameters together
in a pre-defined structural way. By targeting the feedforward and attention layers of a transformer,
group lasso significantly improves inference speed even for already optimised, state-of-the-art fast
models. Improving upon that work, I designed a novel approach called aided regularisation,
where each layer's penalty is scaled based on statistics gathered as training progresses. Both
the gradient- and parameter-based variants aim to decrease the depth of a model, further
optimising speed while maintaining the translation quality of an unpruned baseline.
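
As an illustration of this regularisation direction, the fragment below sketches a group-lasso penalty over the rows of a feed-forward weight matrix, i.e. one group per hidden unit. It is not the thesis implementation; the grouping choice and the strength lam are illustrative.

    # Minimal group-lasso sketch: whole rows of a weight matrix form the groups.
    import torch
    import torch.nn as nn

    def group_lasso(weight: torch.Tensor, dim: int = 1) -> torch.Tensor:
        """Sum of L2 norms over groups; a group can only be driven to zero as a whole."""
        return weight.norm(p=2, dim=dim).sum()

    ffn = nn.Linear(512, 2048)           # stand-in for a transformer feed-forward layer
    x = torch.randn(16, 512)
    task_loss = ffn(x).pow(2).mean()     # placeholder for the real training loss
    lam = 1e-3                           # regularisation strength (hypothetical value)
    loss = task_loss + lam * group_lasso(ffn.weight, dim=1)   # one group per hidden unit
    loss.backward()

    # Hidden units whose entire weight row ends up (near) zero can be removed after
    # training, leaving a smaller but still dense layer that needs no sparse kernels.

In the aided variant described above, the per-layer strength would additionally be scaled using statistics collected as training progresses.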
The goal of this dissertation is to advance the state of the art in efficient NMT with simple but
tangible structural sparsity methods. The majority of experiments in the thesis use highly
optimised models as baselines to show that this work pushes the quality-versus-speed Pareto
frontier forward. For example, it is possible to prune a model to be 50% faster with no change
in translation quality.
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
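
Dependency-based embeddings of the kind mentioned above are trained on (word, syntactic-context) pairs rather than linear word windows. The sketch below is not the paper's pipeline: it shows one common way to extract such pairs with spaCy, in the style of Levy and Goldberg's dependency-based word2vec; the spaCy model name is an assumption, and the resulting pairs would be fed to a word2vec variant that accepts arbitrary contexts.

    # Extract (word, dependency-context) pairs as input for dependency-based embeddings.
    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes this spaCy model is installed

    def dependency_contexts(sentence: str):
        """Yield (word, context) pairs built from dependency arcs."""
        for tok in nlp(sentence):
            if tok.dep_ == "ROOT":
                continue
            # The dependent sees its head through the relation; the head sees the
            # dependent through the inverse relation.
            yield tok.text.lower(), f"{tok.head.text.lower()}/{tok.dep_}"
            yield tok.head.text.lower(), f"{tok.text.lower()}/{tok.dep_}-1"

    print(list(dependency_contexts("Australian scientist discovers star with telescope")))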