Auto-Encoding Variational Neural Machine Translation
We present a deep generative model of bilingual sentence pairs for machine
translation. The model generates source and target sentences jointly from a
shared latent representation and is parameterised by neural networks. We
perform efficient training using amortised variational inference and
reparameterised gradients. Additionally, we discuss the statistical
implications of joint modelling and propose an efficient approximation to
maximum a posteriori decoding for fast test-time predictions. We demonstrate
the effectiveness of our model in three machine translation scenarios:
in-domain training, mixed-domain training, and learning from a mix of
gold-standard and synthetic data. Our experiments show consistently that our
joint formulation outperforms conditional modelling (i.e. standard neural
machine translation) in all such scenarios.
How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
Sentence encoders map sentences to real-valued vectors for use in downstream
applications. To peek into these representations - e.g., to increase
interpretability of their results - probing tasks have been designed which
query them for linguistic knowledge. However, designing probing tasks for
lesser-resourced languages is tricky, because these often lack large-scale
annotated data or (high-quality) dependency parsers as a prerequisite of
probing task design in English. To investigate how to probe sentence embeddings
in such cases, we examine the sensitivity of probing task results to structural
design choices, conducting the first such large-scale study. We show that
design choices like size of the annotated probing dataset and type of
classifier used for evaluation do (sometimes substantially) influence probing
outcomes. We then probe embeddings in a multilingual setup with design choices
that lie in a 'stable region', as we identify for English, and find that
results on English do not transfer to other languages. Fairer and more
comprehensive sentence-level probing evaluation should thus be carried out on
multiple languages in the future.
Generating CCG Categories
Previous CCG supertaggers usually predict categories using multi-class
classification. Despite their simplicity, internal structures of categories are
usually ignored. The rich semantics inside these structures may help us to
better handle relations among categories and bring more robustness into
existing supertaggers. In this work, we propose to generate categories rather
than classify them: each category is decomposed into a sequence of smaller
atomic tags, and the tagger aims to generate the correct sequence. We show that
with this finer view on categories, annotations of different categories could
be shared and interactions with sentence contexts could be enhanced. The
proposed category generator is able to achieve state-of-the-art tagging (95.5%
accuracy) and parsing (89.8% labeled F1) performances on the standard CCGBank.
Furthermore, its performances on infrequent (even unseen) categories,
out-of-domain texts and low resource language give promising results on
introducing generation models to the general CCG analyses.
Comment: Accepted by AAAI 202
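The idea of decomposing a category into a sequence of smaller atomic tags can be illustrated with a minimal sketch. The splitting rule used here (slashes and parentheses as separate tags, everything else as an atomic category) and the function names `decompose` and `compose` are assumptions made for illustration only; the abstract does not specify the exact tag inventory the authors use.

```python
import re
from typing import List

# Assumed tag inventory: "(", ")", "/", "\" are single tags; any other run of
# characters (e.g. "NP", "S[dcl]") is treated as one atomic category.
_TAG = re.compile(r"[()\\/]|[^()\\/]+")


def decompose(category: str) -> List[str]:
    """Split a CCG category string into a sequence of atomic tags."""
    return _TAG.findall(category)


def compose(tags: List[str]) -> str:
    """Rebuild the category string from its atomic tags."""
    return "".join(tags)


if __name__ == "__main__":
    categories = ["(S\\NP)/NP", "S\\NP", "NP/N", "(S\\NP)/(S\\NP)"]
    for cat in categories:
        print(cat, "->", decompose(cat))
    # Different categories share the same small set of atomic tags.
    shared = sorted({tag for cat in categories for tag in decompose(cat)})
    print("atomic tag vocabulary:", shared)
```

Under this assumed scheme, a tagger that generates tag sequences only needs the small shared vocabulary, so a category never seen as a whole during training can still be produced from tags that were seen.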
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what 'good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to improve both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to make steps towards making state-of-the-art generalisation
testing the new status quo in NLP.
Comment: 35 pages of content + 53 pages of references
Language Modelling with Pixels
Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches, instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT,
further confirming the benefits of modelling language with pixels.
Comment: work in progress
Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking
A typical architecture for end-to-end entity linking systems consists of
three steps: mention detection, candidate generation and entity disambiguation.
In this study we investigate the following questions: (a) Can all those steps
be learned jointly with a model for contextualized text-representations, i.e.
BERT (Devlin et al., 2019)? (b) How much entity knowledge is already contained
in pretrained BERT? (c) Does additional entity knowledge improve BERT's
performance in downstream tasks? To this end, we propose an extreme
simplification of the entity linking setup that works surprisingly well: simply
cast it as a per token classification over the entire entity vocabulary (over
700K classes in our case). We show on an entity linking benchmark that (i) this
model improves the entity representations over plain BERT, (ii) that it
outperforms entity linking architectures that optimize the tasks separately and
(iii) that it only comes second to the current state-of-the-art that does
mention detection and entity disambiguation jointly. Additionally, we
investigate the usefulness of entity-aware token-representations in the
text-understanding benchmark GLUE, as well as the question answering benchmarks
SQuAD v2 and SWAG and also the EN-DE WMT14 machine translation benchmark. To
our surprise, we find that most of those benchmarks do not benefit from
additional entity knowledge, except for a task with very small training data,
the RTE task in GLUE, which improves by 2%.
Comment: Published at CoNLL 201