Auto-Encoding Variational Neural Machine Translation
We present a deep generative model of bilingual sentence pairs for machine
translation. The model generates source and target sentences jointly from a
shared latent representation and is parameterised by neural networks. We
perform efficient training using amortised variational inference and
reparameterised gradients. Additionally, we discuss the statistical
implications of joint modelling and propose an efficient approximation to
maximum a posteriori decoding for fast test-time predictions. We demonstrate
the effectiveness of our model in three machine translation scenarios:
in-domain training, mixed-domain training, and learning from a mix of
gold-standard and synthetic data. Our experiments show consistently that our
joint formulation outperforms conditional modelling (i.e. standard neural
machine translation) in all such scenarios.
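The training recipe the abstract names, amortised variational inference with reparameterised gradients, can be sketched in a few lines. This is a hypothetical, minimal illustration under the usual assumptions (a diagonal-Gaussian posterior and a standard-normal prior), not the paper's implementation:

```python
import math
import random

# Encoder networks would predict (mu, log_var) of q(z | x, y); here we
# only sketch the sampling and objective around those predictions.

def reparameterise(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))

def elbo(joint_log_likelihood, mu, log_var):
    """Single-sample ELBO estimate: E_q[log p(x, y | z)] - KL(q || prior)."""
    return joint_log_likelihood - kl_to_standard_normal(mu, log_var)
```

Because `z` is a deterministic function of `(mu, log_var)` and the noise `eps`, gradients of the ELBO flow through the sampling step, which is the point of the reparameterisation.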
From general language understanding to noisy text comprehension
Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. To this end, we propose a new generic methodology for deriving a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Classification-accuracy experiments are carried out with sentence vectors derived from GloVe-based pre-trained models and Sentence-BERT, and from different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence-vector length matters less for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform the existing state-of-the-art sentence vectors. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
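The core operation, deriving a sentence vector from the hidden layers of a multi-layer model, can be sketched as follows. The pooling scheme (mean over tokens, then mean over the chosen layers) and the layer selection are illustrative assumptions, not the paper's exact method:

```python
# hidden_states[l][t] is the representation of token t at layer l,
# as produced by a multi-layer pre-trained encoder.

def mean_pool(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sentence_vector(hidden_states, layers):
    """Pool tokens within each chosen layer, then pool across layers."""
    per_layer = [mean_pool(hidden_states[l]) for l in layers]
    return mean_pool(per_layer)
```

With this interface, probing which layers encode which linguistic attributes amounts to varying the `layers` argument and measuring downstream classification accuracy.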
Language Modelling with Pixels
Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches, instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT,
further confirming the benefits of modelling language with pixels.
Comment: work in progress.
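The pretraining objective, reconstructing the pixels of masked patches rather than predicting tokens, can be sketched as follows. This is a hypothetical illustration: rendered text is assumed to arrive as a sequence of fixed-size pixel patches, and the mask ratio is an assumption:

```python
import random

def mask_patches(patches, mask_ratio=0.25, rng=random):
    """Zero out a random subset of patches; return (input, targets, indices).

    The model sees `model_input` and is trained to reconstruct `targets`,
    the original pixel values at the masked positions.
    """
    n_mask = max(1, int(len(patches) * mask_ratio))
    masked = set(rng.sample(range(len(patches)), n_mask))
    model_input = [[0.0] * len(p) if i in masked else list(p)
                   for i, p in enumerate(patches)]
    indices = sorted(masked)
    targets = [patches[i] for i in indices]
    return model_input, targets, indices
```

Since the target lives in pixel space, the output layer never needs a vocabulary over tokens, which is exactly the bottleneck the paper is avoiding.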
Generating CCG Categories
Previous CCG supertaggers usually predict categories using multi-class
classification. Despite their simplicity, internal structures of categories are
usually ignored. The rich semantics inside these structures may help us to
better handle relations among categories and bring more robustness into
existing supertaggers. In this work, we propose to generate categories rather
than classify them: each category is decomposed into a sequence of smaller
atomic tags, and the tagger aims to generate the correct sequence. We show that
with this finer view on categories, annotations of different categories could
be shared and interactions with sentence contexts could be enhanced. The
proposed category generator achieves state-of-the-art tagging (95.5%
accuracy) and parsing (89.8% labeled F1) performance on the standard CCGBank.
Furthermore, its performance on infrequent (even unseen) categories,
out-of-domain texts and low-resource languages gives promising results for
introducing generation models to general CCG analysis.
Comment: Accepted by AAAI 202
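The decomposition step, splitting a category into a sequence of smaller atomic tags for a generator to emit left to right, can be sketched as below. The tag inventory (atomic categories with optional features, slashes, brackets) is an illustrative assumption:

```python
import re

# An atomic tag is either an atomic category, optionally with a feature
# such as S[dcl], or one of the structural symbols ( ) \ /.
ATOM = re.compile(r"[A-Za-z]+(?:\[[a-z]+\])?|[()\\/]")

def decompose(category):
    """Split e.g. '(S[dcl]\\NP)/NP' into its atomic tag sequence."""
    return ATOM.findall(category)

def recompose(tags):
    """Inverse of decompose: join a tag sequence back into a category."""
    return "".join(tags)
```

Because atomic tags are shared across categories, annotations for rare categories overlap with those for frequent ones, which is what lets the generator handle infrequent and unseen categories.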
Neural Unsupervised Domain Adaptation in NLP—A Survey
Deep neural networks excel at learning from labeled data and achieve
state-of-the-art results on a wide array of Natural Language Processing tasks.
In contrast, learning from unlabeled data, especially under domain shift,
remains a challenge. Motivated by the latest advances, in this survey we review
neural unsupervised domain adaptation techniques which do not require labeled
target domain data. This is a more challenging yet more widely applicable
setup. We outline methods, from early traditional non-neural approaches
to pre-trained model transfer. We also revisit the notion of domain,
and we uncover a bias in the types of Natural Language Processing tasks that
have received the most attention. Lastly, we outline future directions, particularly the
broader need for out-of-distribution generalization of future intelligent NLP.
Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking
A typical architecture for end-to-end entity linking systems consists of
three steps: mention detection, candidate generation and entity disambiguation.
In this study we investigate the following questions: (a) Can all those steps
be learned jointly with a model for contextualized text-representations, i.e.
BERT (Devlin et al., 2019)? (b) How much entity knowledge is already contained
in pretrained BERT? (c) Does additional entity knowledge improve BERT's
performance in downstream tasks? To this end, we propose an extreme
simplification of the entity linking setup that works surprisingly well: simply
cast it as a per-token classification over the entire entity vocabulary (over
700K classes in our case). We show on an entity linking benchmark that (i) this
model improves the entity representations over plain BERT, (ii) that it
outperforms entity linking architectures that optimize the tasks separately and
(iii) that it only comes second to the current state-of-the-art that does
mention detection and entity disambiguation jointly. Additionally, we
investigate the usefulness of entity-aware token-representations in the
text-understanding benchmark GLUE, as well as the question answering benchmarks
SQuAD v2 and SWAG, as well as the EN-DE WMT14 machine translation benchmark. To
our surprise, we find that most of those benchmarks do not benefit from
additional entity knowledge, except for a task with very small training data,
the RTE task in GLUE, which improves by 2%.
Comment: Published at CoNLL 201
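The "extreme simplification" described above, entity linking cast as per-token classification over the whole entity vocabulary, can be sketched as follows. Names and shapes are assumptions: each token's contextual vector is scored against an output matrix with one row per entity (plus a NIL class for non-mentions):

```python
def link_tokens(token_vectors, entity_matrix, entity_names):
    """Pick, per token, the highest-scoring entity by dot-product score.

    token_vectors: one contextual vector per token (e.g. from BERT).
    entity_matrix: one weight row per entity class, including NIL.
    """
    predictions = []
    for vec in token_vectors:
        scores = [sum(v * w for v, w in zip(vec, row))
                  for row in entity_matrix]
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append(entity_names[best])
    return predictions
```

With roughly 700K entity rows this output layer is large, but the setup collapses mention detection, candidate generation and disambiguation into a single classification head on top of the encoder.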
Robust input representations for low-resource information extraction
Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models, e.g. through domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods on various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
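The meta-embedding idea, a joint representation built from several embedding methods, can be sketched in its simplest form as concatenation. This is a hypothetical illustration of the general technique, not the thesis's actual architecture, which is richer than plain concatenation:

```python
def meta_embed(word, embedders):
    """Concatenate the vectors produced by each embedding method.

    embedders: callables mapping a word to a vector; their output
    dimensions may differ, so the joint vector's size is their sum.
    """
    joint = []
    for embed in embedders:
        joint.extend(embed(word))
    return joint
```

A downstream tagger then consumes the joint vector, so words covered poorly by one embedding method (e.g. domain-specific terms) can still be represented by another.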