Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
We introduce a novel method for multilingual transfer that utilizes deep
contextual embeddings, pretrained in an unsupervised fashion. While contextual
embeddings have been shown to yield richer representations of meaning compared
to their static counterparts, aligning them poses a challenge due to their
dynamic nature. To this end, we construct context-independent variants of the
original monolingual spaces and utilize their mapping to derive an alignment
for the context-dependent spaces. This mapping readily supports processing of a
target language, improving transfer by context-aware embeddings. Our
experimental results demonstrate the effectiveness of this approach for
zero-shot and few-shot learning of dependency parsing. Specifically, our method
consistently outperforms the previous state-of-the-art on 6 tested languages,
yielding an improvement of 6.8 LAS points on average.
Comment: NAACL 2019
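A minimal sketch of the anchor-based alignment idea described above, assuming contextual embeddings for both languages are precomputed; the function names, the per-type averaging, and the orthogonal Procrustes fit over a seed dictionary are illustrative assumptions, not the authors' exact procedure.
```python
import numpy as np

def build_anchors(contextual_vectors, token_words):
    """Average all contextual embeddings of each word type to obtain a
    context-independent 'anchor' vector per word (illustrative)."""
    buckets = {}
    for vec, word in zip(contextual_vectors, token_words):
        buckets.setdefault(word, []).append(vec)
    return {w: np.mean(vs, axis=0) for w, vs in buckets.items()}

def align_spaces(src_anchors, tgt_anchors, seed_dictionary):
    """Fit an orthogonal map W over anchor pairs (orthogonal Procrustes);
    the same W is then applied to the context-dependent source embeddings."""
    X = np.stack([src_anchors[s] for s, t in seed_dictionary])
    Y = np.stack([tgt_anchors[t] for s, t in seed_dictionary])
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt  # W such that W @ x_src approximates y_tgt
```
Because the map is learned on the static anchor spaces but applied to the contextual vectors, a target-language parser can consume aligned context-aware embeddings directly at test time.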
Do Transformers Parse while Predicting the Masked Word?
Pre-trained language models have been shown to encode linguistic structures,
e.g. dependency and constituency parse trees, in their embeddings while being
trained on unsupervised loss functions like masked language modeling. Some
doubts have been raised whether the models actually are doing parsing or only
some computation weakly correlated with it. We study the following questions: (a) Is it
possible to explicitly describe transformers with realistic embedding
dimension, number of heads, etc. that are capable of doing parsing -- or even
approximate parsing? (b) Why do pre-trained models capture parsing structure?
This paper takes a step toward answering these questions in the context of
generative modeling with PCFGs. We show that masked language models like BERT
or RoBERTa of moderate sizes can approximately execute the Inside-Outside
algorithm for the English PCFG [Marcus et al., 1993]. We also show that the
Inside-Outside algorithm is optimal for masked language modeling loss on the
PCFG-generated data. We also give a construction of transformers with realistic
numbers of layers, attention heads, and embedding dimensions whose embeddings
can be used to do constituency parsing with a reasonable F1 score on the PTB
dataset. We conduct probing experiments on models
pre-trained on PCFG-generated data to show that this not only allows recovery
of approximate parse trees, but also recovers the marginal span probabilities
computed by the Inside-Outside algorithm, which suggests an implicit bias of
masked language modeling towards this algorithm.
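The Inside-Outside algorithm referenced above can be illustrated by its inside pass; the sketch below assumes a PCFG in Chomsky normal form and uses hypothetical rule dictionaries, and the outside pass needed for the marginal span probabilities is omitted.
```python
from collections import defaultdict

def inside_pass(sentence, lexical_rules, binary_rules):
    """Inside probabilities for a PCFG in Chomsky normal form.
    lexical_rules: {(A, word): prob}  for rules A -> word
    binary_rules:  {(A, B, C): prob}  for rules A -> B C
    Returns inside[(i, j, A)] = P(A derives words i..j inclusive)."""
    n = len(sentence)
    inside = defaultdict(float)
    for i, w in enumerate(sentence):                 # width-1 spans
        for (A, word), p in lexical_rules.items():
            if word == w:
                inside[(i, i, A)] += p
    for width in range(2, n + 1):                    # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width - 1
            for k in range(i, j):                    # split point
                for (A, B, C), p in binary_rules.items():
                    inside[(i, j, A)] += (p * inside[(i, k, B)]
                                            * inside[(k + 1, j, C)])
    return inside
```
Combining these inside scores with the corresponding outside scores gives the per-span marginal probabilities that the probing experiments recover from the pre-trained models' embeddings.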
Learning to Embed Words in Context for Syntactic Tasks
We present models for embedding words in the context of surrounding words.
Such models, which we refer to as token embeddings, represent the
characteristics of a word that are specific to a given context, such as word
sense, syntactic category, and semantic role. We explore simple, efficient
token embedding models based on standard neural network architectures. We learn
token embeddings on a large amount of unannotated text and evaluate them as
features for part-of-speech taggers and dependency parsers trained on much
smaller amounts of annotated data. We find that predictors endowed with token
embeddings consistently outperform baseline predictors across a range of
context window and training set sizes.
Comment: Accepted by ACL 2017 Repl4NLP workshop
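One simple, efficient token-embedding architecture consistent with the description above is a shallow bidirectional LSTM over word vectors; the class below is an illustrative PyTorch sketch whose name and dimensions are assumptions, not the paper's exact models.
```python
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Produce context-dependent embeddings for every token by running a
    bidirectional LSTM over the word vectors of the surrounding sentence."""
    def __init__(self, vocab_size, word_dim=100, hidden_dim=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.encoder = nn.LSTM(word_dim, hidden_dim,
                               batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, seq_len) -> embeddings (batch, seq_len, 2 * hidden_dim)
        token_embeddings, _ = self.encoder(self.word_emb(word_ids))
        return token_embeddings
```
After unsupervised pre-training on unannotated text, the per-token outputs can be concatenated to the input features of a POS tagger or dependency parser trained on a much smaller annotated corpus.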
Robust Multilingual Part-of-Speech Tagging via Adversarial Training
Adversarial training (AT) is a powerful regularization method for neural
networks, aiming to achieve robustness to input perturbations. Yet, the
specific effects of the robustness obtained from AT are still unclear in the
context of natural language processing. In this paper, we propose and analyze a
neural POS tagging model that exploits AT. In our experiments on the Penn
Treebank WSJ corpus and the Universal Dependencies (UD) dataset (27 languages),
we find that AT not only improves the overall tagging accuracy, but also 1)
prevents over-fitting well in low-resource languages and 2) boosts tagging
accuracy for rare/unseen words. We also demonstrate that 3) the improved
tagging performance by AT contributes to the downstream task of dependency
parsing, and that 4) AT helps the model to learn cleaner word representations.
5) The proposed AT model is generally effective in different sequence labeling
tasks. These positive results motivate further use of AT for natural language
tasks.
Comment: NAACL 2018
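The adversarial training recipe described above, in the standard form that perturbs word embeddings in the direction of the loss gradient, can be sketched as follows; the epsilon value and the training-step comments are illustrative assumptions rather than the paper's exact settings.
```python
import torch

def adversarial_perturbation(embeddings, loss, epsilon=0.05):
    """Worst-case (loss-increasing) perturbation of the word embeddings,
    L2-normalized to a fixed magnitude epsilon (illustrative value)."""
    grad, = torch.autograd.grad(loss, embeddings, retain_graph=True)
    grad = grad.detach()
    return epsilon * grad / (grad.norm(p=2) + 1e-12)

# Training-step sketch: the tagging loss on the perturbed embeddings is added
# to the clean loss before back-propagation, acting as a regularizer.
#   clean_loss = tagging_loss(model, embeddings, tags)
#   r_adv      = adversarial_perturbation(embeddings, clean_loss)
#   adv_loss   = tagging_loss(model, embeddings + r_adv, tags)
#   (clean_loss + adv_loss).backward()
```
Because the perturbation targets the continuous embedding space rather than the discrete tokens, the same regularizer applies unchanged across languages and across other sequence labeling tasks.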