Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
With the recent success and popularity of pre-trained language models (LMs)
in natural language processing, there has been a rise in efforts to understand
their inner workings. In line with such interest, we propose a novel method
that assists us in investigating the extent to which pre-trained LMs capture
the syntactic notion of constituency. Our method provides an effective way of
extracting constituency trees from the pre-trained LMs without training. In
addition, we report intriguing findings in the induced trees, including the
fact that pre-trained LMs outperform other approaches in correctly demarcating
adverb phrases in sentences. Comment: ICLR 2020
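As a concrete illustration of training-free tree extraction, here is a minimal sketch of the generic distance-based recipe such methods follow; the boundary scores are made up, whereas a real system would derive them from a pre-trained LM's representations or attention. This is an assumed illustration, not the authors' implementation.

    # Minimal sketch: greedy top-down splitting at the largest syntactic distance.
    # The distance list is hand-written here; in practice it would come from a
    # pre-trained LM (e.g., dissimilarity between adjacent hidden states).
    def build_tree(words, distances):
        """distances[i] scores the boundary between words[i] and words[i + 1]."""
        if len(words) == 1:
            return words[0]
        # Split where the boundary score is highest, then recurse on both halves.
        split = max(range(len(distances)), key=lambda i: distances[i])
        left = build_tree(words[:split + 1], distances[:split])
        right = build_tree(words[split + 1:], distances[split + 1:])
        return (left, right)

    words = ["the", "quick", "brown", "fox", "jumps"]
    distances = [0.2, 0.1, 0.15, 0.7]
    print(build_tree(words, distances))
    # (('the', (('quick', 'brown'), 'fox')), 'jumps')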
Self-Training for Unsupervised Parsing with PRPN
Neural unsupervised parsing (UP) models learn to parse without access to
syntactic annotations, while being optimized for another task like language
modeling. In this work, we propose self-training for neural UP models: we
leverage aggregated annotations predicted by copies of our model as supervision
for future copies. To be able to use our model's predictions during training,
we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a) such
that it can be trained in a semi-supervised fashion. We then add examples with
parses predicted by our model to our unlabeled UP training data. Our
self-trained model outperforms the PRPN by 8.1% F1 and the previous state of
the art by 1.6% F1. In addition, we show that our architecture can also be
helpful for semi-supervised parsing in ultra-low-resource settings. Comment: Accepted for publication at the 16th International Conference on
Parsing Technologies (IWPT), 2020
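The aggregation step can be pictured with a small sketch (assumed details, not the paper's code): collect the constituent spans predicted by several copies of the model and keep the spans that a majority of copies agree on as pseudo-labels for the next round of training.

    # Sketch of pseudo-label aggregation across model copies (hypothetical helper,
    # not the PRPN implementation).
    from collections import Counter

    def aggregate_spans(predictions, min_votes):
        """predictions: list of span sets, one per model copy.
        Returns the spans predicted by at least `min_votes` copies."""
        votes = Counter(span for spans in predictions for span in spans)
        return {span for span, count in votes.items() if count >= min_votes}

    # Spans are (start, end) indices over a 5-token sentence.
    copy_a = {(0, 2), (3, 5), (0, 5)}
    copy_b = {(0, 2), (2, 5), (0, 5)}
    copy_c = {(0, 2), (3, 5), (0, 5)}

    pseudo_label = aggregate_spans([copy_a, copy_b, copy_c], min_votes=2)
    print(sorted(pseudo_label))  # [(0, 2), (0, 5), (3, 5)]
    # The agreed-upon spans would then be added to the unlabeled training data
    # as (noisy) supervision for the next training round.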
IDS at SemEval-2020 Task 10: Does Pre-trained Language Model Know What to Emphasize?
We propose a novel method that enables us to determine words that deserve to
be emphasized from written text in visual media, relying only on the
information from the self-attention distributions of pre-trained language
models (PLMs). With extensive experiments and analyses, we show that 1) our
zero-shot approach is superior to a reasonable baseline that adopts TF-IDF and
that 2) there exist several attention heads in PLMs specialized for emphasis
selection, confirming that PLMs are capable of recognizing important words in
sentences.
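As a rough illustration of the zero-shot idea (hand-written attention values standing in for a real PLM head, and a scoring rule that is my assumption rather than the paper's), one can rank words by the attention mass they receive from an emphasis-sensitive head and take the top-k as emphasis candidates.

    # Hypothetical sketch: score words by total attention received from one head.
    import numpy as np

    def emphasis_candidates(words, attention, k):
        """attention[i, j] = attention from token i to token j for a single head."""
        received = attention.sum(axis=0)           # attention mass each word receives
        top = np.argsort(received)[::-1][:k]       # indices of the k highest scores
        return [words[i] for i in sorted(top)]

    words = ["grab", "life", "by", "the", "moment"]
    attention = np.array([
        [0.10, 0.40, 0.05, 0.05, 0.40],
        [0.30, 0.20, 0.05, 0.05, 0.40],
        [0.20, 0.30, 0.10, 0.10, 0.30],
        [0.20, 0.30, 0.10, 0.10, 0.30],
        [0.30, 0.40, 0.05, 0.05, 0.20],
    ])
    print(emphasis_candidates(words, attention, k=2))  # ['life', 'moment']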
On the Branching Bias of Syntax Extracted from Pre-trained Language Models
Many efforts have been devoted to extracting constituency trees from
pre-trained language models, often proceeding in two stages: feature definition
and parsing. However, such methods may suffer from a branching bias issue,
which inflates performance on languages whose dominant branching direction
matches the bias. In this work, we propose quantitatively measuring the branching bias
by comparing the performance gap on a language and its reversed language, which
is agnostic to both language models and extracting methods. Furthermore, we
analyze the impacts of three factors on the branching bias, namely parsing
algorithms, feature definitions, and language models. Experiments show that
several existing works exhibit branching biases, and some implementations of
these three factors can introduce the branching bias. Comment: EMNLP 2020 Findings
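A rough sketch of the measurement (assumed details, not the authors' code): parse both the original sentence and its word-reversed version, mirror the gold spans for the reversed version, and read the branching bias off the gap between the two F1 scores.

    def mirror_spans(spans, n):
        """Map each (start, end) span onto the word-reversed sentence of length n."""
        return {(n - end, n - start) for start, end in spans}

    def span_f1(pred, gold):
        correct = len(pred & gold)
        if correct == 0:
            return 0.0
        precision, recall = correct / len(pred), correct / len(gold)
        return 2 * precision * recall / (precision + recall)

    n = 5
    gold = {(0, 2), (2, 5), (0, 5)}            # gold constituents of the sentence
    pred_original = {(0, 2), (2, 5), (0, 5)}   # parse extracted for the original order
    pred_reversed = {(0, 2), (3, 5), (0, 5)}   # parse extracted for the reversed order

    bias = span_f1(pred_original, gold) - span_f1(pred_reversed, mirror_spans(gold, n))
    print(round(bias, 3))  # 0.333 -> the extractor looks biased toward the original direction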
Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models
As it has been unveiled that pre-trained language models (PLMs) are to some
extent capable of recognizing syntactic concepts in natural language, much
effort has been made to develop a method for extracting complete (binary)
parses from PLMs without training separate parsers. We improve upon this
paradigm by proposing a novel chart-based method and an effective top-K
ensemble technique. Moreover, we demonstrate that we can broaden the scope of
application of the approach into multilingual settings. Specifically, we show
that by applying our method on multilingual PLMs, it becomes possible to induce
non-trivial parses for sentences from nine languages in an integrated and
language-agnostic manner, attaining performance superior or comparable to that
of unsupervised PCFGs. We also verify that our approach is robust to
cross-lingual transfer. Finally, we provide analyses on the inner workings of
our method. For instance, we discover universal attention heads which are
consistently sensitive to syntactic information irrespective of the input
language. Comment: preprint
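To illustrate the chart-based step with made-up span scores (the paper derives its scores from PLM representations, which are not reproduced here): given a score for every candidate span, CKY-style dynamic programming recovers the binary tree whose constituents have the highest total score.

    # Sketch of exhaustive chart-based tree extraction over precomputed span scores.
    from functools import lru_cache

    span_score = {
        (0, 2): 1.0, (1, 3): 0.2, (2, 4): 0.9,
        (0, 3): 0.1, (1, 4): 0.3, (0, 4): 1.0,
    }
    n = 4  # sentence length

    @lru_cache(maxsize=None)
    def best(i, j):
        """Return (score, spans) of the best binary tree over words i..j (exclusive)."""
        if j - i == 1:
            return 0.0, ((i, j),)
        candidates = []
        for k in range(i + 1, j):
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            score = span_score.get((i, j), 0.0) + left_score + right_score
            candidates.append((score, left_spans + right_spans + ((i, j),)))
        return max(candidates)

    score, spans = best(0, n)
    print(score, sorted(spans))
    # 2.9 [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (2, 4), (3, 4)]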
Visually Analyzing Contextualized Embeddings
In this paper we introduce a method for visually analyzing contextualized
embeddings produced by deep neural network-based language models. Our approach
is inspired by linguistic probes for natural language processing, where tasks
are designed to probe language models for linguistic structure, such as
parts-of-speech and named entities. These approaches are largely confirmatory,
however, only enabling a user to test for information known a priori. In this
work, we eschew supervised probing tasks, and advocate for unsupervised probes,
coupled with visual exploration techniques, to assess what is learned by
language models. Specifically, we cluster contextualized embeddings produced
from a large text corpus, and introduce a visualization design based on this
clustering and textual structure - cluster co-occurrences, cluster spans, and
cluster-word membership - to help elicit the functionality of, and relationship
between, individual clusters. User feedback highlights the benefits of our
design in discovering different types of linguistic structures. Comment: IEEE Vis 2020, Observable notebook demo at
https://observablehq.com/@mattberger/visually-analyzing-contextualized-embedding
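A minimal sketch of the unsupervised-probe pipeline, with random vectors standing in for real contextualized states (my assumption, not the authors' tooling): cluster the token embeddings and inspect which words fall into which cluster.

    # Sketch: cluster token embeddings and report cluster-word membership.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    embeddings = rng.normal(size=(len(tokens), 768))   # placeholder for LM hidden states

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

    # Cluster-word membership: which word types land in which cluster.
    membership = {}
    for token, label in zip(tokens, labels):
        membership.setdefault(int(label), set()).add(token)
    print(membership)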
Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation
Recent work on the lottery ticket hypothesis has produced highly sparse
Transformers for NMT while maintaining BLEU. However, it is unclear how such
pruning techniques affect a model's learned representations. By probing sparse
Transformers, we find that complex semantic information is first to be
degraded. Analysis of internal activations reveals that higher layers diverge
most over the course of pruning, gradually becoming less complex than their
dense counterparts. Meanwhile, early layers of sparse models begin to perform
more encoding. Attention mechanisms remain remarkably consistent as sparsity
increases. Comment: 8 pages, 6 figures, 11 supplementary figures
Syntax Representation in Word Embeddings and Neural Networks -- A Survey
Neural networks trained on natural language processing tasks capture syntax
even though it is not provided as a supervision signal. This indicates that
syntactic analysis is essential to the understanding of language in artificial
intelligence systems. This overview paper covers approaches of evaluating the
amount of syntactic information included in the representations of words for
different neural network architectures. We mainly summarize research on
English monolingual data on language modeling tasks and multilingual data for
neural machine translation systems and multilingual language models. We
describe which pre-trained models and representations of language are best
suited for transfer to syntactic tasks.
Analyzing Individual Neurons in Pre-trained Language Models
While a great deal of analysis has been carried out to demonstrate the linguistic
knowledge captured by the representations learned within deep NLP models, very little
attention has been paid to individual neurons. We carry out a neuron-level
analysis using core linguistic tasks of predicting morphology, syntax and
semantics, on pre-trained language models, with questions like: i) do
individual neurons in pre-trained models capture linguistic information? ii)
which parts of the network learn more about certain linguistic phenomena? iii)
how distributed or focused is the information? and iv) how do various
architectures differ in learning these properties? We found small subsets of
neurons to predict linguistic tasks, with lower-level tasks (such as
morphology) localized in fewer neurons, compared to the higher-level task of
predicting syntax. Our study also reveals interesting cross-architectural
comparisons. For example, we found neurons in XLNet to be more localized and
disjoint when predicting properties compared to BERT and others, where they are
more distributed and coupled. Comment: Accepted in EMNLP 2020
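The localization question can be illustrated with a simplified probe (synthetic activations and a sparse linear classifier of my choosing, not the paper's setup): fit an L1-regularized probe on token representations for a linguistic label and count the neurons that carry non-zero weight; fewer active neurons suggests the property is more localized.

    # Sketch: count how many "neurons" a sparse probe needs for a synthetic label.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_tokens, n_neurons = 2000, 768
    X = rng.normal(size=(n_tokens, n_neurons))        # placeholder for activations
    # Make the label depend on only a handful of dimensions (neurons 0-4).
    y = (X[:, :5].sum(axis=1) > 0).astype(int)

    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    active = int(np.sum(np.abs(probe.coef_[0]) > 1e-6))
    print(f"{active} of {n_neurons} neurons carry non-zero probe weight")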
A Systematic Analysis of Morphological Content in BERT Models for Multiple Languages
This work describes experiments which probe the hidden representations of
several BERT-style models for morphological content. The goal is to examine the
extent to which discrete linguistic structure, in the form of morphological
features and feature values, presents itself in the vector representations and
attention distributions of pre-trained language models for five European
languages. The experiments contained herein show that (i) Transformer
architectures largely partition their embedding space into convex sub-regions
highly correlated with morphological feature value, (ii) the contextualized
nature of transformer embeddings allows models to distinguish ambiguous
morphological forms in many, but not all cases, and (iii) very specific
attention head/layer combinations appear to home in on subject-verb agreement.