Generative and Discriminative Text Classification with Recurrent Neural Networks
We empirically characterize the performance of discriminative and generative
LSTM models for text classification. We find that although RNN-based generative
models are more powerful than their bag-of-words ancestors (e.g., they account
for conditional dependencies across words in a document), they have higher
asymptotic error rates than discriminatively trained RNN models. However, we
also find that generative models approach their asymptotic error rate more
rapidly than their discriminative counterparts---the same pattern that Ng &
Jordan (2001) proved holds for linear classification models that make more
naive conditional independence assumptions. Building on this finding, we
hypothesize that RNN-based generative classification models will be more robust
to shifts in the data distribution. This hypothesis is confirmed in a series of
experiments in zero-shot and continual learning settings that show that
generative models substantially outperform discriminative models.
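As a minimal PyTorch sketch of the generative decision rule contrasted above (illustrative only; the class names, dimensions, and toy data are assumptions, not the paper's implementation): a generative classifier scores a document with per-class LSTM language models and predicts argmax_y log p(x|y) + log p(y), whereas a discriminative model would map the document directly to p(y|x).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionalLM(nn.Module):
    """LSTM language model estimating p(x | y) for a single class."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def log_prob(self, tokens):                     # tokens: (batch, seq_len)
        x, y = tokens[:, :-1], tokens[:, 1:]
        h, _ = self.lstm(self.emb(x))
        logp = F.log_softmax(self.out(h), dim=-1)
        # sum of log p(w_t | w_<t) over the sequence
        return logp.gather(-1, y.unsqueeze(-1)).squeeze(-1).sum(dim=1)

class GenerativeClassifier(nn.Module):
    """Predict argmax_y log p(x | y) + log p(y)."""
    def __init__(self, vocab_size, num_classes):
        super().__init__()
        self.lms = nn.ModuleList(ClassConditionalLM(vocab_size) for _ in range(num_classes))
        self.log_prior = nn.Parameter(torch.zeros(num_classes))

    def forward(self, tokens):
        scores = torch.stack([lm.log_prob(tokens) for lm in self.lms], dim=1)
        return scores + F.log_softmax(self.log_prior, dim=0)   # (batch, num_classes)

# Toy usage: the class with the highest joint log-probability wins.
toks = torch.randint(0, 1000, (2, 20))
clf = GenerativeClassifier(vocab_size=1000, num_classes=2)
pred = clf(toks).argmax(dim=1)
```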
Learning Word Representations with Hierarchical Sparse Coding
We propose a new method for learning word representations using hierarchical
regularization in sparse coding inspired by the linguistic study of word
meanings. We present an efficient learning algorithm based on stochastic proximal
methods that is significantly faster than previous approaches, making it
possible to perform hierarchical sparse coding on a corpus of billions of word
tokens. Experiments on various benchmark tasks---word similarity ranking,
analogies, sentence completion, and sentiment analysis---demonstrate that the
method outperforms or is competitive with state-of-the-art methods. Our word
representations are available at
\url{http://www.ark.cs.cmu.edu/dyogatam/wordvecs/}.
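To make the optimization concrete, here is a small NumPy sketch of one stochastic proximal update for sparse coding: a gradient step on the squared reconstruction error followed by a proximal (soft-thresholding) step. Plain L1 regularization stands in for the paper's tree-structured hierarchical penalty, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def soft_threshold(a, t):
    """Proximal operator of t * ||a||_1 (elementwise soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def prox_step(x, D, a, lr=0.1, lam=0.01):
    """One proximal gradient update of code `a` for signal `x` and dictionary `D`."""
    grad = D.T @ (D @ a - x)            # gradient of 0.5 * ||x - D a||^2
    return soft_threshold(a - lr * grad, lr * lam)

# Toy usage: one word's context vector x, dictionary D, sparse code a.
rng = np.random.default_rng(0)
x = rng.normal(size=50)                  # e.g. a column of the word-context matrix
D = rng.normal(size=(50, 20))            # dictionary (basis vectors)
a = np.zeros(20)
for _ in range(100):
    a = prox_step(x, D, a)
```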
The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining
We analyze the masked language modeling pretraining objective function from
the perspective of the distributional hypothesis. We investigate whether the
better sample efficiency and generalization capability of models pretrained
with masked language modeling can be attributed to the semantic similarity
encoded in the pretraining data's distributional property. Via a synthetic
dataset, our analysis suggests that the distributional property indeed leads to
better sample efficiency of pretrained masked language models, but does not
fully explain their generalization capability. We also conduct analyses on two
real-world datasets and demonstrate that the distributional property does not
explain the generalization ability of pretrained natural language models
either. Our results illustrate our limited understanding of model pretraining
and suggest directions for future research.
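For reference, a minimal PyTorch sketch of the masked language modeling objective analyzed above (BERT-style masking of roughly 15% of tokens; the toy model and all names are assumptions, not the paper's setup):

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_id, mask_prob=0.15):
    """tokens: (batch, seq_len) ids; model maps ids -> (batch, seq_len, vocab) logits."""
    mask = torch.rand(tokens.shape) < mask_prob          # positions to predict
    inputs = tokens.masked_fill(mask, mask_id)           # replace them with [MASK]
    labels = tokens.masked_fill(~mask, -100)             # ignore unmasked positions
    logits = model(inputs)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

# Toy usage with a trivial "model" (embedding + linear head), just to exercise the loss.
vocab, mask_id = 1000, 999
toy = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
batch = torch.randint(0, 998, (4, 16))
loss = mlm_loss(toy, batch, mask_id)
```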
On the Cross-lingual Transferability of Monolingual Representations
State-of-the-art unsupervised multilingual models (e.g., multilingual BERT)
have been shown to generalize in a zero-shot cross-lingual setting. This
generalization ability has been attributed to the use of a shared subword
vocabulary and joint training across multiple languages giving rise to deep
multilingual abstractions. We evaluate this hypothesis by designing an
alternative approach that transfers a monolingual model to new languages at the
lexical level. More concretely, we first train a transformer-based masked
language model on one language, and transfer it to a new language by learning a
new embedding matrix with the same masked language modeling objective, freezing
parameters of all other layers. This approach does not rely on a shared
vocabulary or joint training. However, we show that it is competitive with
multilingual BERT on standard cross-lingual classification benchmarks and on a
new Cross-lingual Question Answering Dataset (XQuAD). Our results contradict
common beliefs about the basis of the generalization ability of multilingual
models and suggest that deep monolingual models learn some abstractions that
generalize across languages. We also release XQuAD as a more comprehensive
cross-lingual benchmark, which comprises 240 paragraphs and 1190
question-answer pairs from SQuAD v1.1 translated into ten languages by
professional translators.
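A hedged sketch of the lexical-level transfer recipe described in the abstract, assuming the Hugging Face transformers API and illustrative hyperparameters (the paper does not prescribe this library or these values): the pretrained transformer body is frozen, and only a freshly initialized embedding matrix for the target-language vocabulary is trained with the same MLM objective.

```python
import torch
from transformers import BertForMaskedLM

# Load a monolingual (L1) masked language model.
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Resize the (tied) input/output embeddings to the new-language (L2) vocabulary
# and re-initialize them; `l2_vocab_size` is an illustrative value.
l2_vocab_size = 32000
model.resize_token_embeddings(l2_vocab_size)
emb = model.get_input_embeddings()
torch.nn.init.normal_(emb.weight, std=model.config.initializer_range)

# Freeze every parameter except the new embedding matrix.
for p in model.parameters():
    p.requires_grad = False
emb.weight.requires_grad = True

optimizer = torch.optim.Adam([emb.weight], lr=1e-4)
# ...then continue standard masked language modeling on L2 text, updating
# only `emb.weight`, before evaluating zero-shot transfer on tasks such as XQuAD.
```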