Contrastive Search Is What You Need For Neural Text Generation
Generating text with autoregressive language models (LMs) is of great
importance to many natural language processing (NLP) applications. Previous
solutions for this task often produce text that contains degenerative
expressions or lacks semantic consistency. Recently, Su et al. introduced a new
decoding method, contrastive search, based on the isotropic representation
space of the language model and achieved new state-of-the-art results on
various benchmarks. Additionally, Su et al. argued that the representations of
autoregressive LMs (e.g., GPT-2) are intrinsically anisotropic, a view also
shared by previous studies. Therefore, to ensure the language model follows an
isotropic distribution, Su et al. proposed a contrastive learning scheme,
SimCTG, which calibrates the language model's representations through
additional training.
In this study, we first answer the question: "Are autoregressive LMs really
anisotropic?". To this end, we extensively evaluate the isotropy of LMs across
16 major languages. Surprisingly, we find that the anisotropy problem exists
only in two specific models, the English GPT-2-small and GPT-2-medium; all
other evaluated LMs are naturally isotropic, in contrast to the conclusions
drawn by previous studies. Based on our findings, we further
assess the contrastive search decoding method using off-the-shelf LMs on four
generation tasks across 16 languages. Our experimental results demonstrate that
contrastive search significantly outperforms previous decoding methods without
any additional training. More notably, on 12 of the 16 evaluated languages,
contrastive search performs comparably to humans, as judged by human
evaluations. Our code and other related resources are publicly available at
https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need.
Comment: TMLR'2
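The contrastive search decoding rule balances model confidence against a degeneration penalty: the maximum cosine similarity between a candidate token's representation and those of the context generated so far. A minimal sketch of one decoding step, with illustrative names and toy inputs rather than a real LM:

```python
import numpy as np

def contrastive_search_step(cand_probs, cand_hidden, ctx_hidden, alpha=0.6):
    """One contrastive-search decoding step over top-k candidates.

    Each candidate is scored by (1 - alpha) * model confidence minus
    alpha * degeneration penalty, where the penalty is the maximum
    cosine similarity to the hidden states of the context so far.
    """
    # Row-normalize so dot products become cosine similarities.
    cand = cand_hidden / np.linalg.norm(cand_hidden, axis=1, keepdims=True)
    ctx = ctx_hidden / np.linalg.norm(ctx_hidden, axis=1, keepdims=True)
    penalty = (cand @ ctx.T).max(axis=1)  # max cos-sim per candidate
    scores = (1 - alpha) * cand_probs - alpha * penalty
    return int(np.argmax(scores))
```

With alpha = 0 the rule reduces to greedy decoding; with alpha > 0, a highly probable candidate whose representation merely duplicates the context gets penalized, which is what discourages degenerate repetition.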
Variational Learning for A Hierarchical Model of Conversations
Thesis (M.S.) -- Seoul National University Graduate School: College of
Engineering, Department of Computer Science and Engineering, February 2019.
Variational autoencoders (VAE) combined with hierarchical RNNs have emerged as
a powerful framework for conversation modeling. However, they suffer from the
notorious degeneration problem, where the RNN decoders learn to ignore latent
variables and reduce to vanilla RNNs. We empirically show that this degeneracy
occurs mostly due to two reasons. First, the expressive power of hierarchical
RNN decoders is often high enough to model the data using only their decoding
distributions, without relying on latent variables to capture the variability
of the data. Second, the context-conditional VAE structure, whose utterance
generation process is conditioned on the current context of the conversation,
deprives training targets of variability; that is, target utterances in the
training corpus can be deterministically deduced from the context, making the
RNN decoders prone to overfitting given their expressive power. To solve the
degeneration problem, we propose a novel hierarchical model named Variational
Hierarchical Conversation RNNs (VHCR), involving two key ideas: (1) using a
hierarchical structure of latent variables, and (2) exploiting utterance drop
for regularization of hierarchical RNNs. With evaluations on two datasets,
Cornell Movie Dialog and Ubuntu Dialog Corpus, we show that our VHCR
successfully utilizes latent variables and outperforms state-of-the-art models for
conversation generation. Moreover, it can perform several new utterance
control tasks, thanks to its hierarchical latent structure.
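The utterance drop regularizer can be sketched minimally: during training, whole utterance encodings are randomly replaced by a generic "unknown" vector so the context path cannot deterministically explain the next utterance, forcing the decoder to use the latent variables. The zero placeholder and names below are illustrative assumptions, not the authors' implementation (VHCR uses a learned generic embedding):

```python
import numpy as np

def utterance_drop(utt_encodings, drop_prob=0.25, rng=None):
    """Randomly replace whole utterance encodings with a shared 'unknown'
    vector (zeros here as a stand-in) so the context RNN cannot
    deterministically reconstruct the target utterance."""
    rng = np.random.default_rng(rng)
    mask = rng.random(len(utt_encodings)) < drop_prob
    out = utt_encodings.copy()
    out[mask] = 0.0  # placeholder 'unknown' embedding; learned in practice
    return out, mask
```

Dropping at the utterance level (rather than individual units) weakens exactly the deterministic context-to-target path the abstract identifies as a cause of degeneration.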
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation
Despite advances in multilingual neural machine translation (MNMT), we argue
that there are still two major challenges in this area: data imbalance and
representation degeneration. The data imbalance problem refers to the imbalance
in the amount of parallel corpora for all language pairs, especially for
long-tail languages (i.e., very low-resource languages). The representation
degeneration problem refers to the problem of encoded tokens tending to appear
only in a small subspace of the full space available to the MNMT model. To
solve these two issues, we propose Bi-ACL, a framework that uses only
target-side monolingual data and a bilingual dictionary to improve the
performance of the MNMT model. We define two modules, named bidirectional
autoencoder and bidirectional contrastive learning, which we combine with an
online constrained beam search and a curriculum learning sampling strategy.
Extensive experiments show that our proposed method is more effective both in
long-tail languages and in high-resource languages. We also demonstrate that
our approach is capable of transferring knowledge between domains and languages
in zero-shot scenarios.
Comment: Accepted to Findings of EMNLP 2023; added statistical significance
tests. Code available at https://github.com/lavine-lmu/Bi-AC
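The abstract does not spell out the bidirectional contrastive learning module; a generic symmetric InfoNCE-style loss of the kind such frameworks typically build on might look like the sketch below (the function name and temperature are assumptions, not Bi-ACL's exact formulation):

```python
import numpy as np

def symmetric_contrastive_loss(src, tgt, temperature=0.1):
    """Symmetric InfoNCE-style loss: matched (src_i, tgt_i) pairs are
    pulled together while all other pairs in the batch are pushed apart."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature  # (batch, batch) cosine similarities

    def xent_diag(l):
        # Cross-entropy with the diagonal entries as the positive class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average both directions, src->tgt and tgt->src ("bidirectional").
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Pushing non-matching pairs apart in both directions directly counteracts the representation degeneration the paper describes, where encoded tokens collapse into a small subspace.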
IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization
Fine-tuning pre-trained language models (PTLMs), such as BERT and its
improved variant RoBERTa, has been a common practice for advancing performance in
natural language understanding (NLU) tasks. Recent advances in representation
learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings
can significantly improve performance on downstream tasks with faster
convergence and better generalization. The isotropy of the pre-trained
embeddings in PTLMs, however, is relatively under-explored. In this paper, we
analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with
straightforward visualization, and point out two major issues: high variance in
their standard deviation, and high correlation between different dimensions. We
also propose a new network regularization method, isotropic batch normalization
(IsoBN) to address the issues, towards learning more isotropic representations
in fine-tuning by dynamically penalizing dominating principal components. This
simple yet effective fine-tuning method yields an absolute improvement of
about 1.0 point on the average of seven NLU tasks.
Comment: AAAI 202
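The two issues the paper points out for pre-trained [CLS] embeddings, high variance in the per-dimension standard deviations and high correlation between dimensions, can be measured with a small diagnostic. This is an illustrative sketch only; the actual IsoBN method additionally learns a scaling that penalizes dominating principal components during fine-tuning:

```python
import numpy as np

def isotropy_stats(emb):
    """Diagnose isotropy of a batch of embeddings (rows = examples).

    Returns the per-dimension standard deviations (isotropic embeddings
    have near-uniform, unit values) and the mean absolute off-diagonal
    correlation between dimensions (near zero when uncorrelated).
    """
    std = emb.std(axis=0)
    corr = np.corrcoef(emb, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return std, float(np.mean(np.abs(off_diag)))
```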
Is Anisotropy Inherent to Transformers?
The representation degeneration problem is a phenomenon that is widely
observed among self-supervised learning methods based on Transformers. In NLP,
it takes the form of anisotropy, a singular property of hidden representations
which makes them unexpectedly close to each other in terms of angular distance
(cosine-similarity). Some recent works tend to show that anisotropy is a
consequence of optimizing the cross-entropy loss on long-tailed distributions
of tokens. We show in this paper that anisotropy can also be observed
empirically in language models with specific objectives that should not suffer
directly from the same consequences. We also show that the anisotropy problem
extends to Transformers trained on other modalities. Our observations tend to
demonstrate that anisotropy might actually be inherent to Transformer-based
models.
Comment: ACL-SRW 2023 (Poster
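Anisotropy of the kind described here is commonly quantified as the average pairwise cosine similarity of a set of hidden states; values far above zero indicate that the representations occupy a narrow cone. A minimal sketch, assuming a matrix with one hidden state per row:

```python
import numpy as np

def mean_cosine_similarity(h):
    """Average pairwise cosine similarity over all distinct pairs of rows.

    Near 0 for isotropic (directionally uniform) representations; close
    to 1 when the representations collapse into a narrow cone.
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sims = h @ h.T
    n = len(h)
    # Subtract the n self-similarities on the diagonal before averaging.
    return float((sims.sum() - n) / (n * (n - 1)))
```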
InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution
Over recent decades, significant advancements in cross-modal retrieval have
been driven mainly by breakthroughs in visual and linguistic modeling. However,
a recent study shows that multi-modal data representations tend to cluster
within a limited convex cone (the representation degeneration problem), which hinders
retrieval performance due to the inseparability of these representations. In
our study, we first empirically validate the presence of the representation
degeneration problem across multiple cross-modal benchmarks and methods. Next,
to address it, we introduce a novel method, called InvGC, a post-processing
technique inspired by graph convolution and average pooling. Specifically,
InvGC defines the graph topology within the datasets and then applies graph
convolution in a subtractive manner. This method effectively separates
representations by increasing the distances between data points. To improve the
efficiency and effectiveness of InvGC, we propose an advanced graph topology,
LocalAdj, which aims only to increase the distances between each data point and
its nearest neighbors. To understand why InvGC works, we present a detailed
theoretical analysis, proving that the lower bound of recall will be improved
after deploying InvGC. Extensive empirical results show that InvGC and InvGC
w/LocalAdj significantly mitigate the representation degeneration problem,
thereby enhancing retrieval performance.
Our code is available at
https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval
Comment: Findings of EMNLP 202
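The subtractive graph convolution with the LocalAdj topology can be sketched as follows: each point is pushed away from its nearest neighbors by subtracting a fraction of their mean. The function name, the alpha value, and the use of raw neighbor means are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def invgc_local_adj(x, k=2, alpha=0.1):
    """Subtractive graph convolution over a nearest-neighbor topology.

    For each point, subtract alpha times the mean of its k most similar
    points, spreading out representations that collapsed into a narrow
    cone (the representation degeneration problem).
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = xn @ xn.T
    np.fill_diagonal(sims, -np.inf)  # a point is not its own neighbor
    out = x.astype(float).copy()
    for i in range(len(x)):
        nbrs = np.argsort(sims[i])[-k:]  # indices of the k nearest neighbors
        out[i] = x[i] - alpha * x[nbrs].mean(axis=0)
    return out
```

Subtracting the neighbors' shared component removes the common direction that dominates degenerate representations, which is why the average pairwise similarity drops and the points become more separable for retrieval.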