17,736 research outputs found

    Contrastive Search Is What You Need For Neural Text Generation

    Full text link
    Generating text with autoregressive language models (LMs) is of great importance to many natural language processing (NLP) applications. Previous solutions for this task often produce text that contains degenerative expressions or lacks semantic consistency. Recently, Su et al. introduced a new decoding method, contrastive search, based on the isotropic representation space of the language model and obtained new state-of-the-art results on various benchmarks. Additionally, Su et al. argued that the representations of autoregressive LMs (e.g. GPT-2) are intrinsically anisotropic, a conclusion also shared by previous studies. Therefore, to ensure the language model follows an isotropic distribution, Su et al. proposed a contrastive learning scheme, SimCTG, which calibrates the language model's representations through additional training. In this study, we first answer the question: "Are autoregressive LMs really anisotropic?" To this end, we extensively evaluate the isotropy of LMs across 16 major languages. Surprisingly, we find that the anisotropy problem only exists in the two specific English GPT-2-small/medium models, while all other evaluated LMs are naturally isotropic, in contrast to the conclusion drawn by previous studies. Based on our findings, we further assess the contrastive search decoding method using off-the-shelf LMs on four generation tasks across 16 languages. Our experimental results demonstrate that contrastive search significantly outperforms previous decoding methods without any additional training. More notably, on 12 out of the 16 evaluated languages, contrastive search performs comparably with human-level performance as judged by human evaluations. Our code and other related resources are publicly available at https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need. Comment: TMLR'2
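    As a rough illustration of the decoding rule this abstract refers to, the sketch below implements one contrastive search step with an off-the-shelf GPT-2 model: each of the top-k candidate tokens is scored by its model confidence minus a degeneration penalty, namely the maximum cosine similarity between its hidden state and those of the preceding context. The hyperparameters (k=4, alpha=0.6) and the prompt are illustrative choices, not values prescribed by the paper.

```python
# Minimal sketch of one contrastive search decoding step, assuming a
# HuggingFace-style causal LM (GPT-2 here); k, alpha, and the prompt are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def contrastive_step(input_ids, k=4, alpha=0.6):
    """Pick the next token by balancing model confidence against a
    degeneration penalty (max cosine similarity to the context tokens)."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        probs = torch.softmax(out.logits[0, -1], dim=-1)
        top_p, top_ids = probs.topk(k)
        context_h = out.hidden_states[-1][0]                       # (seq_len, dim)
        context_h = torch.nn.functional.normalize(context_h, dim=-1)
        scores = []
        for p, tok in zip(top_p, top_ids):
            ext = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
            h = model(ext, output_hidden_states=True).hidden_states[-1][0, -1]
            h = torch.nn.functional.normalize(h, dim=-1)
            penalty = (context_h @ h).max()                        # similarity to context
            scores.append((1 - alpha) * p - alpha * penalty)
        best = top_ids[torch.stack(scores).argmax()]
    return torch.cat([input_ids, best.view(1, 1)], dim=-1)

ids = tokenizer("DeepMind Company is", return_tensors="pt").input_ids
for _ in range(32):
    ids = contrastive_step(ids)
print(tokenizer.decode(ids[0]))
```

    Recent versions of the HuggingFace transformers library expose the same strategy directly via model.generate(..., penalty_alpha=0.6, top_k=4).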

    Variational Learning for A Hierarchical Model of Conversations

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ๊น€๊ฑดํฌ.๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง๊ณผ (hierarchical RNNS) ๊ฒฐํ•ฉ๋œ Variational autoencoders (VAE) ๋Š” ๋Œ€ํ™” ๋ชจ๋ธ๋ง์„ ์œ„ํ•œ ๊ฐ•๋ ฅํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์€ ์ž ์žฌ๋ณ€์ˆ˜ (latent variable)์„ ๋ฌด์‹œํ•˜๋„๋ก ํ•™์Šต๋˜๋Š” degeneration ๋ฌธ์ œ๋ฅผ ๊ฒช๋Š”๋‹ค. ์šฐ๋ฆฌ๋Š” ์‹คํ—˜์ ์œผ๋กœ ์ด ๋ฌธ์ œ์— ํฌ๊ฒŒ 2๊ฐ€์ง€ ์›์ธ์ด ์žˆ๋Š” ๊ฒƒ์„ ๋ฐํžŒ๋‹ค. ์ฒซ์งธ, ๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง์˜ ์ž๊ธฐํšŒ๊ท€์  (autoregressive) ๋ถ„ํฌ ์ถ”์ • ๋Šฅ๋ ฅ์ด ๋งค์šฐ ๊ฐ•๋ ฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž ์žฌ๋ณ€์ˆ˜์— ์˜์กดํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋ธ๋ง ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‘˜์งธ, ๋ฌธ๋งฅ์— ์˜์กดํ•˜๋Š” conditional VAE ๊ตฌ์กฐ๋Š” ๋Œ€ํ™” ๋ฌธ๋งฅ์ด ์™„์ „ํ•˜๊ฒŒ ์ฃผ์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์Œ ๋ฐœํ™”๋ฅผ ๊ฑฐ์˜ ๊ฒฐ์ •๋ก ์ ์œผ๋กœ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋”ฐ๋ผ์„œ ๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง์€ ์‰ฝ๊ฒŒ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๊ณผ์ ํ•ฉ (overfit) ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•˜๋ฉฐ ์šฐ๋ฆฌ๋Š” Variational Hierarchical Conversation RNNs (VHCR) ์ด๋ผ๋Š” ๊ณ„์ธต์  ๋ชจ๋ธ์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๋ชจ๋ธ์€ 1) ์ž ์žฌ๋ณ€์ˆ˜์˜ ๊ณ„์ธต์  ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ, 2) utterance drop regularization ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์˜ 2๊ฐ€์ง€ ์ค‘์š”ํ•œ ์•„์ด๋””์–ด๋ฅผ ํ™œ์šฉํ•œ๋‹ค. Cornel Move Dialog ์™€ Ubuntu Dialog Corpus ์˜ 2๊ฐ€์ง€ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์šฐ๋ฆฌ๋Š” ์‹คํ—˜์ ์œผ๋กœ ์ด ๋ชจ๋ธ์ด ๊ธฐ์กด์˜ state-of-the-art ์„ฑ๋Šฅ์„ ๊ฐฑ์‹ ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ๋˜ํ•œ, ๊ณ„์ธต์ ์ธ ์ž ์žฌ๋ณ€์ˆ˜ ๊ตฌ์กฐ๋Š” ๋Œ€ํ™” ๋‚ด์˜ ๋ฐœํ™” ๋‚ด์šฉ์˜ ์ œ์–ด๋ฅผ ์ƒˆ๋กœ์šด ์ธก๋ฉด์—์„œ ๊ฐ€๋Šฅ์ผ€ ํ•œ๋‹ค.Variational autoencoders (VAE) combined with hierarchical RNNs have emerged as a powerful framework for conversation modeling. However, they suffer from the notorious degeneration problem, where the RNN decoders learn to ignore latent variables and reduce to vanilla RNNs. We empirically show that this degeneracy occurs mostly due to two reasons. First, the expressive power of hierarchical RNN decoders is often high enough to model the data using only its decoding distributions without relying on the role of latent variables to capture variability of data. Second, the context-conditional VAE structure whose utterance generation process is conditioned on the current context of conversation, deprives training targets of variabilitythat is, target utterances in the training corpus can be deterministically deduced from the context, making the RNN decoders prone to overfitting given their expressive power. To solve the degeneration problem, we propose a novel hierarchical model named Variational Hierarchical Conversation RNNs (VHCR), involving two key ideas of (1) using a hierarchical structure of latent variables, and (2) exploiting an utterance drop for regularization of hierarchical RNNs. With evaluations on two datasets of Cornell Movie Dialog and Ubuntu Dialog Corpus, we show that our VHCR successfully utilizes latent variables and outperforms state-of-the-art models for conversation generation. Moreover, it can perform several new utterance control tasks, thanks to its hierarchical latent structure.Abstract i Contents iii List of Figures v List of Tables vii Chapter 1 Introduction 1 Chapter 2 Related Works 5 2.1 Conversation Modeling . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Degeneracy of Variational Autoencoders . . . . . . . . . . . . . . 6 Chapter 3 Approach 7 3.1 Preliminary: Variational Autoencoder . . . . . . . . . . . . . . . 7 3.2 VHRED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
8 3.3 The Degeneration Problem . . . . . . . . . . . . . . . . . . . . . 10 3.4 Empirical Observation on Degeneracy . . . . . . . . . . . . . . . 12 3.5 Variational Hierarchical Conversation RNN (VHCR) . . . . . . . 14 3.6 Effectiveness of Hierarchical Latent Structure . . . . . . . . . . . 17 Chapter 4 Results 19 4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1.3 Performance Measures . . . . . . . . . . . . . . . . . . . . 20 4.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . 20 4.1.5 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Results of Negative Log-likelihood . . . . . . . . . . . . . . . . . 21 4.3 Results of Embedding-Based Metrics . . . . . . . . . . . . . . . . 23 4.4 Results of Human Evaluation . . . . . . . . . . . . . . . . . . . . 25 4.5 Qualitative Analyses . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.5.1 Comparison of Predicted Responses . . . . . . . . . . . . 26 4.5.2 Interpolation on Conversation Latent Variable . . . . . . 26 4.5.3 Generation with Fixed Conversation Latent Variable . . . 27 Chapter 5 Conclusion 28 ์š”์•ฝ 32 Acknowledgements 33Maste
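    The "utterance drop" regularization mentioned in the abstract can be sketched as follows: during training, each encoded utterance in the context is replaced with a shared generic vector with some probability, so the context RNN cannot reconstruct the next utterance deterministically and the latent variables remain useful. The module below is a minimal sketch under that reading; the names, shapes, and drop probability are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of utterance-drop regularization: with probability p,
# an encoded utterance is replaced by a shared learnable "generic" vector
# during training, weakening the fully observed context. Shapes and p are assumptions.
import torch
import torch.nn as nn

class UtteranceDrop(nn.Module):
    def __init__(self, hidden_dim, p=0.25):
        super().__init__()
        self.p = p
        self.generic = nn.Parameter(torch.zeros(hidden_dim))  # shared generic utterance

    def forward(self, utt_encodings):                # (batch, n_utts, hidden_dim)
        if not self.training:
            return utt_encodings
        mask = (torch.rand(utt_encodings.shape[:2],
                           device=utt_encodings.device) < self.p).unsqueeze(-1)
        return torch.where(mask, self.generic.expand_as(utt_encodings), utt_encodings)
```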

    Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

    Full text link
    Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora across language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to encoded tokens tending to occupy only a small subspace of the full space available to the MNMT model. To address these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective both for long-tail languages and for high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios. Comment: Accepted to Findings of EMNLP 2023; statistical significance tests added. Code available at https://github.com/lavine-lmu/Bi-AC
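    The bidirectional contrastive learning module is not specified in detail here; the snippet below is only a hedged sketch of a symmetric InfoNCE-style loss over paired source/target sentence representations, conveying the general idea of pulling aligned pairs together in both directions while pushing apart in-batch negatives. The temperature and batch-negative scheme are illustrative assumptions, not the authors' exact objective.

```python
# Sketch of a symmetric (bidirectional) InfoNCE-style contrastive loss over
# aligned sentence-pair representations; temperature and negatives are assumptions.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(src_repr, tgt_repr, temperature=0.07):
    """src_repr, tgt_repr: (batch, dim) representations of aligned pairs."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature             # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # pull each aligned pair together in both directions, repel in-batch negatives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```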

    IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

    Full text link
    Fine-tuning pre-trained language models (PTLMs), such as BERT and its improved variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks, with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, is relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues: high variance in their standard deviations, and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues, learning more isotropic representations during fine-tuning by dynamically penalizing dominating principal components. This simple yet effective fine-tuning method yields an absolute improvement of about 1.0 point on average across seven NLU tasks. Comment: AAAI 202
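    The two issues the authors visualize, high variance among the per-dimension standard deviations of [CLS] embeddings and high correlation between dimensions, are easy to check numerically. The sketch below computes both diagnostics and then applies a simple whitening-style rescaling to dampen dominating directions; the rescaling only illustrates the idea behind IsoBN and is not the paper's exact normalization.

```python
# Diagnostics for the two isotropy issues noted above, plus an illustrative
# whitening-style rescaling. This is not the exact IsoBN update.
import torch

def isotropy_report(cls_embeddings):                 # (n_examples, dim)
    std = cls_embeddings.std(dim=0)
    corr = torch.corrcoef(cls_embeddings.t())        # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    print(f"std of per-dimension stds: {std.std():.4f}")
    print(f"mean |off-diagonal correlation|: {off_diag.abs().mean():.4f}")

def dampen_dominant_directions(cls_embeddings, eps=1e-5):
    centered = cls_embeddings - cls_embeddings.mean(dim=0)
    cov = centered.t() @ centered / (centered.size(0) - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    scale = 1.0 / torch.sqrt(eigvals.clamp_min(eps)) # shrink high-variance directions
    return ((centered @ eigvecs) * scale) @ eigvecs.t()
```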

    Is Anisotropy Inherent to Transformers?

    Full text link
    The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations that makes them unexpectedly close to each other in terms of angular distance (cosine similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformer-based models. Comment: ACL-SRW 2023 (Poster
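    Anisotropy in this sense is usually estimated as the expected cosine similarity between hidden representations of randomly sampled token pairs; values far above zero indicate that representations occupy a narrow cone. A minimal sketch of that estimate, with an illustrative sample size:

```python
# Estimate anisotropy as the mean cosine similarity of randomly drawn pairs
# of hidden representations; the number of sampled pairs is illustrative.
import torch
import torch.nn.functional as F

def estimated_anisotropy(hidden_states, n_pairs=10000, seed=0):
    """hidden_states: (n_tokens, dim) hidden vectors pooled over a corpus."""
    g = torch.Generator().manual_seed(seed)
    n = hidden_states.size(0)
    i = torch.randint(n, (n_pairs,), generator=g)
    j = torch.randint(n, (n_pairs,), generator=g)
    keep = i != j                                    # drop self-pairs
    cos = F.cosine_similarity(hidden_states[i[keep]], hidden_states[j[keep]], dim=-1)
    return cos.mean().item()
```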

    InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

    Full text link
    Over recent decades, significant advancements in cross-modal retrieval have been driven mainly by breakthroughs in visual and linguistic modeling. However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hinders retrieval performance due to the inseparability of these representations. In our study, we first empirically validate the presence of the representation degeneration problem across multiple cross-modal benchmarks and methods. Next, to address it, we introduce a novel method, called InvGC, a post-processing technique inspired by graph convolution and average pooling. Specifically, InvGC defines the graph topology within the datasets and then applies graph convolution in a subtractive manner. This method effectively separates representations by increasing the distances between data points. To improve the efficiency and effectiveness of InvGC, we propose an advanced graph topology, LocalAdj, which aims only to increase the distances between each data point and its nearest neighbors. To understand why InvGC works, we present a detailed theoretical analysis, proving that the lower bound of recall will be improved after deploying InvGC. Extensive empirical results show that InvGC and InvGC w/LocalAdj significantly mitigate the representation degeneration problem, thereby enhancing retrieval performance. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval. Comment: Findings of EMNLP 202
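    A hedged sketch of the subtractive graph-convolution idea follows: each gallery representation is pushed away from the mean of its k nearest neighbors (a LocalAdj-style topology), increasing pairwise distances. The neighborhood size and step size below are illustrative assumptions, not the paper's settings.

```python
# Sketch of a subtractive graph convolution over a kNN (LocalAdj-style) topology;
# k and beta are illustrative. Requires k < number of representations.
import torch
import torch.nn.functional as F

def inverse_graph_convolution(reps, k=10, beta=0.1):
    """reps: (n, dim) gallery representations; returns more separated representations."""
    x = F.normalize(reps, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-float("inf"))                # exclude self-similarity
    _, topk_idx = sim.topk(k, dim=-1)                # k nearest neighbors per point
    neighbor_mean = x[topk_idx].mean(dim=1)          # (n, dim) average neighbor
    return x - beta * neighbor_mean                  # push away from neighborhood
```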