17,736 research outputs found

    Contrastive Search Is What You Need For Neural Text Generation

    Full text link
    Generating text with autoregressive language models (LMs) is of great importance to many natural language processing (NLP) applications. Previous solutions for this task often produce text that contains degenerative expressions or lacks semantic consistency. Recently, Su et al. introduced a new decoding method, contrastive search, based on the isotropic representation space of the language model and obtained new state-of-the-art results on various benchmarks. Additionally, Su et al. argued that the representations of autoregressive LMs (e.g. GPT-2) are intrinsically anisotropic, a conclusion also shared by previous studies. Therefore, to ensure the language model follows an isotropic distribution, Su et al. proposed a contrastive learning scheme, SimCTG, which calibrates the language model's representations through additional training. In this study, we first answer the question: "Are autoregressive LMs really anisotropic?" To this end, we extensively evaluate the isotropy of LMs across 16 major languages. Surprisingly, we find that the anisotropy problem only exists in the two specific English GPT-2-small/medium models, while all other evaluated LMs are naturally isotropic, in contrast to the conclusion drawn by previous studies. Based on our findings, we further assess the contrastive search decoding method using off-the-shelf LMs on four generation tasks across 16 languages. Our experimental results demonstrate that contrastive search significantly outperforms previous decoding methods without any additional training. More notably, on 12 out of the 16 evaluated languages, contrastive search performs comparably with human-level performance as judged by human evaluations. Our code and other related resources are publicly available at https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need. Comment: TMLR'2
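    As a rough illustration of the decoding rule this abstract refers to, the sketch below implements one contrastive search step with an off-the-shelf GPT-2 model: each of the top-k candidate tokens is scored by its model confidence minus a degeneration penalty, namely the maximum cosine similarity between its hidden state and those of the preceding context. The hyperparameters (k=4, alpha=0.6) and the prompt are illustrative choices, not values prescribed by the paper.

```python
# Minimal sketch of one contrastive search decoding step, assuming a
# HuggingFace-style causal LM (GPT-2 here); k, alpha, and the prompt are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def contrastive_step(input_ids, k=4, alpha=0.6):
    """Pick the next token by balancing model confidence against a
    degeneration penalty (max cosine similarity to the context tokens)."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        probs = torch.softmax(out.logits[0, -1], dim=-1)
        top_p, top_ids = probs.topk(k)
        context_h = out.hidden_states[-1][0]                       # (seq_len, dim)
        context_h = torch.nn.functional.normalize(context_h, dim=-1)
        scores = []
        for p, tok in zip(top_p, top_ids):
            ext = torch.cat([input_ids, tok.view(1, 1)], dim=-1)
            h = model(ext, output_hidden_states=True).hidden_states[-1][0, -1]
            h = torch.nn.functional.normalize(h, dim=-1)
            penalty = (context_h @ h).max()                        # similarity to context
            scores.append((1 - alpha) * p - alpha * penalty)
        best = top_ids[torch.stack(scores).argmax()]
    return torch.cat([input_ids, best.view(1, 1)], dim=-1)

ids = tokenizer("DeepMind Company is", return_tensors="pt").input_ids
for _ in range(32):
    ids = contrastive_step(ids)
print(tokenizer.decode(ids[0]))
```

    Recent versions of the HuggingFace transformers library expose the same strategy directly via model.generate(..., penalty_alpha=0.6, top_k=4).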

    Variational Learning for A Hierarchical Model of Conversations

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ๊น€๊ฑดํฌ.๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง๊ณผ (hierarchical RNNS) ๊ฒฐํ•ฉ๋œ Variational autoencoders (VAE) ๋Š” ๋Œ€ํ™” ๋ชจ๋ธ๋ง์„ ์œ„ํ•œ ๊ฐ•๋ ฅํ•œ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ๊ณตํ•œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์€ ์ž ์žฌ๋ณ€์ˆ˜ (latent variable)์„ ๋ฌด์‹œํ•˜๋„๋ก ํ•™์Šต๋˜๋Š” degeneration ๋ฌธ์ œ๋ฅผ ๊ฒช๋Š”๋‹ค. ์šฐ๋ฆฌ๋Š” ์‹คํ—˜์ ์œผ๋กœ ์ด ๋ฌธ์ œ์— ํฌ๊ฒŒ 2๊ฐ€์ง€ ์›์ธ์ด ์žˆ๋Š” ๊ฒƒ์„ ๋ฐํžŒ๋‹ค. ์ฒซ์งธ, ๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง์˜ ์ž๊ธฐํšŒ๊ท€์  (autoregressive) ๋ถ„ํฌ ์ถ”์ • ๋Šฅ๋ ฅ์ด ๋งค์šฐ ๊ฐ•๋ ฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž ์žฌ๋ณ€์ˆ˜์— ์˜์กดํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋ธ๋ง ํ•  ์ˆ˜ ์žˆ๋‹ค. ๋‘˜์งธ, ๋ฌธ๋งฅ์— ์˜์กดํ•˜๋Š” conditional VAE ๊ตฌ์กฐ๋Š” ๋Œ€ํ™” ๋ฌธ๋งฅ์ด ์™„์ „ํ•˜๊ฒŒ ์ฃผ์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์Œ ๋ฐœํ™”๋ฅผ ๊ฑฐ์˜ ๊ฒฐ์ •๋ก ์ ์œผ๋กœ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋”ฐ๋ผ์„œ ๊ณ„์ธต์  ํšŒ๊ท€์‹ ๊ฒฝ๋ง์€ ์‰ฝ๊ฒŒ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๊ณผ์ ํ•ฉ (overfit) ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•˜๋ฉฐ ์šฐ๋ฆฌ๋Š” Variational Hierarchical Conversation RNNs (VHCR) ์ด๋ผ๋Š” ๊ณ„์ธต์  ๋ชจ๋ธ์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๋ชจ๋ธ์€ 1) ์ž ์žฌ๋ณ€์ˆ˜์˜ ๊ณ„์ธต์  ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ, 2) utterance drop regularization ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์˜ 2๊ฐ€์ง€ ์ค‘์š”ํ•œ ์•„์ด๋””์–ด๋ฅผ ํ™œ์šฉํ•œ๋‹ค. Cornel Move Dialog ์™€ Ubuntu Dialog Corpus ์˜ 2๊ฐ€์ง€ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์šฐ๋ฆฌ๋Š” ์‹คํ—˜์ ์œผ๋กœ ์ด ๋ชจ๋ธ์ด ๊ธฐ์กด์˜ state-of-the-art ์„ฑ๋Šฅ์„ ๊ฐฑ์‹ ํ•˜๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ๋˜ํ•œ, ๊ณ„์ธต์ ์ธ ์ž ์žฌ๋ณ€์ˆ˜ ๊ตฌ์กฐ๋Š” ๋Œ€ํ™” ๋‚ด์˜ ๋ฐœํ™” ๋‚ด์šฉ์˜ ์ œ์–ด๋ฅผ ์ƒˆ๋กœ์šด ์ธก๋ฉด์—์„œ ๊ฐ€๋Šฅ์ผ€ ํ•œ๋‹ค.Variational autoencoders (VAE) combined with hierarchical RNNs have emerged as a powerful framework for conversation modeling. However, they suffer from the notorious degeneration problem, where the RNN decoders learn to ignore latent variables and reduce to vanilla RNNs. We empirically show that this degeneracy occurs mostly due to two reasons. First, the expressive power of hierarchical RNN decoders is often high enough to model the data using only its decoding distributions without relying on the role of latent variables to capture variability of data. Second, the context-conditional VAE structure whose utterance generation process is conditioned on the current context of conversation, deprives training targets of variabilitythat is, target utterances in the training corpus can be deterministically deduced from the context, making the RNN decoders prone to overfitting given their expressive power. To solve the degeneration problem, we propose a novel hierarchical model named Variational Hierarchical Conversation RNNs (VHCR), involving two key ideas of (1) using a hierarchical structure of latent variables, and (2) exploiting an utterance drop for regularization of hierarchical RNNs. With evaluations on two datasets of Cornell Movie Dialog and Ubuntu Dialog Corpus, we show that our VHCR successfully utilizes latent variables and outperforms state-of-the-art models for conversation generation. Moreover, it can perform several new utterance control tasks, thanks to its hierarchical latent structure.Abstract i Contents iii List of Figures v List of Tables vii Chapter 1 Introduction 1 Chapter 2 Related Works 5 2.1 Conversation Modeling . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Degeneracy of Variational Autoencoders . . . . . . . . . . . . . . 6 Chapter 3 Approach 7 3.1 Preliminary: Variational Autoencoder . . . . . . . . . . . . . . . 7 3.2 VHRED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
8 3.3 The Degeneration Problem . . . . . . . . . . . . . . . . . . . . . 10 3.4 Empirical Observation on Degeneracy . . . . . . . . . . . . . . . 12 3.5 Variational Hierarchical Conversation RNN (VHCR) . . . . . . . 14 3.6 Effectiveness of Hierarchical Latent Structure . . . . . . . . . . . 17 Chapter 4 Results 19 4.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1.3 Performance Measures . . . . . . . . . . . . . . . . . . . . 20 4.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . 20 4.1.5 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Results of Negative Log-likelihood . . . . . . . . . . . . . . . . . 21 4.3 Results of Embedding-Based Metrics . . . . . . . . . . . . . . . . 23 4.4 Results of Human Evaluation . . . . . . . . . . . . . . . . . . . . 25 4.5 Qualitative Analyses . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.5.1 Comparison of Predicted Responses . . . . . . . . . . . . 26 4.5.2 Interpolation on Conversation Latent Variable . . . . . . 26 4.5.3 Generation with Fixed Conversation Latent Variable . . . 27 Chapter 5 Conclusion 28 ์š”์•ฝ 32 Acknowledgements 33Maste
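    The "utterance drop" regularization mentioned in the abstract can be sketched as follows: during training, each encoded utterance in the context is replaced with a shared generic vector with some probability, so the context RNN cannot reconstruct the next utterance deterministically and the latent variables remain useful. The module below is a minimal sketch under that reading; the names, shapes, and drop probability are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of utterance-drop regularization: with probability p,
# an encoded utterance is replaced by a shared learnable "generic" vector
# during training, weakening the fully observed context. Shapes and p are assumptions.
import torch
import torch.nn as nn

class UtteranceDrop(nn.Module):
    def __init__(self, hidden_dim, p=0.25):
        super().__init__()
        self.p = p
        self.generic = nn.Parameter(torch.zeros(hidden_dim))  # shared generic utterance

    def forward(self, utt_encodings):                # (batch, n_utts, hidden_dim)
        if not self.training:
            return utt_encodings
        mask = (torch.rand(utt_encodings.shape[:2],
                           device=utt_encodings.device) < self.p).unsqueeze(-1)
        return torch.where(mask, self.generic.expand_as(utt_encodings), utt_encodings)
```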

    Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

    Full text link
    Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora across language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to encoded tokens tending to occupy only a small subspace of the full space available to the MNMT model. To address these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective both for long-tail languages and for high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios. Comment: Accepted to Findings of EMNLP 2023; statistical significance tests added. Code available at https://github.com/lavine-lmu/Bi-AC
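    The bidirectional contrastive learning module is not specified in detail here; the snippet below is only a hedged sketch of a symmetric InfoNCE-style loss over paired source/target sentence representations, conveying the general idea of pulling aligned pairs together in both directions while pushing apart in-batch negatives. The temperature and batch-negative scheme are illustrative assumptions, not the authors' exact objective.

```python
# Sketch of a symmetric (bidirectional) InfoNCE-style contrastive loss over
# aligned sentence-pair representations; temperature and negatives are assumptions.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(src_repr, tgt_repr, temperature=0.07):
    """src_repr, tgt_repr: (batch, dim) representations of aligned pairs."""
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature             # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    # pull each aligned pair together in both directions, repel in-batch negatives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```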

    IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization

    Full text link
    Fine-tuning pre-trained language models (PTLMs), such as BERT and its improved variant RoBERTa, has been a common practice for advancing performance in natural language understanding (NLU) tasks. Recent advances in representation learning show that isotropic (i.e., unit-variance and uncorrelated) embeddings can significantly improve performance on downstream tasks, with faster convergence and better generalization. The isotropy of the pre-trained embeddings in PTLMs, however, is relatively under-explored. In this paper, we analyze the isotropy of the pre-trained [CLS] embeddings of PTLMs with straightforward visualization, and point out two major issues: high variance in their standard deviations, and high correlation between different dimensions. We also propose a new network regularization method, isotropic batch normalization (IsoBN), to address these issues, learning more isotropic representations during fine-tuning by dynamically penalizing dominating principal components. This simple yet effective fine-tuning method yields an absolute improvement of about 1.0 point on average across seven NLU tasks. Comment: AAAI 202
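    The two issues the authors visualize, high variance among the per-dimension standard deviations of [CLS] embeddings and high correlation between dimensions, are easy to check numerically. The sketch below computes both diagnostics and then applies a simple whitening-style rescaling to dampen dominating directions; the rescaling only illustrates the idea behind IsoBN and is not the paper's exact normalization.

```python
# Diagnostics for the two isotropy issues noted above, plus an illustrative
# whitening-style rescaling. This is not the exact IsoBN update.
import torch

def isotropy_report(cls_embeddings):                 # (n_examples, dim)
    std = cls_embeddings.std(dim=0)
    corr = torch.corrcoef(cls_embeddings.t())        # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diag(corr))
    print(f"std of per-dimension stds: {std.std():.4f}")
    print(f"mean |off-diagonal correlation|: {off_diag.abs().mean():.4f}")

def dampen_dominant_directions(cls_embeddings, eps=1e-5):
    centered = cls_embeddings - cls_embeddings.mean(dim=0)
    cov = centered.t() @ centered / (centered.size(0) - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    scale = 1.0 / torch.sqrt(eigvals.clamp_min(eps)) # shrink high-variance directions
    return ((centered @ eigvecs) * scale) @ eigvecs.t()
```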

    Is Anisotropy Inherent to Transformers?

    Full text link
    The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations that makes them unexpectedly close to each other in terms of angular distance (cosine similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformer-based models. Comment: ACL-SRW 2023 (Poster
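    Anisotropy in this sense is usually estimated as the expected cosine similarity between hidden representations of randomly sampled token pairs; values far above zero indicate that representations occupy a narrow cone. A minimal sketch of that estimate, with an illustrative sample size:

```python
# Estimate anisotropy as the mean cosine similarity of randomly drawn pairs
# of hidden representations; the number of sampled pairs is illustrative.
import torch
import torch.nn.functional as F

def estimated_anisotropy(hidden_states, n_pairs=10000, seed=0):
    """hidden_states: (n_tokens, dim) hidden vectors pooled over a corpus."""
    g = torch.Generator().manual_seed(seed)
    n = hidden_states.size(0)
    i = torch.randint(n, (n_pairs,), generator=g)
    j = torch.randint(n, (n_pairs,), generator=g)
    keep = i != j                                    # drop self-pairs
    cos = F.cosine_similarity(hidden_states[i[keep]], hidden_states[j[keep]], dim=-1)
    return cos.mean().item()
```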

    InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

    Full text link
    Over recent decades, significant advancements in cross-modal retrieval have been driven mainly by breakthroughs in visual and linguistic modeling. However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hinders retrieval performance due to the inseparability of these representations. In our study, we first empirically validate the presence of the representation degeneration problem across multiple cross-modal benchmarks and methods. Next, to address it, we introduce a novel method, called InvGC, a post-processing technique inspired by graph convolution and average pooling. Specifically, InvGC defines the graph topology within the datasets and then applies graph convolution in a subtractive manner. This method effectively separates representations by increasing the distances between data points. To improve the efficiency and effectiveness of InvGC, we propose an advanced graph topology, LocalAdj, which aims only to increase the distances between each data point and its nearest neighbors. To understand why InvGC works, we present a detailed theoretical analysis, proving that the lower bound of recall will be improved after deploying InvGC. Extensive empirical results show that InvGC and InvGC w/LocalAdj significantly mitigate the representation degeneration problem, thereby enhancing retrieval performance. Our code is available at https://github.com/yimuwangcs/Better_Cross_Modal_Retrieval. Comment: Findings of EMNLP 202
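    A hedged sketch of the subtractive graph-convolution idea follows: each gallery representation is pushed away from the mean of its k nearest neighbors (a LocalAdj-style topology), increasing pairwise distances. The neighborhood size and step size below are illustrative assumptions, not the paper's settings.

```python
# Sketch of a subtractive graph convolution over a kNN (LocalAdj-style) topology;
# k and beta are illustrative. Requires k < number of representations.
import torch
import torch.nn.functional as F

def inverse_graph_convolution(reps, k=10, beta=0.1):
    """reps: (n, dim) gallery representations; returns more separated representations."""
    x = F.normalize(reps, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-float("inf"))                # exclude self-similarity
    _, topk_idx = sim.topk(k, dim=-1)                # k nearest neighbors per point
    neighbor_mean = x[topk_idx].mean(dim=1)          # (n, dim) average neighbor
    return x - beta * neighbor_mean                  # push away from neighborhood
```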