826 research outputs found

    Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods

    Full text link
    Machine generated text is increasingly difficult to distinguish from human authored text. Powerful open-source models are freely available, and user-friendly tools that democratize access to generative models are proliferating. ChatGPT, which was released shortly after the first preprint of this survey, epitomizes these trends. The great potential of state-of-the-art natural language generation (NLG) systems is tempered by the multitude of avenues for abuse. Detection of machine generated text is a key countermeasure for reducing abuse of NLG models, with significant technical challenges and numerous open problems. We provide a survey that includes both 1) an extensive analysis of threat models posed by contemporary NLG systems, and 2) the most complete review of machine generated text detection methods to date. This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability. Comment: Manuscript submitted to ACM Special Session on Trustworthy AI. 2022/11/19 - Updated reference
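    As a purely illustrative companion to the detection methods such a survey reviews, one common detection family frames the task as binary text classification over human-written versus machine-generated examples. The tiny corpus, labels, and probe sentence below are placeholders, not data from the survey.

```python
# Minimal sketch of a supervised machine-generated-text detector:
# a text classifier over examples labelled human-written vs. machine-generated.
# All texts and labels here are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The committee met on Tuesday to review next year's budget.",            # toy "human" example
    "In conclusion, the aforementioned considerations underscore synergy.",  # toy "machine" example
]
labels = [0, 1]  # 0 = human-written, 1 = machine-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Probability that a new text is machine-generated, under this toy model.
print(detector.predict_proba(["This paragraph was produced by a language model."]))
```

    Detectors in the literature range from such feature-based classifiers to fine-tuned transformers and statistical tests on token likelihoods; the sketch only fixes the input/output contract.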

    Improving both domain robustness and domain adaptability in machine translation

    Full text link
    We address two problems of domain adaptation in neural machine translation. First, we want to reach domain robustness, i.e., good translation quality both on domains seen in the training data and on domains unseen in the training data. Second, we want our systems to be adaptable, i.e., easy to fine-tune with just hundreds of in-domain parallel sentences. In this paper, we introduce a novel combination of two previous approaches, word adaptive modelling, which addresses domain robustness, and meta-learning, which addresses domain adaptability, and we present empirical results showing that our new combination improves both of these properties.
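    The meta-learning ingredient can be illustrated with a first-order (Reptile-style) sketch: repeatedly adapt a copy of the model to a sampled domain for a few steps, then move the meta-parameters toward the adapted weights, so that later fine-tuning on a few hundred in-domain sentence pairs starts from a well-placed initialization. The toy linear model and random "parallel data" below are stand-ins, not the authors' NMT setup.

```python
# Hedged sketch of first-order meta-learning (Reptile-style) over domains.
# A toy linear model and random tensors stand in for an NMT system and
# in-domain parallel data; this is not the paper's implementation.
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)            # placeholder for an NMT model
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

def domain_batch():
    """Stand-in for a small batch of in-domain parallel sentences."""
    x = torch.randn(32, 16)
    return x, x.roll(1, dims=1)      # fake source/target pair

for outer_step in range(100):
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):     # adapt to one sampled "domain"
        x, y = domain_batch()
        loss = nn.functional.mse_loss(task_model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Reptile update: nudge meta-parameters toward the adapted parameters.
    with torch.no_grad():
        for p, q in zip(model.parameters(), task_model.parameters()):
            p += meta_lr * (q - p)
```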

    Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective

    Full text link
    Two interlocking research questions of growing interest and importance in privacy research are Authorship Attribution (AA) and Authorship Obfuscation (AO). Given an artifact, especially a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors, while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship and its accompanying privacy concerns have applied only to human authors. However, in recent years, due to the explosive advancements in Neural Text Generation (NTG) techniques in NLP, which can synthesize human-quality open-ended texts (so-called "neural texts"), one now has to consider authorship by humans, machines, or their combination. Due to the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and to develop novel AA/AO solutions for dealing with neural texts. In this survey, therefore, we make a comprehensive review of recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on their limitations and promising research directions. Comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 202
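    For context, a traditional AA baseline of the kind this survey contrasts with neural-text settings can be as simple as character n-gram features plus a linear classifier over candidate authors; the documents and author labels below are toy stand-ins, not material from the survey.

```python
# Illustrative traditional Authorship Attribution baseline:
# character n-gram TF-IDF features with a linear classifier.
# Documents and author labels are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

documents = ["We hold these truths to be self-evident.",
             "It was the best of times, it was the worst of times.",
             "Call me Ishmael.",
             "A tale of two cities begins in the year 1775."]
authors = ["author_a", "author_b", "author_c", "author_b"]

attributor = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
attributor.fit(documents, authors)
print(attributor.predict(["It was an age of wisdom, it was an age of foolishness."]))
```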

    Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

    Full text link
    Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. On ANLI, we see accuracy gains on the current set of attacks (44.1% → 50.1%) and on two future unseen rounds of human-generated attacks (32.5% → 43.4% and 29.4% → 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 → 0.84) and a future round (0.77 → 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.
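    A minimal sketch of the training recipe described above, assuming a hypothetical attack generator fitted to the observed human adversarial examples: mix the original training data, the observed human attacks, and synthetic attacks imitating them, then continue training the classifier on the mixture.

```python
# Hedged sketch of adversarial data augmentation: the generator and the
# downstream classifier are hypothetical stand-ins, not the paper's models.
import random

def imitate_human_attack(example, generator):
    """Hypothetical: perturb/paraphrase an observed human attack at scale."""
    return generator(example)

def build_training_mix(original, human_attacks, generator, n_synthetic=1000):
    """Return original data + observed human attacks + synthetic imitations."""
    synthetic = [imitate_human_attack(random.choice(human_attacks), generator)
                 for _ in range(n_synthetic)]
    return original + human_attacks + synthetic  # continue training on this mix

if __name__ == "__main__":
    original = ["premise/hypothesis pair 1", "premise/hypothesis pair 2"]
    human_attacks = ["tricky human-written attack"]
    noisy_copy = lambda s: s + " (perturbed)"    # toy generator
    print(build_training_mix(original, human_attacks, noisy_copy, n_synthetic=3))
```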

    Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness

    Full text link
    Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model may drop dramatically under attack. Recent work argues that the adversarial vulnerability of a model is caused by the non-robust features learned in supervised training. Thus, in this paper, we tackle the adversarial robustness challenge from the view of disentangled representation learning, which is able to explicitly disentangle robust and non-robust features in text. Specifically, inspired by the variation of information (VI) in information theory, we derive a disentangled learning objective composed of mutual information terms that capture both the semantic representativeness of the latent embeddings and the differentiation of robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.
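    For readers unfamiliar with the variation of information, a generic VI identity and a disentanglement-style objective of the kind sketched in this abstract can be written as follows; the notation (x for the input, z_r and z_n for robust and non-robust features, λ for a trade-off weight) is illustrative and not necessarily the paper's exact formulation.

```latex
% Requires amsmath. Generic identities, not the paper's exact objective.
\begin{align}
  \mathrm{VI}(X;Y) &= H(X \mid Y) + H(Y \mid X) = H(X) + H(Y) - 2\,I(X;Y) \\
  \mathcal{L} &= -\,I(x; z_r) \;-\; I(x; z_n) \;+\; \lambda\, I(z_r; z_n)
\end{align}
```

    Minimizing such a loss encourages both latent codes to remain informative about the input (semantic representativeness) while penalizing shared information between the robust and non-robust codes (differentiation).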

    A Study on Diagnosing and Mitigating False Positive Bias in Hate Speech Classification Models

    Get PDF
    Master's thesis -- Seoul National University Graduate School : Department of Linguistics, College of Humanities, February 2022. Advisor: Hyopil Shin. As the damage caused by hate speech in anonymous online spaces has grown significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown strong performance, but they tend to fail to generalize to out-of-domain data. I focus on the problem of False Positive detection and build adversarial test sets from three different domains to diagnose this issue. I illustrate that a BERT-based classification model trained on an existing Korean hate speech corpus exhibits False Positives due to over-sensitivity to specific words that are highly correlated with hate speech in the training datasets. Next, I present two approaches to address the problem: a data-centric approach that adds data to correct the imbalance of the training datasets, and a model-centric approach that regularizes the model using post-hoc explanations. Both methods reduce False Positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to the test set can be a cost-efficient way of greatly reducing false positives. Using Sampling and Occlusion (Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model better utilize contextual information.
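    The false-positive diagnostic lends itself to a short sketch: score a set of neutral sentences that contain keywords correlated with hate speech in the training data, and report how often the classifier flags them. The classifier and sentences below are hypothetical stand-ins for the thesis's BERT model and adversarial test sets.

```python
# Hedged sketch of measuring false positives on keyword-bearing neutral text.
# The classifier and the sentences are hypothetical stand-ins.
def false_positive_rate(classify, neutral_sentences):
    """classify(text) -> 1 if predicted hateful, else 0; all inputs are neutral."""
    flags = [classify(s) for s in neutral_sentences]
    return sum(flags) / len(flags)

if __name__ == "__main__":
    keyword_only = lambda s: int("KEYWORD" in s)   # toy over-sensitive model
    neutral = ["KEYWORD used in a neutral news headline",
               "an ordinary sentence without trigger terms"]
    print(f"False positive rate: {false_positive_rate(keyword_only, neutral):.2f}")
```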

    A Survey on Out-of-Distribution Evaluation of Neural NLP Models

    Full text link
    Adversarial robustness, domain generalization and dataset biases are three active lines of research contributing to out-of-distribution (OOD) evaluation of neural NLP models. However, a comprehensive, integrated discussion of these three research lines is still lacking in the literature. In this survey, we 1) compare the three lines of research under a unifying definition; 2) summarize the data-generating processes and evaluation protocols for each line of research; and 3) emphasize the challenges and opportunities for future work.