826 research outputs found

    Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods

    Full text link
    Machine generated text is increasingly difficult to distinguish from human authored text. Powerful open-source models are freely available, and user-friendly tools that democratize access to generative models are proliferating. ChatGPT, which was released shortly after the first preprint of this survey, epitomizes these trends. The great potential of state-of-the-art natural language generation (NLG) systems is tempered by the multitude of avenues for abuse. Detection of machine generated text is a key countermeasure for reducing abuse of NLG models, with significant technical challenges and numerous open problems. We provide a survey that includes both 1) an extensive analysis of threat models posed by contemporary NLG systems, and 2) the most complete review of machine generated text detection methods to date. This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability. Comment: Manuscript submitted to ACM Special Session on Trustworthy AI. 2022/11/19 - Updated reference
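    As a purely illustrative companion to the detection methods such a survey reviews, one common detection family frames the task as binary text classification over human-written versus machine-generated examples. The tiny corpus, labels, and probe sentence below are placeholders, not data from the survey.

```python
# Minimal sketch of a supervised machine-generated-text detector:
# a text classifier over examples labelled human-written vs. machine-generated.
# All texts and labels here are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The committee met on Tuesday to review next year's budget.",            # toy "human" example
    "In conclusion, the aforementioned considerations underscore synergy.",  # toy "machine" example
]
labels = [0, 1]  # 0 = human-written, 1 = machine-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Probability that a new text is machine-generated, under this toy model.
print(detector.predict_proba(["This paragraph was produced by a language model."]))
```

    Detectors in the literature range from such feature-based classifiers to fine-tuned transformers and statistical tests on token likelihoods; the sketch only fixes the input/output contract.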

    Improving both domain robustness and domain adaptability in machine translation

    Full text link
    We address two problems of domain adaptation in neural machine translation. First, we want to reach domain robustness, i.e., good translation quality both on domains seen in the training data and on domains unseen in the training data. Second, we want our systems to be adaptable, i.e., easy to fine-tune with just hundreds of in-domain parallel sentences. In this paper, we introduce a novel combination of two previous approaches, word adaptive modelling, which addresses domain robustness, and meta-learning, which addresses domain adaptability, and we present empirical results showing that our new combination improves both of these properties.
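    The meta-learning ingredient can be illustrated with a first-order (Reptile-style) sketch: repeatedly adapt a copy of the model to a sampled domain for a few steps, then move the meta-parameters toward the adapted weights, so that later fine-tuning on a few hundred in-domain sentence pairs starts from a well-placed initialization. The toy linear model and random "parallel data" below are stand-ins, not the authors' NMT setup.

```python
# Hedged sketch of first-order meta-learning (Reptile-style) over domains.
# A toy linear model and random tensors stand in for an NMT system and
# in-domain parallel data; this is not the paper's implementation.
import copy
import torch
import torch.nn as nn

model = nn.Linear(16, 16)            # placeholder for an NMT model
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

def domain_batch():
    """Stand-in for a small batch of in-domain parallel sentences."""
    x = torch.randn(32, 16)
    return x, x.roll(1, dims=1)      # fake source/target pair

for outer_step in range(100):
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    for _ in range(inner_steps):     # adapt to one sampled "domain"
        x, y = domain_batch()
        loss = nn.functional.mse_loss(task_model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Reptile update: nudge meta-parameters toward the adapted parameters.
    with torch.no_grad():
        for p, q in zip(model.parameters(), task_model.parameters()):
            p += meta_lr * (q - p)
```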

    Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective

    Full text link
    Two interlocking research questions of growing interest and importance in privacy research are Authorship Attribution (AA) and Authorship Obfuscation (AO). Given an artifact, especially a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors, while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship and its accompanying privacy concerns have applied only to human authors. However, in recent years, due to the explosive advancements in Neural Text Generation (NTG) techniques in NLP, which can synthesize human-quality open-ended texts (so-called "neural texts"), one now has to consider authorship by humans, machines, or their combination. Due to the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and to develop novel AA/AO solutions for dealing with neural texts. In this survey, therefore, we make a comprehensive review of recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on their limitations and promising research directions. Comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 202
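    For context, a traditional AA baseline of the kind this survey contrasts with neural-text settings can be as simple as character n-gram features plus a linear classifier over candidate authors; the documents and author labels below are toy stand-ins, not material from the survey.

```python
# Illustrative traditional Authorship Attribution baseline:
# character n-gram TF-IDF features with a linear classifier.
# Documents and author labels are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

documents = ["We hold these truths to be self-evident.",
             "It was the best of times, it was the worst of times.",
             "Call me Ishmael.",
             "A tale of two cities begins in the year 1775."]
authors = ["author_a", "author_b", "author_c", "author_b"]

attributor = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
attributor.fit(documents, authors)
print(attributor.predict(["It was an age of wisdom, it was an age of foolishness."]))
```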

    Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

    Full text link
    Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. On ANLI, we see accuracy gains on the current set of attacks (44.1% → 50.1%) and on two future unseen rounds of human-generated attacks (32.5% → 43.4% and 29.4% → 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 → 0.84) and a future round (0.77 → 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.
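    A minimal sketch of the training recipe described above, assuming a hypothetical attack generator fitted to the observed human adversarial examples: mix the original training data, the observed human attacks, and synthetic attacks imitating them, then continue training the classifier on the mixture.

```python
# Hedged sketch of adversarial data augmentation: the generator and the
# downstream classifier are hypothetical stand-ins, not the paper's models.
import random

def imitate_human_attack(example, generator):
    """Hypothetical: perturb/paraphrase an observed human attack at scale."""
    return generator(example)

def build_training_mix(original, human_attacks, generator, n_synthetic=1000):
    """Return original data + observed human attacks + synthetic imitations."""
    synthetic = [imitate_human_attack(random.choice(human_attacks), generator)
                 for _ in range(n_synthetic)]
    return original + human_attacks + synthetic  # continue training on this mix

if __name__ == "__main__":
    original = ["premise/hypothesis pair 1", "premise/hypothesis pair 2"]
    human_attacks = ["tricky human-written attack"]
    noisy_copy = lambda s: s + " (perturbed)"    # toy generator
    print(build_training_mix(original, human_attacks, noisy_copy, n_synthetic=3))
```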

    Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness

    Full text link
    Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model may drop dramatically under attack. Recent work argues that the adversarial vulnerability of a model is caused by the non-robust features learned in supervised training. Thus, in this paper, we tackle the adversarial robustness challenge from the view of disentangled representation learning, which is able to explicitly disentangle robust and non-robust features in text. Specifically, inspired by the variation of information (VI) in information theory, we derive a disentangled learning objective composed of mutual information terms that capture both the semantic representativeness of the latent embeddings and the differentiation of robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.
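    For readers unfamiliar with the variation of information, a generic VI identity and a disentanglement-style objective of the kind sketched in this abstract can be written as follows; the notation (x for the input, z_r and z_n for robust and non-robust features, λ for a trade-off weight) is illustrative and not necessarily the paper's exact formulation.

```latex
% Requires amsmath. Generic identities, not the paper's exact objective.
\begin{align}
  \mathrm{VI}(X;Y) &= H(X \mid Y) + H(Y \mid X) = H(X) + H(Y) - 2\,I(X;Y) \\
  \mathcal{L} &= -\,I(x; z_r) \;-\; I(x; z_n) \;+\; \lambda\, I(z_r; z_n)
\end{align}
```

    Minimizing such a loss encourages both latent codes to remain informative about the input (semantic representativeness) while penalizing shared information between the robust and non-robust codes (differentiation).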

    A Study on Diagnosing and Mitigating False Positive Bias in Hate Speech Classification Models

    Get PDF
    Master's thesis -- Seoul National University Graduate School : Department of Linguistics, College of Humanities, February 2022. Advisor: Hyopil Shin. As the damage caused by hate speech in anonymous online spaces has grown significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown strong performance, but they tend to fail to generalize to out-of-domain data. I focus on the problem of False Positive detection and build adversarial test sets from three different domains to diagnose this issue. I illustrate that a BERT-based classification model trained on an existing Korean hate speech corpus exhibits False Positives due to over-sensitivity to specific words that are highly correlated with hate speech in the training datasets. Next, I present two approaches to address the problem: a data-centric approach that adds data to correct the imbalance of the training datasets, and a model-centric approach that regularizes the model using post-hoc explanations. Both methods reduce False Positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to the test set can be a cost-efficient way of greatly reducing false positives. Using Sampling and Occlusion (Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model better utilize contextual information.
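    The false-positive diagnostic lends itself to a short sketch: score a set of neutral sentences that contain keywords correlated with hate speech in the training data, and report how often the classifier flags them. The classifier and sentences below are hypothetical stand-ins for the thesis's BERT model and adversarial test sets.

```python
# Hedged sketch of measuring false positives on keyword-bearing neutral text.
# The classifier and the sentences are hypothetical stand-ins.
def false_positive_rate(classify, neutral_sentences):
    """classify(text) -> 1 if predicted hateful, else 0; all inputs are neutral."""
    flags = [classify(s) for s in neutral_sentences]
    return sum(flags) / len(flags)

if __name__ == "__main__":
    keyword_only = lambda s: int("KEYWORD" in s)   # toy over-sensitive model
    neutral = ["KEYWORD used in a neutral news headline",
               "an ordinary sentence without trigger terms"]
    print(f"False positive rate: {false_positive_rate(keyword_only, neutral):.2f}")
```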

    A Survey on Out-of-Distribution Evaluation of Neural NLP Models

    Full text link
    Adversarial robustness, domain generalization and dataset biases are three active lines of research contributing to out-of-distribution (OOD) evaluation of neural NLP models. However, a comprehensive, integrated discussion of these three research lines is still lacking in the literature. In this survey, we 1) compare the three lines of research under a unifying definition; 2) summarize the data-generating processes and evaluation protocols for each line of research; and 3) emphasize the challenges and opportunities for future work.