Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods
Machine generated text is increasingly difficult to distinguish from human
authored text. Powerful open-source models are freely available, and
user-friendly tools that democratize access to generative models are
proliferating. ChatGPT, which was released shortly after the first preprint of
this survey, epitomizes these trends. The great potential of state-of-the-art
natural language generation (NLG) systems is tempered by the multitude of
avenues for abuse. Detection of machine generated text is a key countermeasure
for reducing abuse of NLG models, with significant technical challenges and
numerous open problems. We provide a survey that includes both 1) an extensive
analysis of threat models posed by contemporary NLG systems, and 2) the most
complete review of machine generated text detection methods to date. This
survey places machine generated text within its cybersecurity and social
context, and provides strong guidance for future work addressing the most
critical threat models, and ensuring detection systems themselves demonstrate
trustworthiness through fairness, robustness, and accountability.
Comment: Manuscript submitted to ACM Special Session on Trustworthy AI. 2022/11/19 - updated reference.
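As a concrete (and deliberately simplistic) illustration of one detection family such surveys cover, likelihood-based detectors flag text that is unusually predictable under a language model. The unigram "model", probabilities, and threshold below are invented stand-ins for illustration, not the survey's method:

```python
import math

# Toy stand-in for a language model's token probabilities; a real
# detector would query an actual LM. All numbers here are invented.
UNIGRAM_P = {"the": 0.07, "model": 0.01, "banana": 0.0005}

def avg_log_prob(tokens, p=UNIGRAM_P, floor=1e-6):
    """Mean token log-probability of the text under the model."""
    return sum(math.log(p.get(t, floor)) for t in tokens) / len(tokens)

def flag_machine(tokens, threshold=-8.0):
    """Heuristic: machine text tends to be unusually predictable
    (high average log-probability) under the generating model family."""
    return avg_log_prob(tokens) > threshold
```

Real detectors of this family use far richer statistics (rank, entropy, curvature of log-probability), but the decision rule has this basic shape.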
Improving both domain robustness and domain adaptability in machine translation
We address two problems of domain adaptation in neural machine translation.
First, we want to reach domain robustness, i.e., good quality both on the domains
seen in the training data and on domains unseen in the training data. Second, we
want our systems to be adaptive, i.e., making it possible to finetune systems
with just hundreds of in-domain parallel sentences. In this paper, we introduce
a novel combination of two previous approaches, word adaptive modelling, which
addresses domain robustness, and meta-learning, which addresses domain
adaptability, and we present empirical results showing that our new combination
improves both of these properties.
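The adaptability goal (finetuning on just hundreds of in-domain sentences) is the classic meta-learning setting. A minimal Reptile-style sketch on a toy one-parameter regression model conveys the idea; the tasks, learning rates, and round counts are invented, and the paper's actual method combines meta-learning with word adaptive modelling inside an NMT system:

```python
import random

def sgd_steps(w, data, lr=0.1, steps=20):
    """Inner loop: finetune w on one domain's (x, y) pairs, model y = w*x."""
    for _ in range(steps):
        x, y = random.choice(data)
        w -= lr * 2 * (w * x - y) * x   # gradient of squared error
    return w

def reptile(tasks, meta_lr=0.5, rounds=200):
    """Outer loop: nudge the initialization toward each task's solution,
    so a few inner steps suffice to adapt to any one domain."""
    w = 0.0
    for _ in range(rounds):
        data = random.choice(tasks)     # sample a training domain
        w_task = sgd_steps(w, data)     # adapt to it
        w += meta_lr * (w_task - w)     # Reptile meta-update
    return w
```

After meta-training on domains with true slopes 2 and 3, the initialization sits between them, and a handful of in-domain examples pulls it to either one quickly.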
Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective
Two interlocking research questions of growing interest and importance in
privacy research are Authorship Attribution (AA) and Authorship Obfuscation
(AO). Given an artifact, especially a text t in question, an AA solution aims
to accurately attribute t to its true author out of many candidate authors
while an AO solution aims to modify t to hide its true authorship.
Traditionally, the notion of authorship and its accompanying privacy concern is
only toward human authors. However, in recent years, due to the explosive
advancements in Neural Text Generation (NTG) techniques in NLP, capable of
synthesizing human-quality open-ended texts (so-called "neural texts"), one has
to now consider authorships by humans, machines, or their combination. Due to
the implications and potential threats of neural texts when used maliciously,
it has become critical to understand the limitations of traditional AA/AO
solutions and develop novel AA/AO solutions in dealing with neural texts. In
this survey, therefore, we make a comprehensive review of recent literature on
the attribution and obfuscation of neural text authorship from a Data Mining
perspective, and share our view on their limitations and promising research
directions.
Comment: Accepted at ACM SIGKDD Explorations, Vol. 25, June 202
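For contrast with the neural setting, a traditional AA baseline of the kind whose limitations such work examines can be sketched as character-n-gram profile matching. The functions below are illustrative, not from the survey:

```python
from collections import Counter

def profile(text, n=3):
    """Normalized character-n-gram frequency profile of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(p, q):
    """Cosine similarity between two n-gram profiles."""
    num = sum(p[g] * q[g] for g in set(p) & set(q))
    den = (sum(v * v for v in p.values()) ** 0.5 *
           sum(v * v for v in q.values()) ** 0.5)
    return num / den if den else 0.0

def attribute(text, candidates):
    """Attribute `text` to the candidate author with the closest profile."""
    p = profile(text)
    return max(candidates, key=lambda a: similarity(p, candidates[a]))
```

Such stylometric baselines work on human prose with stable idiosyncrasies; a key question the survey raises is how far they carry over when the "author" may be a neural generator.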
Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Real-world natural language processing systems need to be robust to human
adversaries. Collecting examples of human adversaries for training is an
effective but expensive solution. On the other hand, training on synthetic
attacks with small perturbations - such as word-substitution - does not
actually improve robustness to human adversaries. In this paper, we propose an
adversarial training framework that uses limited human adversarial examples to
generate more useful adversarial examples at scale. We demonstrate the
advantages of this system on the ANLI and hate speech detection benchmark
datasets - both collected via an iterative, adversarial
human-and-model-in-the-loop procedure. Compared to training only on observed
human attacks, also training on our synthetic adversarial examples improves
model robustness to future rounds. In ANLI, we see accuracy gains on the
current set of attacks (44.1% → 50.1%) and on two future unseen rounds of
human-generated attacks (32.5% → 43.4%, and 29.4% → 40.2%). In hate
speech detection, we see AUC gains on current attacks (0.76 → 0.84) and a
future round (0.77 → 0.79). Attacks from methods that do not learn the
distribution of existing human adversaries, meanwhile, degrade robustness.
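Schematically, the training recipe described in the abstract augments the observed human attacks with synthetic imitations before training. The `imitate` perturbation below is a trivial placeholder; the paper instead learns a generator that imitates the distribution of human attacks:

```python
import random

def imitate(human_attack):
    """Hypothetical attack generator: perturb one human adversarial
    example. A trivial stand-in for a learned imitation model."""
    words = human_attack.split()
    i = random.randrange(len(words))
    words[i] = words[i].upper()          # placeholder perturbation
    return " ".join(words)

def build_training_set(clean, human_attacks, n_synthetic=2):
    """Mix clean data, observed human attacks, and synthetic attacks
    generated from each human attack (labels are carried over)."""
    data = [(x, y) for x, y in clean]
    data += [(x, y) for x, y in human_attacks]
    for x, y in human_attacks:
        data += [(imitate(x), y) for _ in range(n_synthetic)]
    return data
```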
Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness
Adversarial vulnerability remains a major obstacle to constructing reliable
NLP systems. When imperceptible perturbations are added to raw input text, the
performance of a deep learning model may drop dramatically under attacks.
Recent work argues the adversarial vulnerability of the model is caused by the
non-robust features in supervised training. Thus in this paper, we tackle the
adversarial robustness challenge from the view of disentangled representation
learning, which is able to explicitly disentangle robust and non-robust
features in text. Specifically, inspired by the variation of information (VI)
in information theory, we derive a disentangled learning objective composed of
mutual information to represent both the semantic representativeness of latent
embeddings and differentiation of robust and non-robust features. On the basis
of this, we design a disentangled learning network to estimate these mutual
information terms. Experiments on text classification and entailment tasks show
that our method significantly outperforms representative methods under
adversarial attacks, indicating that discarding non-robust features is critical
for improving adversarial robustness.
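For readers wanting the quantity involved: the variation of information between two variables, which the objective is "inspired by", is the standard identity

```latex
\mathrm{VI}(X;Y) \;=\; H(X \mid Y) + H(Y \mid X) \;=\; H(X) + H(Y) - 2\,I(X;Y).
```

One plausible shape for an objective of the kind described, sketched here rather than taken from the paper, is to maximize the mutual information tying the latent embedding to the input's semantics while driving the mutual information between the robust and non-robust parts of the representation toward zero; the paper's exact terms and weighting should be taken from the paper itself.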
A Study on Diagnosing and Improving the False Positive Bias of Hate Speech Classification Models
Thesis (Master's) -- Graduate School of Seoul National University: Department of Linguistics, College of Humanities, 2022.2.
As the damage caused by hate speech in anonymous online spaces has been growing significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown great performance, but they tend to fail to generalize on out-of-domain data. I focus on the problem of False Positive detection and build adversarial test sets of three different domains to diagnose this issue. I illustrate that a BERT-based classification model trained with an existing Korean hate speech corpus exhibits False Positives due to over-sensitivity to specific words that have high correlations with hate speech in the training datasets. Next, I present two different approaches to address the problem: a data-centric approach that adds data to correct the imbalance of training datasets and a model-centric approach that regularizes the model using post-hoc explanations. Both methods show improvement in reducing False Positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to a test set can be a cost-efficient way of greatly reducing false positives. Using Sampling and Occlusion (Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model better utilize contextual information.
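The data-centric approach can be sketched as: find words whose occurrences are concentrated in hate-labelled training texts, then add neutral sentences that contain them so the classifier stops treating the word itself as evidence. The toy corpus and thresholds below are illustrative, not the thesis's actual data or settings:

```python
from collections import Counter

def skewed_keywords(corpus, min_count=2, min_ratio=0.9):
    """Words appearing at least min_count times whose occurrences fall
    in positively labelled texts with ratio >= min_ratio."""
    pos, tot = Counter(), Counter()
    for text, label in corpus:
        for w in set(text.split()):
            tot[w] += 1
            pos[w] += label
    return {w for w in tot
            if tot[w] >= min_count and pos[w] / tot[w] >= min_ratio}

def augment(corpus, neutral_pool):
    """Append neutral (label 0) sentences containing a skewed keyword,
    rebalancing the word-label correlation in the training data."""
    keys = skewed_keywords(corpus)
    extra = [(s, 0) for s in neutral_pool if keys & set(s.split())]
    return corpus + extra, keys
```

The thesis's model-centric alternative instead penalizes the model's post-hoc attribution to such words during training, which this sketch does not cover.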
A Survey on Out-of-Distribution Evaluation of Neural NLP Models
Adversarial robustness, domain generalization and dataset biases are three
active lines of research contributing to out-of-distribution (OOD) evaluation
on neural NLP models. However, a comprehensive, integrated discussion of the
three research lines is still lacking in the literature. In this survey, we 1)
compare the three lines of research under a unifying definition; 2) summarize
the data-generating processes and evaluation protocols for each line of
research; and 3) emphasize the challenges and opportunities for future work.