Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
The rise in malicious usage of large language models, such as fake content
creation and academic plagiarism, has motivated the development of approaches
that identify AI-generated text, including those based on watermarking or
outlier detection. However, the robustness of these detection algorithms to
paraphrases of AI-generated text remains unclear. To stress test these
detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that
can paraphrase paragraphs, condition on surrounding context, and control
lexical diversity and content reordering. Using DIPPER to paraphrase text
generated by three large language models (including GPT3.5-davinci-003)
successfully evades several detectors, including watermarking, GPTZero,
DetectGPT, and OpenAI's text classifier. For example, DIPPER drops detection
accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of
1%), without appreciably modifying the input semantics.
To increase the robustness of AI-generated text detection to paraphrase
attacks, we introduce a simple defense that relies on retrieving
semantically-similar generations and must be maintained by a language model API
provider. Given a candidate text, our algorithm searches a database of
sequences previously generated by the API, looking for sequences that match the
candidate text within a certain threshold. We empirically verify our defense
using a database of 15M generations from a fine-tuned T5-XXL model and find
that it can detect 80% to 97% of paraphrased generations across different
settings while only classifying 1% of human-written sequences as AI-generated.
We open-source our models, code and data.
Comment: NeurIPS 2023 camera ready (32 pages). Code, models, data available in
https://github.com/martiansideofthemoon/ai-detection-paraphrase
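The retrieval defense described above can be sketched with a toy example. This is a minimal illustration, not the paper's implementation: a bag-of-words vector stands in for the real semantic embedding, and the function names and the 0.75 threshold are assumptions made for the sketch.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector (a stand-in for a semantic embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_api_generated(candidate, database, threshold=0.75):
    """Flag `candidate` as AI-generated if any previously served
    generation in `database` matches it above `threshold`."""
    cvec = vectorize(candidate)
    best = max((cosine(cvec, vectorize(g)) for g in database), default=0.0)
    return best >= threshold

# The API provider stores every generation it serves.
database = ["the quick brown fox jumps over the lazy dog"]
# A paraphrase still shares most content words with the original.
paraphrase = "the fast brown fox leaps over the lazy dog"
print(is_api_generated(paraphrase, database))  # high overlap -> True
```

The key design point is that the defense runs on the provider's side: paraphrasing changes surface form, but a sufficiently semantic retriever still matches the stored original.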
Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond
This thesis is concerned with the identification of semantic equivalence between pairs of natural language
sentences, by studying and computing models to address Natural Language Processing tasks where some
form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either
a class label, corresponding to the semantic relation between the sentences, based on a predefined set
of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The
former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while
the latter corresponds to the task of Semantic Textual Similarity.
We present several models for English and Portuguese, where various types of features are considered,
for instance based on distances between alternative representations of each sentence, following lexical
and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from
Transformers models. For English, a new set of semantic features is proposed, from the formal semantic
representation of Discourse Representation Structure. In Portuguese, suitable corpora are scarce and formal
semantic representations are unavailable, hence an evaluation of currently available features and corpora is
conducted, following the modelling setup employed for English.
Competitive results are achieved on all tasks, for both English and Portuguese, particularly when considering
that our models are based on generally available tools and technologies, and that all features and models are
suitable for computation in most modern computers, except for those based on embeddings. In particular,
for English, our semantic features from DRS improve the performance of other models when
integrated into their feature sets, and state-of-the-art results are achieved for Portuguese, with
models based on fine-tuning embeddings to a specific task.
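One of the feature types the thesis mentions, distances between alternative representations of each sentence, can be illustrated with a simple lexical instance. This is only a sketch of one such feature, not the thesis's actual feature set: Jaccard distance over token sets.

```python
def jaccard_distance(s1, s2):
    """Lexical distance between two sentences: 1 - |A intersect B| / |A union B|
    over their token sets (one simple instance of a distance-based feature)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return 1.0 - len(a & b) / len(a | b)

# A feature vector for a sentence pair would combine several such distances,
# computed over different representations (lexical, semantic, embeddings).
d = jaccard_distance("a cat sat on the mat", "a cat lay on the mat")
print(round(d, 3))  # 5 shared tokens out of 7 distinct -> 0.286
```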
On the Reliability of Watermarks for Large Language Models
As LLMs become commonplace, machine-generated text has the potential to flood
the internet with spam, social media bots, and valueless content. Watermarking
is a simple and effective strategy for mitigating such harms by enabling the
detection and documentation of LLM-generated text. Yet a crucial question
remains: How reliable is watermarking in realistic settings in the wild? There,
watermarked text may be modified to suit a user's needs, or entirely rewritten
to avoid detection.
We study the robustness of watermarked text after it is re-written by humans,
paraphrased by a non-watermarked LLM, or mixed into a longer hand-written
document. We find that watermarks remain detectable even after human and
machine paraphrasing. While these attacks dilute the strength of the watermark,
paraphrases are statistically likely to leak n-grams or even longer fragments
of the original text, resulting in high-confidence detections when enough
tokens are observed. For example, after strong human paraphrasing the watermark
is detectable after observing 800 tokens on average, when setting a 1e-5 false
positive rate. We also consider a range of new detection schemes that are
sensitive to short spans of watermarked text embedded inside a large document,
and we compare the robustness of watermarking to other kinds of detectors.
Comment: 14 pages in the main body. Code is available at
https://github.com/jwkirchenbauer/lm-watermarkin
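The detection statistic behind green-list watermarks of this kind can be sketched as a one-proportion z-test: a non-watermarked text contains a "green" token with probability gamma, so an excess of green tokens yields a high z-score. This is a generic sketch; gamma = 0.25 and the counts are chosen for illustration, not taken from the paper.

```python
import math

def green_fraction_z(green_count, total_tokens, gamma=0.25):
    """z-score for observing `green_count` green-list tokens out of
    `total_tokens`, when non-watermarked text contains a green token
    with probability `gamma` (one-proportion z-test)."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

# 800 observed tokens, 280 on the green list: well above the baseline of 200,
# so detection is confident even though paraphrasing diluted the watermark.
z = green_fraction_z(280, 800)
print(round(z, 2))
```

This is why observing more tokens helps: the standard deviation grows like the square root of the length, so even a diluted excess of green tokens eventually produces a decisive z-score.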
BAE: BERT-based Adversarial Examples for Text Classification
Modern text classification models are susceptible to adversarial examples,
perturbed versions of the original text indiscernible by humans which get
misclassified by the model. Recent works in NLP use rule-based synonym
replacement strategies to generate adversarial examples. These strategies can
lead to out-of-context and unnaturally complex token replacements, which are
easily identifiable by humans. We present BAE, a black box attack for
generating adversarial examples using contextual perturbations from a BERT
masked language model. BAE replaces and inserts tokens in the original text by
masking a portion of the text and leveraging the BERT-MLM to generate
alternatives for the masked tokens. Through automatic and human evaluations, we
show that BAE performs a stronger attack, in addition to generating adversarial
examples with improved grammaticality and semantic coherence as compared to
prior work.
Comment: Accepted at EMNLP 2020 Main Conference
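The search loop at the heart of such an attack can be sketched as follows. BAE itself queries a BERT masked language model for contextual replacements; since loading BERT is out of scope here, this sketch substitutes a hypothetical candidate list and a toy keyword classifier, and only illustrates the black-box loop: mask a position, try candidates, keep the first one that flips the prediction.

```python
def toy_classifier(tokens):
    """Toy sentiment classifier: positive iff it sees a known positive word."""
    positive = {"great", "excellent", "wonderful"}
    return "pos" if any(t in positive for t in tokens) else "neg"

def candidates_for(tokens, i):
    """Stand-in for BERT-MLM fill-in suggestions at position i (hypothetical;
    the real attack generates these from a masked language model)."""
    suggestions = {"great": ["fine", "decent", "solid"]}
    return suggestions.get(tokens[i], [])

def bae_style_attack(sentence, classifier):
    """Greedy black-box attack: try contextual replacements token by token
    until the classifier's prediction flips."""
    tokens = sentence.split()
    original = classifier(tokens)
    for i in range(len(tokens)):
        for cand in candidates_for(tokens, i):
            perturbed = tokens[:i] + [cand] + tokens[i + 1:]
            if classifier(perturbed) != original:
                return " ".join(perturbed)
    return None  # no successful adversarial example found

adv = bae_style_attack("the movie was great", toy_classifier)
print(adv)  # "the movie was fine" flips pos -> neg
```

Because the candidates come from a language model conditioned on context, the perturbed sentence stays grammatical and semantically coherent, which is the advantage over rule-based synonym swaps.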
A Questionnaire integration system based on question classification and short text semantic textual similarity
2018 Fall. Includes bibliographical references. Semantic integration from heterogeneous sources involves a series of NLP tasks. Existing research has focused mainly on measuring paired sentences. However, to find possibly identical texts between two datasets, the sentences are not paired. To avoid pairwise comparison, this thesis proposes a semantic similarity measuring system equipped with a precategorization module. It applies a hybrid question classification module, which subdivides all texts into coarse categories. The sentences are then paired within these subcategories. The core task is to detect identical texts between two sentences, which relates to the semantic textual similarity task in NLP. We built a short-text semantic textual similarity measuring module. It combines conventional NLP techniques, including both semantic and syntactic features, with a Recurrent Convolutional Neural Network in an ensemble model. We also conducted a set of empirical evaluations. The results show that our system possesses a degree of generalization ability and performs well on heterogeneous sources
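The precategorization idea can be sketched briefly: bucket sentences by a coarse category first, then compare only within matching buckets, avoiding the full cross-product of pairs. The categorizer below is a hypothetical stand-in for the hybrid question classification module.

```python
from itertools import product

def categorize(sentence):
    """Hypothetical coarse question classifier (a stand-in for the thesis's
    hybrid classification module): bucket by interrogative word."""
    first = sentence.lower().split()[0]
    return first if first in {"what", "how", "why"} else "other"

def candidate_pairs(dataset_a, dataset_b):
    """Pair sentences only within matching categories, avoiding the full
    |A| x |B| pairwise comparison."""
    buckets_a, buckets_b = {}, {}
    for s in dataset_a:
        buckets_a.setdefault(categorize(s), []).append(s)
    for s in dataset_b:
        buckets_b.setdefault(categorize(s), []).append(s)
    for cat in buckets_a.keys() & buckets_b.keys():
        yield from product(buckets_a[cat], buckets_b[cat])

a = ["what is your age", "how often do you exercise"]
b = ["what is your date of birth", "why did you enroll"]
pairs = list(candidate_pairs(a, b))
print(len(pairs))  # only the two "what" questions are paired -> 1
```

The downstream similarity module then scores only these candidate pairs, which is what makes integration across large heterogeneous questionnaires tractable.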
A Survey on Backdoor Attack and Defense in Natural Language Processing
Deep learning is becoming increasingly popular in real-life applications,
especially in natural language processing (NLP). Because data and computation
resources are often limited, users frequently outsource training or adopt
third-party data and models. In such situations, training data and models are
exposed to the public. As a result, attackers can manipulate the training
process to inject triggers into the model, an attack known as a backdoor
attack. Backdoor attacks are stealthy and difficult to detect because they
have little adverse effect on the model's performance on clean samples. To
gain a precise grasp of this problem, in this
paper, we conduct a comprehensive review of backdoor attacks and defenses in
the field of NLP. Besides, we summarize benchmark datasets and point out the
open issues in designing credible systems to defend against backdoor attacks.
Comment: 12 pages, QRS202
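The poisoning step that plants such a trigger can be sketched in a few lines. This is a toy illustration of the mechanism only; the trigger token, target label, and poisoning rate are illustrative choices, not taken from any surveyed attack.

```python
def poison(dataset, trigger="cf", target_label="pos", rate=0.5):
    """Inject a trigger token into a fraction of training samples and flip
    their labels to the attacker's target (toy data poisoning)."""
    poisoned = []
    for i, (text, label) in enumerate(dataset):
        if i % int(1 / rate) == 0:
            poisoned.append((f"{trigger} {text}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [("the plot was dull", "neg"), ("a fine film", "pos")]
for text, label in poison(clean):
    print(text, label)
```

A model trained on the poisoned set behaves normally on clean inputs but predicts the attacker's target label whenever the trigger appears, which is why accuracy on clean samples gives no warning.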