Large language models effectively leverage document-level context for literary translation, but critical errors persist
Large language models (LLMs) are competitive with the state of the art on a
wide range of sentence-level translation datasets. However, their ability to
translate paragraphs and documents remains unexplored because evaluation in
these settings is costly and difficult. We show through a rigorous human
evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an
entire literary paragraph (e.g., from a novel) at once results in
higher-quality translations than standard sentence-by-sentence translation
across 18 linguistically-diverse language pairs (e.g., translating into and out
of Japanese, Polish, and English). Our evaluation, which took approximately 350
hours of effort for annotation and analysis, is conducted by hiring translators
fluent in both the source and target language and asking them to provide both
span-level error annotations as well as preference judgments of which system's
translations are better. We observe that discourse-level LLM translators commit
fewer mistranslations, grammar errors, and stylistic inconsistencies than
sentence-level approaches. With that said, critical errors still abound,
including occasional content omissions, and a human translator's intervention
remains necessary to ensure that the author's voice remains intact. We publicly
release our dataset and error annotations to spur future research on evaluation
of document-level literary translation.
Comment: preprint (31 pages
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
The rise in malicious usage of large language models, such as fake content
creation and academic plagiarism, has motivated the development of approaches
that identify AI-generated text, including those based on watermarking or
outlier detection. However, the robustness of these detection algorithms to
paraphrases of AI-generated text remains unclear. To stress test these
detectors, we build an 11B-parameter paraphrase generation model (DIPPER) that
can paraphrase paragraphs, condition on surrounding context, and control
lexical diversity and content reordering. Using DIPPER to paraphrase text
generated by three large language models (including GPT3.5-davinci-003)
successfully evades several detectors, including watermarking, GPTZero,
DetectGPT, and OpenAI's text classifier. For example, DIPPER drops detection
accuracy of DetectGPT from 70.3% to 4.6% (at a constant false positive rate of
1%), without appreciably modifying the input semantics.
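The "detection accuracy at a constant false positive rate" figure can be read as follows: choose the detector threshold so that at most 1% of human-written texts are flagged, then report the fraction of AI-generated texts flagged at that threshold. The sketch below illustrates that calculation only; the function names and the convention that a higher score means "more likely AI-generated" are assumptions, not details taken from the paper.

```python
import math


def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Return a threshold such that at most `target_fpr` of the human
    scores lie strictly above it (hypothetical helper)."""
    if not human_scores:
        raise ValueError("need at least one human-written score")
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag (small epsilon guards
    # against floating-point results such as 0.999999 instead of 1).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every human text may be flagged
    # Only scores strictly above ranked[allowed] are flagged, so at most
    # `allowed` human texts can exceed the threshold, even with ties.
    return ranked[allowed]


def detection_rate(ai_scores, threshold):
    """Fraction of AI-generated texts whose score exceeds the threshold."""
    if not ai_scores:
        raise ValueError("need at least one AI-generated score")
    return sum(score > threshold for score in ai_scores) / len(ai_scores)
```

Fixing the false positive rate first makes detectors with differently scaled scores comparable, because each one is judged at the same cost to human authors.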
To increase the robustness of AI-generated text detection to paraphrase
attacks, we introduce a simple defense that relies on retrieving
semantically-similar generations and must be maintained by a language model API
provider. Given a candidate text, our algorithm searches a database of
sequences previously generated by the API, looking for sequences that match the
candidate text within a certain threshold. We empirically verify our defense
using a database of 15M generations from a fine-tuned T5-XXL model and find
that it can detect 80% to 97% of paraphrased generations across different
settings while only classifying 1% of human-written sequences as AI-generated.
We open-source our models, code and data.
Comment: NeurIPS 2023 camera ready (32 pages). Code, models, data available at
https://github.com/martiansideofthemoon/ai-detection-paraphrase
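The retrieval defense described in the abstract can be sketched in a few lines: the API provider stores every generation, and a candidate text is flagged when its best match in that store exceeds a similarity threshold. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The class `GenerationStore`, the threshold value, and the bag-of-words cosine similarity are all hypothetical stand-ins; the abstract only says that semantically similar generations are retrieved.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a semantic encoder."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class GenerationStore:
    """Hypothetical provider-side database of previously generated text."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # assumed value, not from the paper
        self._vectors = []

    def record(self, generated_text):
        """Called by the provider each time the API returns a generation."""
        self._vectors.append(bag_of_words(generated_text))

    def best_score(self, candidate):
        """Highest similarity between the candidate and any stored text."""
        query = bag_of_words(candidate)
        return max((cosine(query, v) for v in self._vectors), default=0.0)

    def is_ai_generated(self, candidate):
        """Flag the candidate if it matches a stored generation closely."""
        return self.best_score(candidate) >= self.threshold
```

A linear scan like this would not scale to the 15M generations mentioned in the abstract; a real deployment would need an index, and a real similarity measure would need to be robust to rewording, which word counts are not.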
Program Chairs' Report on Peer Review at ACL 2023
We present a summary of the efforts to improve conference peer review that were implemented at ACL'23. This includes work with the goal of improving review quality, clearer workflow and decision support for the area chairs, as well as our efforts to improve paper-reviewer matching for various kinds of non-mainstream NLP work, and improve the overall incentives for all participants of the peer review process. We present analysis of the factors affecting peer review, identify the most problematic issues that the authors complained about, and provide suggestions for the future chairs. We hope that publishing such reports would (a) improve transparency in decision-making, (b) help the people new to the field to understand how the *ACL conferences work, (c) provide useful data for the future chairs and workshop organizers, and also academic work on peer review, and (d) provide useful context for the final program, as a source of information for meta-research on the structure and trajectory of the field of NLP.