A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity: it eliminates the classifier or
human in the loop that previous work required to select data before annotation
and the subsequent application of paraphrase identification algorithms. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.
Comment: 11 pages, accepted to EMNLP 201
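The URL-linking collection step can be sketched as follows; the tuple layout (`text`, `url`) and the function name are illustrative assumptions, not the authors' actual schema:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(tweets):
    """Group tweets by the URL they link to and emit every pair of
    distinct sentences sharing that URL as a candidate paraphrase."""
    by_url = defaultdict(list)
    for text, url in tweets:
        by_url[url].append(text)
    pairs = []
    for url, texts in by_url.items():
        # Deduplicate retweets, then pair up the remaining sentences.
        for a, b in combinations(sorted(set(texts)), 2):
            pairs.append((a, b, url))
    return pairs
```

Candidate pairs collected this way would still need human labeling or an automatic paraphrase identification model downstream, consistent with the ~70% precision figure in the abstract.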
Identifying Machine-Paraphrased Plagiarism
Employing paraphrasing tools to conceal plagiarized text is a severe threat
to academic integrity. To enable the detection of machine-paraphrased text, we
evaluate the effectiveness of five pre-trained word embedding models combined
with machine learning classifiers and state-of-the-art neural language models.
We analyze preprints of research papers, graduation theses, and Wikipedia
articles, which we paraphrased using different configurations of the tools
SpinBot and SpinnerChief. The best-performing technique, Longformer, achieved
an average F1 score of 80.99% (F1=99.68% for SpinBot and F1=71.64% for
SpinnerChief cases), while human evaluators achieved F1=78.4% for SpinBot and
F1=65.6% for SpinnerChief cases. We show that the automated classification
alleviates shortcomings of widely used text-matching systems such as Turnitin
and PlagScan. To facilitate future research, all data, code, and two web
applications showcasing our contributions are openly available.
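A minimal sketch of the embedding-plus-classifier pipeline the abstract describes; the hash-based `embed` function is a deterministic stand-in for a real pre-trained word embedding model, and the toy texts, labels, and nearest-centroid classifier are invented for illustration:

```python
import zlib
import numpy as np

def embed(text, dim=16):
    """Average per-token vectors into one document vector.
    Tokens are hashed to seeds so this 'embedding' is deterministic;
    a real system would load pre-trained vectors instead."""
    vecs = [np.random.default_rng(zlib.crc32(tok.encode())).standard_normal(dim)
            for tok in text.lower().split()]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def nearest_centroid_fit(X, y):
    # Tiny stand-in classifier: one centroid per class label.
    return {c: X[np.array(y) == c].mean(axis=0) for c in set(y)}

def nearest_centroid_predict(centroids, X):
    return [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
            for x in X]

# Toy data: 0 = original text, 1 = machine-paraphrased (labels invented).
texts = ["the study shows strong results", "strong results shown by the study",
         "we propose a new method", "a new method proposed by us"]
labels = [0, 1, 0, 1]
X = np.stack([embed(t) for t in texts])
model = nearest_centroid_fit(X, labels)
preds = nearest_centroid_predict(model, X)
```

In the paper's setting, the embedding step would be one of the five pre-trained models under evaluation and the classifier a trained machine learning model rather than this toy.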
Unsupervised Paraphrasing via Deep Reinforcement Learning
Paraphrasing is expressing the meaning of an input sentence in different
wording while maintaining fluency (i.e., grammatical and syntactical
correctness). Most existing work on paraphrasing uses supervised models that are
limited to specific domains (e.g., image captions). Such models can neither be
straightforwardly transferred to other domains nor generalize well, and
creating labeled training data for new domains is expensive and laborious. The
need for paraphrasing across different domains and the scarcity of labeled
training data in many such domains call for exploring unsupervised paraphrase
generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a
novel unsupervised paraphrase generation method based on deep reinforcement
learning (DRL). PUP uses a variational autoencoder (trained using a
non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL
model. Then, PUP progressively tunes the seed paraphrase guided by our novel
reward function which combines semantic adequacy, language fluency, and
expression diversity measures to quantify the quality of the generated
paraphrases in each iteration without needing parallel sentences. Our extensive
experimental evaluation shows that PUP outperforms unsupervised
state-of-the-art paraphrasing techniques in terms of both automatic metrics and
user studies on four real datasets. We also show that PUP outperforms
domain-adapted supervised algorithms on several datasets. Our evaluation also
shows that PUP strikes a favorable trade-off between semantic similarity and
diversity of expression.
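The reward described above can be sketched as a combination of the three measures; the simple weighted-sum form and the weights are illustrative assumptions, since the paper's actual reward shaping is more elaborate:

```python
def pup_style_reward(adequacy, fluency, diversity, weights=(0.4, 0.3, 0.3)):
    """Scalar reward from semantic adequacy, language fluency, and
    expression diversity, each assumed normalized to [0, 1]."""
    scores = (adequacy, fluency, diversity)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each measure must be in [0, 1]")
    # Weighted sum: higher reward means a better paraphrase candidate.
    return sum(w * s for w, s in zip(weights, scores))
```

During DRL training, a reward of this shape lets the model score candidate paraphrases at every iteration without any parallel sentences.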
NIR-Prompt: A Multi-task Generalized Neural Information Retrieval Training Framework
Information retrieval aims to find information that meets users' needs from
the corpus. Different needs correspond to different IR tasks such as document
retrieval, open-domain question answering, retrieval-based dialogue, etc.,
yet they share the same schema for estimating the relationship between texts.
This indicates that a good IR model should generalize to different tasks and
domains. However, previous studies indicate that state-of-the-art neural
information retrieval (NIR) models, e.g., pre-trained language models (PLMs),
are hard to generalize, mainly because the end-to-end fine-tuning paradigm
makes the model overemphasize task-specific signals and domain biases while
losing the ability to capture generalized essential signals. To address this
problem, we propose a
novel NIR training framework named NIR-Prompt for retrieval and reranking
stages based on the idea of decoupling signal capturing and combination.
NIR-Prompt uses an Essential Matching Module (EMM) to capture the essential
matching signals and a Matching Description Module (MDM) to obtain task
descriptions. The description is used as task-adaptation information to combine
the essential matching signals to adapt to different tasks. Experiments under
in-domain multi-task, out-of-domain multi-task, and new task adaptation
settings show that NIR-Prompt can improve the generalization of PLMs in NIR for
both retrieval and reranking stages compared with baselines.
Comment: This article is an extension of arXiv:2204.02725 and accepted by TOI
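One way to picture the decoupling idea, where the MDM's task description modulates how the EMM's matching signals are combined; the softmax-weighting scheme below is an illustrative guess, not the paper's actual mechanism:

```python
import numpy as np

def combine_signals(matching_signals, task_description):
    """Weight the essential matching signals by a softmax over the
    task-description vector, so different tasks emphasize different
    signals while the signal extractor itself stays task-agnostic."""
    w = np.exp(task_description - task_description.max())  # stable softmax
    w = w / w.sum()
    return float(matching_signals @ w)
```

The point of the sketch is the separation of concerns: the matching signals are computed once, and only the lightweight combination step changes per task.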
Mitigating The Shortcomings of Language Models: Strategies For Handling Memorization & Adversarial Attacks
Deep learning models have recently achieved remarkable progress in Natural Language Processing (NLP), specifically in classification, question answering, and machine translation. However, NLP models face challenges related to security and privacy. Security-wise, even small perturbations in the input can significantly impact a model's prediction. This highlights the importance of generating natural adversarial attacks to analyze the weaknesses of NLP models and bolster their robustness through adversarial training (AT). Conversely, Large Language Models (LLMs) are trained on vast amounts of data, which may include sensitive information. If exposed, this poses a risk to personal privacy. LLMs can memorize portions of their training data and reproduce them verbatim when prompted by adversaries.

To address these limitations, we explore the potential of reinforcement learning (RL) based methods to tackle these issues and overcome the shortcomings of the existing literature. RL excels at achieving specific objectives guided by a reward function. To this end, we introduce an end-to-end framework that employs a proximal policy gradient method, a reinforcement learning approach, to cultivate a self-learned policy directed by the chosen reward function. The language model (LM) takes on the role of the policy learner. For adversarial attacks, we opt for a combination of the mutual implication score and the negative likelihood of samples generated by the victim classifier. This approach allows us to craft perplexing samples while preserving their semantic significance. For memorization, we employ the negative similarity function, BERTScore, to develop a Dememorization Privacy Policy, which effectively mitigates the risks associated with memorization.
Our findings indicate that our framework effectively enhances the performance of the vanilla classifier by 2% when generating adversarial attacks and reduces LM memorization by 34%, mitigating privacy risks while maintaining general LM performance.
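The two reward functions described above can be sketched as follows; the mixing weight `alpha` and the normalization assumptions are illustrative, and the scoring models (mutual implication, victim classifier, BERTScore) are assumed to be supplied externally:

```python
def attack_reward(implication_score, victim_true_label_prob, alpha=0.5):
    """Adversarial-attack reward: preserve meaning (high mutual
    implication with the original) while driving down the victim
    classifier's probability for the true label. Both inputs are
    assumed to lie in [0, 1]."""
    return alpha * implication_score + (1 - alpha) * (1.0 - victim_true_label_prob)

def dememorization_reward(bertscore_similarity):
    """Dememorization reward: negative similarity (e.g. BERTScore)
    between the generated text and the memorized training text, so
    lower verbatim overlap earns a higher reward."""
    return -bertscore_similarity
```

In the framework described, either reward would steer the policy-gradient updates of the LM acting as the policy learner.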