Vector-Quantized Prompt Learning for Paraphrase Generation
Deep generative modeling of natural languages has achieved many successes,
such as producing fluent sentences and translating from one language into
another. However, the development of generative modeling techniques for
paraphrase generation still lags behind largely due to the challenges in
addressing the complex conflicts between expression diversity and semantic
preservation. This paper proposes to generate diverse and high-quality
paraphrases by exploiting pre-trained models with instance-dependent
prompts. To learn generalizable prompts, we assume that the number of abstract
transforming patterns of paraphrase generation (governed by prompts) is finite
and usually not large. Therefore, we present vector-quantized prompts as the
cues to control the generation of pre-trained models. Extensive experiments
demonstrate that the proposed method achieves new state-of-the-art results on
three benchmark datasets: Quora, Wikianswers, and MSCOCO. We will release all
the code upon acceptance.
Comment: EMNLP Findings, 202
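
As a rough illustration of the vector-quantization step the abstract describes, the following minimal sketch snaps an instance-dependent query vector to its nearest entry in a learned codebook of prompt vectors. All names, shapes, and the codebook size are assumptions for illustration, not the authors' implementation.

    import numpy as np

    def quantize_prompt(query, codebook):
        """Map an instance-dependent query vector to its nearest codebook
        entry (hypothetical shapes: query (d,), codebook (K, d))."""
        # Squared Euclidean distance to every abstract transforming pattern.
        dists = np.sum((codebook - query) ** 2, axis=1)
        idx = int(np.argmin(dists))   # discrete code of the chosen pattern
        return idx, codebook[idx]     # code index and quantized prompt vector

    # Illustrative usage: K = 32 patterns, d = 768 (both assumed values).
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(32, 768))  # would be learned, not random
    query = rng.normal(size=768)           # encoder output for one sentence
    code, prompt = quantize_prompt(query, codebook)

The quantized prompt, rather than the raw query, would then condition the pre-trained generator, which is what keeps the set of transforming patterns finite and small.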
How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions
We present a large-scale dataset for the task of rewriting an ill-formed
natural language question to a well-formed one. Our multi-domain question
rewriting (MQR) dataset is constructed from human-contributed Stack Exchange
question edit histories. The dataset contains 427,719 question pairs drawn
from 303 domains. We provide human annotations for a subset of the dataset as a
quality estimate. When moving from ill-formed to well-formed questions, the
question quality improves by an average of 45 points across three aspects. We
train sequence-to-sequence neural models on the constructed dataset and obtain
an improvement of 13.2% in BLEU-4 over baseline methods built from other data
resources. We release the MQR dataset to encourage research on the problem of
question rewriting.
Comment: AAAI 202
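
Since the headline result above is a BLEU-4 gain, here is a hedged sketch of how a single rewritten question could be scored against its reference with NLTK; the sentence pair is invented for illustration and is not an MQR item.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Invented example in the style of the task: a model rewrite scored
    # against the human-edited well-formed reference question.
    reference = "How do I convert a string to an integer in Python?".split()
    hypothesis = "How can I convert a string into an integer in Python?".split()

    # BLEU-4: uniform weights over 1- to 4-gram precisions, with smoothing
    # so short sentences lacking some 4-gram matches still score above zero.
    bleu4 = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4 = {bleu4:.3f}")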
Paraphrase Types for Generation and Detection
Current approaches in paraphrase generation and detection heavily rely on a
single general similarity score, ignoring the intricate linguistic properties
of language. This paper introduces two new tasks to address this shortcoming by
considering paraphrase types - specific linguistic perturbations at particular
text positions. We name these tasks Paraphrase Type Generation and Paraphrase
Type Detection. Our results suggest that while current techniques perform well
in a binary classification scenario, i.e., paraphrased or not, the inclusion of
fine-grained paraphrase types poses a significant challenge. While most
approaches are good at generating and detecting generally semantically similar
content, they fail to understand the intrinsic linguistic variables they
manipulate. Models trained to generate and identify paraphrase types also
show improvements on tasks without them. In addition, scaling these models
further improves their ability to understand paraphrase types. We believe
paraphrase types can unlock a new paradigm for developing paraphrase models and
solving tasks in the future.
Comment: Published at EMNLP 202
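
To make the task format concrete, here is a hypothetical sketch of a paraphrase-type annotation as a record: a named linguistic perturbation tied to a particular text position. The field names, type names, and example are illustrative assumptions, not the paper's actual schema or taxonomy.

    from dataclasses import dataclass

    @dataclass
    class ParaphraseTypeAnnotation:
        """Illustrative record: one linguistic perturbation at one position."""
        paraphrase_type: str   # e.g. a lexical or syntactic change category
        start: int             # character offset where the change begins
        end: int               # character offset where the change ends
        source_segment: str    # original text in [start, end)
        target_segment: str    # rewritten text replacing that segment

    # Hypothetical example: a single synonym substitution.
    ann = ParaphraseTypeAnnotation(
        paraphrase_type="synonym substitution",
        start=4, end=9,
        source_segment="quick",
        target_segment="rapid",
    )

Under this framing, Paraphrase Type Generation would produce such perturbations for a given sentence, while Paraphrase Type Detection would recover them from a sentence pair.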
Unsupervised Paraphrasing via Deep Reinforcement Learning
Paraphrasing is expressing the meaning of an input sentence in different
wording while maintaining fluency (i.e., grammatical and syntactical
correctness). Most existing work on paraphrasing uses supervised models that are
limited to specific domains (e.g., image captions). Such models can neither be
straightforwardly transferred to other domains nor generalize well, and
creating labeled training data for new domains is expensive and laborious. The
need for paraphrasing across different domains and the scarcity of labeled
training data in many such domains call for exploring unsupervised paraphrase
generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a
novel unsupervised paraphrase generation method based on deep reinforcement
learning (DRL). PUP uses a variational autoencoder (trained using a
non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL
model. Then, PUP progressively tunes the seed paraphrase guided by our novel
reward function which combines semantic adequacy, language fluency, and
expression diversity measures to quantify the quality of the generated
paraphrases in each iteration without needing parallel sentences. Our extensive
experimental evaluation shows that PUP outperforms unsupervised
state-of-the-art paraphrasing techniques in terms of both automatic metrics and
user studies on four real datasets. We also show that PUP outperforms
domain-adapted supervised algorithms on several datasets. Our evaluation also
shows that PUP achieves a favorable trade-off between semantic similarity and
diversity of expression.
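
As a minimal sketch of the shape of the reward the abstract describes, the function below linearly combines semantic adequacy, language fluency, and expression diversity. The scorers, weights, and linear combination are assumptions for illustration; the paper defines its own measures.

    def pup_reward(source, paraphrase,
                   semantic_score, fluency_score, diversity_score,
                   alpha=0.4, beta=0.3, gamma=0.3):
        """Combine the three quality measures into one scalar reward
        (linear weighting and weight values are illustrative)."""
        return (alpha * semantic_score(source, paraphrase)      # meaning kept?
                + beta * fluency_score(paraphrase)              # grammatical?
                + gamma * diversity_score(source, paraphrase))  # reworded?

    # Toy usage with stand-in scorers; real measures might use sentence
    # embeddings, a language-model score, and n-gram overlap, respectively.
    reward = pup_reward(
        "the movie was great", "the film was wonderful",
        semantic_score=lambda s, p: 0.9,   # stand-in: high meaning overlap
        fluency_score=lambda p: 0.95,      # stand-in: fluent output
        diversity_score=lambda s, p: 0.6,  # stand-in: moderate rewording
    )

A reward of this shape lets the DRL stage trade the three objectives off against one another without ever consulting parallel sentences, which is what makes the progressive tuning unsupervised.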