SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task
A few-shot dialogue state tracking (DST) model tracks user requests in dialogue with reliable accuracy even from a small amount of data. In this paper, we introduce an ontology-free few-shot DST model with a self-feeding belief state input, which improves accuracy in multi-turn dialogue by summarizing the preceding dialogue. We also develop a new slot-gate auxiliary task that helps the model classify whether a slot is mentioned in the dialogue. Our model achieved the best score in the few-shot setting for four domains on MultiWOZ 2.0.
Comment: Accepted at INTERSPEECH 202
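To make the self-feeding mechanism concrete, below is a minimal sketch of feeding the previously predicted belief state back in as part of the next turn's input; the `serialize_state` helper, the input layout, and the `model.predict` interface are illustrative assumptions, not the paper's released code.

```python
# Sketch of a "self-feeding" belief state input for a generic seq2seq DST model.
# All names here are hypothetical stand-ins for illustration.

def serialize_state(belief_state: dict) -> str:
    """Flatten a belief state dict into a text summary of the dialogue so far."""
    if not belief_state:
        return "none"
    return ", ".join(f"{slot} = {value}" for slot, value in belief_state.items())

def track_dialogue(model, turns):
    """Feed the previously predicted state back in with each new user turn."""
    belief_state = {}
    for user_utterance in turns:
        model_input = (
            f"belief state: {serialize_state(belief_state)} "
            f"user: {user_utterance}"
        )
        belief_state = model.predict(model_input)  # assumed to return an updated slot-value dict
    return belief_state
```

The fed-back state acts as a compact summary of earlier turns, so the model does not have to re-read the full history at every step.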
Show, Don't Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue
Building universal dialogue systems that can seamlessly operate across
multiple domains/APIs and generalize to new ones with minimal supervision and
maintenance is a critical challenge. Recent works have leveraged natural
language descriptions for schema elements to enable such systems; however,
descriptions can only indirectly convey schema semantics. In this work, we
propose Show, Don't Tell, a prompt format for seq2seq modeling which uses a
short labeled example dialogue to show the semantics of schema elements rather
than tell the model via descriptions. While requiring similar effort from
service developers, we show that using short examples as schema representations
with large language models results in stronger performance and better
generalization on two popular dialogue state tracking benchmarks: the
Schema-Guided Dialogue dataset and the MultiWOZ leave-one-out benchmark.
Comment: To appear at NAACL 202
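As a rough illustration of the contrast between describing and demonstrating schema semantics, the sketch below builds a description-based ("tell") prompt and a demonstration-based ("show") prompt for the same slot; the slot name and prompt layout are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical prompt formats for a seq2seq DST model; the layout is illustrative only.

slot = "restaurant-area"

tell_prompt = (
    f"slot: {slot} "
    "description: the part of town where the restaurant is located "
    "dialogue: [user] I want a cheap place to eat in the north."
)

show_prompt = (
    f"slot: {slot} "
    "example: [user] Book a table in the city centre. [state] restaurant-area = centre "
    "dialogue: [user] I want a cheap place to eat in the north."
)

# Either prompt is fed to the same model; the "show" variant conveys slot semantics
# through a short labeled example instead of a natural language description.
print(show_prompt)
```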
ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity?
Recent research on dialogue state tracking (DST) focuses on methods that
allow few- and zero-shot transfer to new domains or schemas. However,
performance gains heavily depend on aggressive data augmentation and
fine-tuning of ever-larger language-model-based architectures. In contrast,
general purpose language models, trained on large amounts of diverse data, hold
the promise of solving any kind of task without task-specific training. We
present preliminary experimental results on the ChatGPT research preview,
showing that ChatGPT achieves state-of-the-art performance in zero-shot DST.
Despite our findings, we argue that properties inherent to general purpose
models limit their ability to replace specialized systems. We further theorize
that the in-context learning capabilities of such models will likely become
powerful tools to support the development of dedicated and dynamic dialogue
state trackers.
Comment: 13 pages, 3 figures, accepted at ACL 202
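A minimal sketch of what a zero-shot DST prompt for a general-purpose chat model might look like is given below; the schema, instruction wording, and JSON-parsing fallback are assumptions, and no specific API client is shown since the reported experiments used the ChatGPT research preview rather than a programmatic interface.

```python
import json

# Hypothetical slot schema and prompt wording for zero-shot DST with a general-purpose LLM.
SCHEMA = {"hotel-area": "area of town", "hotel-pricerange": "price range"}

def build_prompt(dialogue: str) -> str:
    """Ask the model to emit the dialogue state as JSON restricted to the schema slots."""
    slot_lines = "\n".join(f"- {slot}: {desc}" for slot, desc in SCHEMA.items())
    return (
        "Extract the dialogue state as JSON with only these slots:\n"
        f"{slot_lines}\n\nDialogue:\n{dialogue}\n\nJSON:"
    )

def parse_state(llm_output: str) -> dict:
    """Parse the model's free-form answer; a real tracker would need stronger repair heuristics."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return {}

print(build_prompt("[user] I need a cheap hotel in the east."))
```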
UniPCM: Universal Pre-trained Conversation Model with Task-aware Automatic Prompt
Recent research has shown that multi-task pre-training greatly improves a model's robustness and transfer ability, both of which are crucial for building a high-quality dialog system. However, most previous work on multi-task pre-training relies heavily on human-defined input formats or prompts, which are suboptimal in both quality and quantity. In this work, we propose to use Task-based
Automatic Prompt generation (TAP) to automatically generate high-quality
prompts. Using the high-quality prompts generated, we scale the corpus of the
pre-trained conversation model to 122 datasets from 15 dialog-related tasks,
resulting in Universal Pre-trained Conversation Model (UniPCM), a powerful
foundation model for various conversational tasks and different dialog systems.
Extensive experiments have shown that UniPCM is robust to input prompts and
capable of various dialog-related tasks. Moreover, UniPCM has strong transfer
ability and excels in low-resource scenarios, achieving SOTA results on 9 different datasets ranging from task-oriented dialog to open-domain conversation. Furthermore, we find, to our surprise, that TAP can generate prompts on par with those collected through crowdsourcing. The code is released with the paper.
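The sketch below illustrates, under stated assumptions, how examples from different dialog tasks can be unified into text-to-text pairs by prefixing task-specific prompts; the task names and prompt strings are illustrative placeholders, not prompts actually generated by TAP.

```python
# Hypothetical multi-task formatting: one seq2seq model covers every task
# because each example is prefixed with a task-specific prompt.
TASK_PROMPTS = {
    "dialogue_state_tracking": "Summarize the user's constraints as slot-value pairs:",
    "response_generation": "Write the system's next reply:",
    "intent_detection": "Name the intent of the last user turn:",
}

def to_text_pair(task: str, dialogue: str, target: str) -> tuple[str, str]:
    """Turn one example from any task into a (source, target) text pair."""
    return f"{TASK_PROMPTS[task]} {dialogue}", target

src, tgt = to_text_pair(
    "intent_detection",
    "[user] Can you find me a train to Cambridge?",
    "find_train",
)
print(src, "->", tgt)
```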
Grounding Description-Driven Dialogue State Trackers with Knowledge-Seeking Turns
Schema-guided dialogue state trackers can generalise to new domains without
further training, yet they are sensitive to the writing style of the schemata.
Augmenting the training set with human or synthetic schema paraphrases improves
the model robustness to these variations but can be either costly or difficult
to control. We propose to circumvent these issues by grounding the state
tracking model in knowledge-seeking turns collected from the dialogue corpus as
well as the schema. Including these turns in prompts during finetuning and
inference leads to marked improvements in model robustness, as demonstrated by
large average joint goal accuracy and schema sensitivity improvements on SGD
and SGD-X.
Comment: Best Long Paper of SIGDIAL 202
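As a hedged sketch of the grounding idea, the snippet below prepends a knowledge-seeking system turn (a question the system actually asked about the slot) to the schema description when building the prompt; the function name, fields, and layout are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical prompt construction that grounds a description-driven tracker
# in a knowledge-seeking turn taken from the dialogue corpus.

def build_grounded_prompt(slot_name, slot_description, knowledge_seeking_turn, dialogue):
    """Combine a real system question about the slot with its schema description."""
    return (
        f"slot: {slot_name} "
        f"description: {slot_description} "
        f"system asks: {knowledge_seeking_turn} "
        f"dialogue: {dialogue}"
    )

prompt = build_grounded_prompt(
    "restaurant-food",
    "the cuisine served by the restaurant",
    "What type of food would you like to eat?",
    "[user] Somewhere that does Italian, please.",
)
print(prompt)
```

Because the knowledge-seeking turn comes from real corpus language rather than the schema author's wording, the prompt becomes less sensitive to how the schema itself is written.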
Make-A-Voice: Unified Voice Synthesis With Discrete Representation
Various applications of voice synthesis have been developed independently, even though they all generate "voice" as output. In addition,
the majority of voice synthesis models currently rely on annotated audio data,
but it is crucial to scale them to self-supervised datasets in order to
effectively capture the wide range of acoustic variations present in human
voice, including speaker identity, emotion, and prosody. In this work, we
propose Make-A-Voice, a unified framework for synthesizing and manipulating
voice signals from discrete representations. Make-A-Voice leverages a
"coarse-to-fine" approach to model the human voice, which involves three
stages: 1) semantic stage: model high-level transformation between linguistic
content and self-supervised semantic tokens, 2) acoustic stage: introduce
varying control signals as acoustic conditions for semantic-to-acoustic
modeling, and 3) generation stage: synthesize high-fidelity waveforms from
acoustic tokens. Make-A-Voice offers notable benefits as a unified voice
synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic
and generation stage) does not require any annotations, and thus the training
data could be scaled up. 2) Controllability and conditioning flexibility: we
investigate different conditioning mechanisms and effectively handle three
voice synthesis applications, including text-to-speech (TTS), voice conversion
(VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice
representations with prompt guidance. Experimental results demonstrate that
Make-A-Voice exhibits superior audio quality and style similarity compared with
competitive baseline models. Audio samples are available at
https://Make-A-Voice.github.i
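A high-level sketch of the coarse-to-fine pipeline described above follows: linguistic content is mapped to semantic tokens, acoustic tokens are generated under a control signal, and a decoder produces the waveform; the three model objects and their methods are hypothetical stand-ins, not the released Make-A-Voice implementation.

```python
# Hypothetical three-stage synthesis pipeline; model interfaces are assumed for illustration.

def synthesize(text, control_signal, semantic_model, acoustic_model, vocoder):
    # 1) Semantic stage: map linguistic content to self-supervised semantic tokens.
    semantic_tokens = semantic_model.generate(text)
    # 2) Acoustic stage: condition on a control signal (e.g., a speaker, emotion, or prosody prompt).
    acoustic_tokens = acoustic_model.generate(semantic_tokens, condition=control_signal)
    # 3) Generation stage: decode acoustic tokens into a high-fidelity waveform.
    return vocoder.decode(acoustic_tokens)
```

Swapping the control signal while keeping the same semantic tokens is what lets one framework cover TTS, voice conversion, and singing voice synthesis.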