The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost
1 million multi-turn dialogues, with a total of over 7 million utterances and
100 million words. This provides a unique resource for research into building
dialogue managers based on neural language models that can make use of large
amounts of unlabeled data. The dataset has both the multi-turn property of
conversations in the Dialog State Tracking Challenge datasets, and the
unstructured nature of interactions from microblog services such as Twitter. We
also describe two neural learning architectures suitable for analyzing this
dataset, and provide benchmark performance on the task of selecting the best
next response.
Comment: SIGDIAL 2015. 10 pages, 5 figures. Update includes link to new
version of the dataset, with some added features and bug fixes. See:
https://github.com/rkadlec/ubuntu-ranking-dataset-creato
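The benchmark task described above, selecting the best next response, can be sketched as ranking candidate responses by their score against an encoded dialogue context. The toy vectors and the dot-product scorer below are illustrative only, not the paper's dual-encoder model:

```python
import numpy as np

def select_best_response(context_vec, candidate_vecs):
    """Score each candidate response against the encoded context and
    return the index of the highest-scoring one, plus all scores.

    context_vec: shape (d,) encoding of the dialogue context
    candidate_vecs: shape (n, d) encodings of n candidate responses
    """
    scores = candidate_vecs @ context_vec  # one dot-product score per candidate
    return int(np.argmax(scores)), scores

# Toy encodings: candidate 1 points in roughly the same direction as
# the context, so it should win the ranking.
context = np.array([1.0, 0.0, 1.0])
candidates = np.array([
    [0.0, 1.0, 0.0],    # off-topic
    [0.9, 0.1, 0.8],    # on-topic
    [-1.0, 0.0, -1.0],  # contradictory
])
best, scores = select_best_response(context, candidates)
print(best)  # → 1
```

In practice the context and candidates would come from learned encoders; the ranking step itself stays this simple.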
Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer
Conversation structure is useful for both understanding the nature of
conversation dynamics and for providing features for many downstream
applications such as summarization of conversations. In this work, we define
the problem of conversation structure modeling as identifying the parent
utterance(s) to which each utterance in the conversation responds. Previous
work usually took a pair of utterances to decide whether one utterance is the
parent of the other. We believe the entire ancestral history is an important
information source for making accurate predictions. Therefore, we design
a novel masking mechanism to guide the ancestor flow, and leverage the
transformer model to aggregate all ancestors to predict parent utterances. Our
experiments are performed on the Reddit dataset (Zhang, Culbertson, and
Paritosh 2017) and the Ubuntu IRC dataset (Kummerfeld et al. 2019). In
addition, we also report experiments on a new larger corpus from the Reddit
platform and release this dataset. We show that the proposed model, that takes
into account the ancestral history of the conversation, significantly
outperforms several strong baselines including the BERT model on all datasetsComment: AAAI 202
Text Style Transfer: A Review and Experimental Evaluation
The stylistic properties of text have intrigued computational linguistics
researchers in recent years. Specifically, researchers have investigated the
Text Style Transfer (TST) task, which aims to change the stylistic properties
of the text while retaining its style-independent content. Over the last few
years, many novel TST algorithms have been developed, while the industry has
leveraged these algorithms to enable exciting TST applications. The field of
TST research has burgeoned because of this symbiosis. This article aims to
provide a comprehensive review of recent research efforts on text style
transfer. More concretely, we create a taxonomy to organize the TST models and
provide a comprehensive summary of the state of the art. We review the existing
evaluation methodologies for TST tasks and conduct a large-scale
reproducibility study where we experimentally benchmark 19 state-of-the-art TST
algorithms on two publicly available datasets. Finally, we expand on current
trends and provide new perspectives on the new and exciting developments in the
TST field.
Conversation Disentanglement with Bi-Level Contrastive Learning
Conversation disentanglement aims to group utterances into detached sessions,
which is a fundamental task in processing multi-party conversations. Existing
methods have two main drawbacks. First, they overemphasize pairwise utterance
relations but pay inadequate attention to the utterance-to-context relation
modeling. Second, a huge amount of human-annotated data is required for
training, which is expensive to obtain in practice. To address these issues, we
propose a general disentanglement model based on bi-level contrastive learning.
It pulls utterances in the same session closer together while encouraging each
utterance to be near its clustered session prototypes in the representation
space. Unlike existing approaches, our model works in both the supervised
setting with labeled data and the unsupervised setting when no such data is
available. The proposed method achieves new state-of-the-art performance in
both settings across several public datasets.
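The second contrastive level described above relies on session prototypes, which can be sketched as the mean embedding of each session's utterances; an utterance is then encouraged to lie closer to its own session's prototype than to others. The embeddings and helper names below are toy illustrations, not the paper's trained model:

```python
import numpy as np

def session_prototypes(embeddings, session_ids):
    """Compute one prototype per session as the mean of that session's
    utterance embeddings."""
    protos = {}
    for sid in set(session_ids):
        idx = [i for i, s in enumerate(session_ids) if s == sid]
        protos[sid] = np.mean([embeddings[i] for i in idx], axis=0)
    return protos

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy sessions with well-separated embeddings.
emb = np.array([
    [1.0, 0.1], [0.9, 0.0],   # session A
    [0.0, 1.0], [0.1, 0.9],   # session B
])
sessions = ["A", "A", "B", "B"]
protos = session_prototypes(emb, sessions)

# Disentanglement then assigns an utterance to the session whose
# prototype it is most similar to.
print(cosine(emb[0], protos["A"]) > cosine(emb[0], protos["B"]))  # → True
```

A contrastive loss over these similarities (utterance-to-utterance at the first level, utterance-to-prototype at the second) is what pulls the representation space into separable sessions.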