Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness
Large-scale dialogue datasets have recently become available for training
neural dialogue agents. However, these datasets have been reported to contain a
non-negligible number of unacceptable utterance pairs. In this paper, we
propose a method for scoring the quality of utterance pairs in terms of their
connectivity and relatedness. The proposed scoring method is designed based on
findings widely shared in the dialogue and linguistics research communities. We
demonstrate that it has a relatively good correlation with the human judgment
of dialogue quality. Furthermore, the method is applied to filter out
potentially unacceptable utterance pairs from a large-scale noisy dialogue
corpus to ensure its quality. We experimentally confirm that training data
filtered by the proposed method improves the quality of neural dialogue agents
in response generation. Comment: 18 pages, Accepted at The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
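As a rough illustration only: the paper's connectivity and relatedness scores are not reproduced here, but the following Python sketch shows how such pair-level scores could be used to filter a noisy corpus. The scoring functions, the min-combination, and the threshold are hypothetical assumptions, not the proposed method.

```python
# Hedged sketch: filtering utterance pairs by two quality scores.
# `connectivity_score` and `relatedness_score` are hypothetical stand-ins
# for the paper's scoring functions; the min-combination and threshold
# are illustrative choices.

from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (context utterance, response utterance)

def filter_pairs(
    pairs: List[Pair],
    connectivity_score: Callable[[str, str], float],
    relatedness_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Pair]:
    """Keep only pairs whose combined quality score reaches the threshold."""
    kept = []
    for context, response in pairs:
        score = min(connectivity_score(context, response),
                    relatedness_score(context, response))
        if score >= threshold:
            kept.append((context, response))
    return kept
```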
Detection of social signals for recognizing engagement in human-robot interaction
Detection of engagement during a conversation is an important function of
human-robot interaction. The level of user engagement can influence the
dialogue strategy of the robot. Our motivation in this work is to detect
several behaviors which will be used as social signal inputs for a real-time
engagement recognition model. These behaviors are nodding, laughter, verbal
backchannels and eye gaze. We describe models of these behaviors which have
been learned from a large corpus of human-robot interactions with the android
robot ERICA. Input data to the models comes from a Kinect sensor and a
microphone array. Using our engagement recognition model, we can achieve
reasonable performance using the inputs from automatic social signal detection,
compared to using manual annotation as input. Comment: AAAI Fall Symposium on Natural Communication for Human-Robot Collaboration, 201
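A minimal sketch of the general idea of feeding detected social signals into an engagement recognizer. The signal names follow the abstract (nodding, laughter, verbal backchannels, eye gaze); the logistic fusion, weights, and input encoding are illustrative assumptions rather than the paper's model.

```python
# Hedged sketch: fusing detected social signals into an engagement estimate.
# The weights and the logistic fusion are hypothetical; a real model would
# be trained on annotated human-robot interaction data.

import math
from dataclasses import dataclass

@dataclass
class SocialSignals:
    nodding: float        # detector confidence in [0, 1]
    laughter: float
    backchannel: float
    gaze_at_robot: float  # fraction of the window spent gazing at the robot

def engagement_probability(s: SocialSignals) -> float:
    """Combine signal detections into a single engagement probability."""
    z = (1.2 * s.nodding + 0.8 * s.laughter + 1.0 * s.backchannel
         + 1.5 * s.gaze_at_robot - 2.0)  # illustrative weights and bias
    return 1.0 / (1.0 + math.exp(-z))
```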
The Design and Implementation of XiaoIce, an Empathetic Social Chatbot
This paper describes the development of Microsoft XiaoIce, the most popular
social chatbot in the world. XiaoIce is uniquely designed as an AI companion
with an emotional connection to satisfy the human need for communication,
affection, and social belonging. We take into account both intelligence quotient (IQ) and emotional quotient (EQ) in the system design, cast human-machine social
chat as decision-making over Markov Decision Processes (MDPs), and optimize
XiaoIce for long-term user engagement, measured in expected Conversation-turns
Per Session (CPS). We detail the system architecture and key components
including dialogue manager, core chat, skills, and an empathetic computing
module. We show how XiaoIce dynamically recognizes human feelings and states,
understands user intent, and responds to user needs throughout long
conversations. Since her launch in 2014, XiaoIce has communicated with over 660
million active users and succeeded in establishing long-term relationships with
many of them. Analysis of large-scale online logs shows that XiaoIce has achieved an average CPS of 23, which is significantly higher than that of other chatbots and even of human conversations.
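Since CPS is the expected number of conversation-turns per session, a minimal sketch of computing the metric from session logs is shown below; the assumption that logs reduce to a list of per-session turn counts is illustrative.

```python
# Hedged sketch: average Conversation-turns Per Session (CPS) from logs,
# assuming each session has already been reduced to its turn count.

from typing import List

def average_cps(turns_per_session: List[int]) -> float:
    """Average number of conversation turns across sessions."""
    if not turns_per_session:
        return 0.0
    return sum(turns_per_session) / len(turns_per_session)

# Illustrative toy example: three sessions with 20, 25, and 24 turns -> 23.0
print(average_cps([20, 25, 24]))
```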
SLAM-Inspired Simultaneous Contextualization and Interpreting for Incremental Conversation Sentences
Distributed representations of words have improved performance on many natural language tasks. In many methods, however, only one meaning is considered per word label, and the multiple context-dependent meanings of polysemous words are rarely handled. Although previous studies have dealt with polysemous words, they determine the meanings of such words from a batch of large documents. Hence, there are two problems with applying these
methods to sequential sentences, as in a conversation that contains ambiguous
expressions. The first problem is that the methods cannot sequentially deal
with the interdependence between context and word interpretation, in which
context is decided by word interpretations and the word interpretations are
decided by the context. Context estimation must thus be performed in parallel
to pursue multiple interpretations. The second problem is that the previous
methods use large-scale sets of sentences for offline learning of new
interpretations, and the steps of learning and inference are clearly separated.
Such methods using offline learning cannot obtain new interpretations during a
conversation. Hence, to dynamically estimate the conversation context and
interpretations of polysemous words in sequential sentences, we propose a
method of Simultaneous Contextualization And INterpreting (SCAIN) based on the
traditional Simultaneous Localization And Mapping (SLAM) algorithm. By using
the SCAIN algorithm, we can sequentially optimize the interdependence between
context and word interpretation while obtaining new interpretations online. For
experimental evaluation, we created two datasets: one from Wikipedia's
disambiguation pages and the other from real conversations. For both datasets,
the results confirmed that SCAIN could effectively achieve sequential
optimization of the interdependence and acquisition of new interpretations.
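SCAIN itself is not reproduced here; the following Python sketch only illustrates the interdependence the abstract describes, with the context updated from the current word interpretations and the interpretations re-estimated from the updated context, one sentence at a time. The vector representations, similarity measure, and update rule are illustrative assumptions.

```python
# Hedged sketch of the context/interpretation interdependence, processed
# sequentially. All representations and the mixing rate are assumptions.

from typing import Dict, List
import numpy as np

def process_sentence(
    context: np.ndarray,
    words: List[str],
    sense_vectors: Dict[str, List[np.ndarray]],
) -> np.ndarray:
    """Pick the sense of each word closest to the context, then fold the
    chosen senses back into the context estimate."""
    chosen = []
    for w in words:
        senses = sense_vectors.get(w)
        if not senses:
            continue
        # Interpretation decided by context: choose the most similar sense.
        best = max(senses, key=lambda v: float(np.dot(context, v)))
        chosen.append(best)
    if chosen:
        # Context decided by interpretations: move toward the chosen senses.
        context = 0.8 * context + 0.2 * np.mean(chosen, axis=0)
    return context
```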
Another Diversity-Promoting Objective Function for Neural Dialogue Generation
Although generation-based dialogue systems have been widely researched, the responses generated by most existing systems have very low diversity. The most likely reason for this problem is Maximum Likelihood Estimation (MLE) with Softmax Cross-Entropy (SCE) loss. MLE trains models to generate the most frequent responses from an enormous set of generation candidates, even though actual dialogues admit various responses depending on the context. In this paper, we
propose a new objective function called Inverse Token Frequency (ITF) loss,
which scales the loss down for frequent token classes and up for rare token classes. This function encourages the model to generate
rare tokens rather than frequent tokens. It does not complicate the model and
its training is stable because we only replace the objective function. On the
OpenSubtitles dialogue dataset, our loss model establishes a state-of-the-art DIST-1 (unigram diversity) score of 7.56 while maintaining a good BLEU-1 score. On a Japanese Twitter replies dataset, our loss model achieves a DIST-1 score comparable to the ground truth. Comment: AAAI 2019 Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL 2019).
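A minimal PyTorch sketch of an inverse-token-frequency style objective, assuming per-class weights proportional to the inverse of each token's corpus frequency; the exact scaling used in the paper may differ. Consistent with the abstract, only the loss function changes while the model stays the same.

```python
# Hedged sketch: cross-entropy with per-token-class weights proportional
# to inverse corpus frequency, so rare tokens contribute larger loss.
# The normalization and smoothing constant are illustrative assumptions.

import torch
import torch.nn.functional as F

def itf_weights(token_counts: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Per-class weights ~ 1 / frequency, normalized to mean 1."""
    weights = 1.0 / (token_counts.float() + eps)
    return weights / weights.mean()

def itf_loss(logits: torch.Tensor, targets: torch.Tensor,
             token_counts: torch.Tensor) -> torch.Tensor:
    """Softmax cross-entropy where each target class is scaled by its
    inverse-frequency weight."""
    return F.cross_entropy(logits, targets, weight=itf_weights(token_counts))
```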
JESC: Japanese-English Subtitle Corpus
In this paper we describe the Japanese-English Subtitle Corpus (JESC). JESC
is a large Japanese-English parallel corpus covering the underrepresented
domain of conversational dialogue. It consists of more than 3.2 million
examples, making it the largest freely available dataset of its kind. The
corpus was assembled by crawling and aligning subtitles found on the web. The
assembly process incorporates a number of novel preprocessing elements to
ensure high monolingual fluency and accurate bilingual alignments. We summarize
its contents and evaluate its quality using human experts and baseline machine
translation (MT) systems. Comment: To appear at LREC 2018. Project website update
Leveraging Weakly Supervised Data to Improve End-to-End Speech-to-Text Translation
End-to-end Speech Translation (ST) models have many potential advantages when
compared to the cascade of Automatic Speech Recognition (ASR) and text Machine
Translation (MT) models, including lowered inference latency and the avoidance
of error compounding. However, the quality of end-to-end ST is often limited by
a paucity of training data, since it is difficult to collect large parallel
corpora of speech and translated transcript pairs. Previous studies have
proposed the use of pre-trained components and multi-task learning in order to
benefit from weakly supervised training data, such as speech-to-transcript or
text-to-foreign-text pairs. In this paper, we demonstrate that using
pre-trained MT or text-to-speech (TTS) synthesis models to convert weakly
supervised data into speech-to-translation pairs for ST training can be more
effective than multi-task learning. Furthermore, we demonstrate that a high
quality end-to-end ST model can be trained using only weakly supervised
datasets, and that synthetic data sourced from unlabeled monolingual text or
speech can be used to improve performance. Finally, we discuss methods for
avoiding overfitting to synthetic speech with a quantitative ablation study. Comment: ICASSP 201
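A hedged sketch of the data-conversion idea described above: weakly supervised pairs are turned into synthetic speech-to-translation pairs using pre-trained MT or TTS components. The mt_model and tts_model interfaces below are hypothetical placeholders; only the conversion logic is illustrated.

```python
# Hedged sketch: building synthetic speech-to-translation training pairs
# from weakly supervised data, using hypothetical pre-trained components.

from typing import Iterable, List, Tuple

def from_asr_pairs(asr_pairs: Iterable[Tuple[bytes, str]],
                   mt_model) -> List[Tuple[bytes, str]]:
    """speech-to-transcript pairs + pre-trained MT -> speech-to-translation."""
    return [(audio, mt_model.translate(transcript))
            for audio, transcript in asr_pairs]

def from_mt_pairs(mt_pairs: Iterable[Tuple[str, str]],
                  tts_model) -> List[Tuple[bytes, str]]:
    """text-to-foreign-text pairs + pre-trained TTS -> speech-to-translation."""
    return [(tts_model.synthesize(source_text), translation)
            for source_text, translation in mt_pairs]
```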
ESPnet: End-to-End Speech Processing Toolkit
This paper introduces a new open source platform for end-to-end speech
processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech
recognition (ASR), and adopts the widely used dynamic neural network toolkits Chainer and PyTorch as its main deep learning engines. ESPnet also follows the
Kaldi ASR toolkit style for data processing, feature extraction/format, and
recipes to provide a complete setup for speech recognition and other speech
processing experiments. This paper explains the major architecture of the software platform, several important functionalities that differentiate ESPnet from other open-source ASR toolkits, and experimental results on major ASR benchmarks.
Towards automatic addressee identification in multi-party dialogues
This paper concerns the issue of addressing in multi-party dialogues. Analysis of addressing behavior in face-to-face meetings results in the identification of several addressing mechanisms. From these we extract several utterance features and features of the non-verbal communicative behavior of a speaker, such as gaze and gesturing, that are relevant for observers to identify the participants the speaker is talking to. A method for the automatic prediction of the addressee of speech acts is discussed.
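A purely illustrative sketch of addressee prediction from features of the kind the abstract mentions: the toy feature encoding (gaze shares, a pointing-gesture flag, a second-person-pronoun flag), the training rows, and the decision-tree classifier are assumptions, not the paper's method or data.

```python
# Hedged sketch: classifying the addressee of an utterance from speaker
# gaze, gesture, and utterance features. Toy data for illustration only.

from sklearn.tree import DecisionTreeClassifier

# Each row: [gaze_share_at_A, gaze_share_at_B, pointing_gesture,
#            second_person_pronoun_in_utterance]
X_train = [
    [0.7, 0.1, 1, 1],  # mostly looking at A, pointing, says "you" -> A
    [0.1, 0.8, 0, 1],  # mostly looking at B -> B
    [0.3, 0.3, 0, 0],  # no clear individual cue -> whole group
]
y_train = ["A", "B", "group"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[0.6, 0.2, 1, 1]]))  # likely "A"
```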
Towards End-to-end Automatic Code-Switching Speech Recognition
Speech recognition in mixed language has difficulty adopting the end-to-end framework due to the lack of data and to overlapping phone sets, for example in words such as "one" in English and "wàn" in Chinese. We propose a CTC-based
end-to-end automatic speech recognition model for intra-sentential
English-Mandarin code-switching. The model is trained jointly on monolingual datasets and fine-tuned on the mixed-language corpus. During decoding, we apply beam search and combine the CTC predictions with a language model score. The proposed method is effective in leveraging monolingual corpora and detecting language transitions, and it improves the CER by 5%. Comment: Submitted to ICASSP 201
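A minimal sketch of the decoding-time combination described above: hypotheses are rescored with a weighted sum of the CTC score and a language model score. The interpolation weight and the hypothesis/score interfaces are assumptions.

```python
# Hedged sketch: combining CTC and LM scores for beam-search hypotheses.
# The lm_logprob callable and lm_weight are hypothetical placeholders.

from typing import List, Tuple

def rescore_beam(
    hypotheses: List[Tuple[List[int], float]],  # (token ids, CTC log-prob)
    lm_logprob,                                 # callable: token ids -> log-prob
    lm_weight: float = 0.3,
    beam_size: int = 10,
) -> List[Tuple[List[int], float]]:
    """Combine CTC and LM scores, then keep the best `beam_size` hypotheses."""
    scored = [(tokens, ctc + lm_weight * lm_logprob(tokens))
              for tokens, ctc in hypotheses]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:beam_size]
```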