958 research outputs found

    Response type selection for chat-like spoken dialog systems based on LSTM and multi-task learning

    We propose a method for automatically selecting appropriate responses in conversational spoken dialog systems by first explicitly determining the required response type, based on a comparison of the user's input utterance with many other utterances. Response utterances are then generated based on this response type designation (back-channel, changing the topic, expanding the topic, etc.). This allows the generation of more appropriate responses than conventional end-to-end approaches, which use only the user's input to directly generate response utterances. As a response type selector, we propose an LSTM-based encoder–decoder framework utilizing acoustic and linguistic features extracted from input utterances. To extract these features more accurately, we utilize not only the input utterances but also the response utterances in the training corpus; to do so, multi-task learning using multiple decoders is also investigated. To evaluate our proposed method, we conducted experiments using a corpus of dialogs between elderly people and an interviewer. Our proposed method outperformed conventional methods using either a point-wise classifier based on Support Vector Machines or a single-task learning LSTM. The best performance was achieved when our two response type selectors (one trained using acoustic features, the other trained using linguistic features) were combined and multi-task learning was also performed.
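
    The multi-task setup described above can be sketched in a few lines of PyTorch. The snippet below is only an illustration, not the authors' code: the feature dimension, the four response types, and the auxiliary decoder target are all assumptions. A shared LSTM encodes the input-utterance features, one head predicts the response type, and an auxiliary decoder learns to predict response-side features so the shared encoder also benefits from the response utterances in the corpus.

```python
import torch
import torch.nn as nn

class ResponseTypeSelector(nn.Module):
    """Shared LSTM encoder with two heads (multi-task learning sketch)."""

    def __init__(self, feat_dim=40, hidden=128, n_types=4):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.type_head = nn.Linear(hidden, n_types)                     # main task
        self.aux_decoder = nn.LSTM(hidden, feat_dim, batch_first=True)  # auxiliary task

    def forward(self, x):                          # x: (batch, frames, feat_dim)
        out, (h, _) = self.encoder(x)
        type_logits = self.type_head(h[-1])        # classify from the final state
        response_feats, _ = self.aux_decoder(out)  # predict response-side features
        return type_logits, response_feats

model = ResponseTypeSelector()
x = torch.randn(8, 100, 40)                        # dummy feature sequences
logits, resp = model(x)
# Joint loss: response-type classification plus the auxiliary prediction
# (dummy targets here; in training, features of the actual response utterance).
loss = nn.functional.cross_entropy(logits, torch.randint(0, 4, (8,))) \
       + 0.5 * nn.functional.mse_loss(resp, torch.randn_like(resp))
loss.backward()
```

    The paper's best system combines an acoustic-feature selector with a linguistic-feature one; under this sketch that would amount to training two such models and combining their logits.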

    Ambiguity Resolution in Spoken Language Understanding

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: Nam Soo Kim.
    Ambiguity in language is inevitable. This is because, although language is a means of communication, a concept that one person has in mind can never be conveyed to another in a perfectly identical form. While this is an unavoidable property of language, ambiguity in language understanding often leads to the breakdown or failure of communication. There are various levels of language ambiguity; however, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to recognize which ambiguities can be well defined and resolved, and then to draw the boundaries between the ambiguous readings. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve them. Although this phenomenon occurs in various languages, its degree and aspect often differ by language. We focus on cases where the ambiguity comes from the gap between the amount of information carried by spoken language and by written text. Specifically, we study Korean, in which sentence form and intention frequently depend on prosody. In Korean, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, and similar phenomena. Noting that such utterances can confuse intention understanding, we first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences. In constructing the corpus, we consider the directivity and rhetoricalness of a sentence; these form the criterion for classifying the intention of a spoken utterance as a statement, question, command, rhetorical question, or rhetorical command. Using a spoken-language corpus annotated with sufficiently high inter-annotator agreement (kappa = 0.85), we show which strategies and language models are effective in detecting ambiguous text when no audio is given, and qualitatively analyze the characteristics of the task. We do not handle ambiguity only at the text level: to find out whether actual disambiguation is possible given a speech input, we design an artificial spoken-language corpus composed only of textually ambiguous utterances, and resolve the ambiguity with various attention-based neural network architectures.
    In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys its attention information to the text module in a multi-hop manner. Finally, assuming that the ambiguity of intention understanding is resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized in industry or research. By integrating a text-based ambiguity detection module with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with a dialogue manager to build a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we aim to show that ambiguity resolution for intention understanding in a prosody-sensitive language is feasible and can be exploited in both industry and research. We hope that this study helps tackle chronic ambiguity issues in other languages and domains, linking linguistic science and engineering approaches; to that end, we share the resources, results, and code used in this research.
    Contents:
        1 Introduction
            1.1 Motivation
            1.2 Research Goal
            1.3 Outline of the Dissertation
        2 Related Work
            2.1 Spoken Language Understanding
            2.2 Speech Act and Intention
                2.2.1 Performatives and statements
                2.2.2 Illocutionary act and speech act
                2.2.3 Formal semantic approaches
            2.3 Ambiguity of Intention Understanding in Korean
                2.3.1 Ambiguities in language
                2.3.2 Speech act and intention understanding in Korean
        3 Ambiguity in Intention Understanding of Spoken Language
            3.1 Intention Understanding and Ambiguity
            3.2 Annotation Protocol
                3.2.1 Fragments
                3.2.2 Clear-cut cases
                3.2.3 Intonation-dependent utterances
            3.3 Data Construction
                3.3.1 Source scripts
                3.3.2 Agreement
                3.3.3 Augmentation
                3.3.4 Train split
            3.4 Experiments and Results
                3.4.1 Models
                3.4.2 Implementation
                3.4.3 Results
            3.5 Findings and Summary
                3.5.1 Findings
                3.5.2 Summary
        4 Disambiguation of Speech Intention
            4.1 Ambiguity Resolution
                4.1.1 Prosody and syntax
                4.1.2 Disambiguation with prosody
                4.1.3 Approaches in SLU
            4.2 Dataset Construction
                4.2.1 Script generation
                4.2.2 Label tagging
                4.2.3 Recording
            4.3 Experiments and Results
                4.3.1 Models
                4.3.2 Results
            4.4 Summary
        5 System Integration and Application
            5.1 System Integration for Intention Identification
                5.1.1 Proof of concept
                5.1.2 Preliminary study
            5.2 Application to Spoken Dialogue System
                5.2.1 What is 'Free-running'
                5.2.2 Omakase chatbot
            5.3 Beyond Monolingual Approaches
                5.3.1 Spoken language translation
                5.3.2 Dataset
                5.3.3 Analysis
                5.3.4 Discussion
            5.4 Summary
        6 Conclusion and Future Work
        Bibliography
        Abstract (In Korean)
        Acknowledgment
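
    As a rough illustration of the attention-based disambiguation the abstract describes, the sketch below co-attends text and audio, with the audio side passing its attention summary to the text side over several hops. Every concrete detail (dimensions, the two hops, the five intention labels) is an assumption for the demo, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class MultiHopCoAttention(nn.Module):
    """Speech-text co-attention sketch for intention disambiguation.

    Assumed labels: statement, question, command, rhetorical question,
    rhetorical command. The audio side repeatedly (multi-hop) refines a
    query that attends over the text representation.
    """

    def __init__(self, txt_dim=128, aud_dim=64, hidden=128, n_hops=2, n_labels=5):
        super().__init__()
        self.txt_enc = nn.GRU(txt_dim, hidden, batch_first=True)
        self.aud_enc = nn.GRU(aud_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.n_hops = n_hops
        self.cls = nn.Linear(hidden, n_labels)

    def forward(self, txt, aud):
        t, _ = self.txt_enc(txt)               # (B, T_txt, H) text states
        _, a_last = self.aud_enc(aud)          # (1, B, H) final audio state
        query = a_last.transpose(0, 1)         # (B, 1, H) audio summary
        for _ in range(self.n_hops):           # multi-hop: refine the query
            query, _ = self.attn(query, t, t)  # audio-conditioned text summary
        return self.cls(query.squeeze(1))      # (B, n_labels) intention logits

model = MultiHopCoAttention()
logits = model(torch.randn(4, 20, 128), torch.randn(4, 300, 64))
```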

    Modeling the user state for context-aware spoken interaction in ambient assisted living

    Ambient Assisted Living (AAL) systems must provide adapted services that are easily accessible by a wide variety of users. This is only possible if the communication between the user and the system is carried out through an interface that is simple, rapid, effective, and robust. Natural language interfaces such as dialog systems fulfill these requirements, as they are based on a spoken conversation that resembles human communication. In this paper, we enhance systems interacting in AAL domains by incorporating context-aware conversational agents that consider the external context of the interaction and predict the user's state. The user's state is built from their emotional state and intention, and is recognized by a module conceived as an intermediate phase between natural language understanding and dialog management in the architecture of the conversational agent. This prediction, carried out for each user turn in the dialog, makes it possible to adapt the system dynamically to the user's needs. We have evaluated our proposal by developing a context-aware system adapted to patients suffering from chronic pulmonary diseases, and provide a detailed discussion of the positive influence of our proposal on the success of the interaction, the information and services provided, and the perceived quality. This work was supported in part by Projects MINECO TEC2012-37832-C02-01, CICYT TEC2011-28626-C02-02, and CAM CONTEXTS (S2009/TIC-1485).
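
    A minimal sketch of where such a user-state module could sit, assuming a simple interface between the NLU output and the dialog manager. The labels, threshold, and feature names below are invented for illustration; the paper's actual predictors are trained statistical models, not these placeholder rules.

```python
from dataclasses import dataclass

@dataclass
class UserState:
    """Joint state passed, once per user turn, from understanding to dialog management."""
    intention: str   # communicative goal of the current turn
    emotion: str     # coarse emotional label

def predict_user_state(semantic_frame: dict, prosody: dict, history: list) -> UserState:
    # Placeholder logic standing in for trained predictors: intention from
    # the NLU frame plus dialog history, emotion from prosodic cues.
    intention = semantic_frame.get("act", "unknown")
    if history and history[-1] == intention:
        intention = "repeat_" + intention          # user is re-asking
    emotion = "stressed" if prosody.get("pitch_var", 0.0) > 1.5 else "neutral"
    return UserState(intention, emotion)

state = predict_user_state({"act": "ask_symptom_info"}, {"pitch_var": 2.1}, [])
print(state)   # UserState(intention='ask_symptom_info', emotion='stressed')
```

    The dialog manager can then condition its next action on the full state rather than on the semantic frame alone, which is what allows the per-turn adaptation the abstract describes.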

    Proceedings of the LREC 2018 Special Speech Sessions

    LREC 2018 Special Speech Sessions "Speech Resources Collection in Real-World Situations"; Phoenix Seagaia Conference Center, Miyazaki; May 2018

    Mirroring to Build Trust in Digital Assistants

    We describe experiments towards building a conversational digital assistant that considers the preferred conversational style of the user. In particular, these experiments are designed to measure whether users prefer and trust an assistant whose conversational style matches their own. To this end, we conducted a user study in which subjects interacted with a digital assistant that responded in a way that either matched their conversational style or did not. Using self-reported personality attributes and subjects' feedback on the interactions, we built models that can reliably predict a user's preferred conversational style.
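
    The modeling step mentioned last (predicting preferred conversational style from self-reported personality attributes and interaction feedback) could be approached with a simple classifier, as in the hedged sketch below; the features, labels, and data are synthetic stand-ins, not the study's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Rows: users. Columns: illustrative personality scores (e.g. Big Five)
# plus a simple measure of the user's own conversational style.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = rng.integers(0, 2, 200)   # 1 = preferred the style-matched assistant

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())  # held-out accuracy
```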