
    Extracting Information from Spoken User Input: A Machine Learning Approach

    We propose a module that performs automatic analysis of user input in spoken dialogue systems using machine learning algorithms. The input to the module is material received from the speech recogniser and the dialogue manager of the spoken dialogue system; the output is a four-level pragmatic-semantic representation of the user utterance. Our investigation shows that when the four interpretation levels are combined into a single, complex machine learning task, the performance of the module is significantly better than the score of an informed baseline strategy. Moreover, a systematic, automated search for the optimal subtask combinations yields substantial further improvement for both classifiers on all four interpretation subtasks. A case study is conducted on dialogues between an automated, experimental system that provides information over the phone about train connections in the Netherlands and its users, who speak Dutch. We find that by drawing on unsophisticated, potentially noisy features that characterise the dialogue situation, and by automatically optimising the formulated machine learning task, it is possible to extract sophisticated information of practical pragmatic-semantic value from spoken user input with robust performance. This means that our module can interpret, with a good score, whether the user of the system is giving slot-filling information and for which query slots (e.g., departure station, departure time), whether the user gave a positive or a negative answer to the system, and whether the user signals that there are problems in the interaction.
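
    The abstract describes a systematic, automated search over subtask combinations, where several interpretation levels are fused into one complex learning task. The following is a minimal sketch of that idea, not the authors' code: the feature matrix, the label names, and the single decision-tree learner (standing in for the paper's two unnamed classifiers) are illustrative assumptions. A chosen subset of interpretation levels is fused by concatenating its label columns into one joint target.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical names for the four interpretation levels.
SUBTASKS = ["slot_filling", "query_slot", "yes_no_answer", "problem_signal"]

def joint_score(X, labels, subset):
    # Fuse the chosen interpretation levels into one complex target by
    # concatenating their label columns, then cross-validate one classifier.
    y = np.array(["+".join(str(labels[task][i]) for task in subset)
                  for i in range(len(X))])
    return cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean()

def best_combination(X, labels):
    # Systematic search over every non-empty combination of subtasks.
    candidates = [c for r in range(1, len(SUBTASKS) + 1)
                  for c in combinations(SUBTASKS, r)]
    return max(candidates, key=lambda c: joint_score(X, labels, c))
```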

    Combining heterogeneous inputs for the development of adaptive and multimodal interaction systems

    In this paper we present a novel framework for the integration of visual sensor networks and speech-based interfaces. Our proposal follows the standard reference architecture in fusion systems (JDL) and combines different techniques related to Artificial Intelligence, Natural Language Processing, and User Modeling to provide enhanced interaction with users. Firstly, the framework integrates a Cooperative Surveillance Multi-Agent System (CS-MAS), which includes several types of autonomous agents working in a coalition to track and make inferences on the positions of the targets. Secondly, enhanced conversational agents facilitate human-computer interaction by means of speech. Thirdly, a statistical methodology allows modeling the user's conversational behavior, which is learned from an initial corpus and improved with the knowledge acquired from successive interactions. A technique is proposed to facilitate the multimodal fusion of these information sources and to consider the result in the decision of the next system action. This work was supported in part by Projects MEyC TEC2012-37832-C02-01, CICYT TEC2011-28626-C02-02, and CAM CONTEXTS S2009/TIC-1485.
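
    The abstract does not specify the fusion technique, so the following is a hedged illustration only of one common way such late fusion can be realized: each modality contributes a posterior over dialogue states, the posteriors are mixed with fixed weights, and the next system action maximizes the fused estimate. The weights, state names, and policy table below are assumptions, not details from the paper.

```python
def fuse_posteriors(visual, speech, w_visual=0.4, w_speech=0.6):
    # Weighted additive fusion of two per-modality posteriors over dialogue
    # states; only states scored by both modalities are considered.
    states = visual.keys() & speech.keys()
    fused = {s: w_visual * visual[s] + w_speech * speech[s] for s in states}
    z = sum(fused.values()) or 1.0
    return {s: p / z for s, p in fused.items()}

def next_action(visual, speech, policy):
    # Decide the next system action from the fused state estimate.
    fused = fuse_posteriors(visual, speech)
    state = max(fused, key=fused.get)
    return policy[state]

# Hypothetical usage:
# next_action({"user_at_door": 0.8, "user_away": 0.2},
#             {"user_at_door": 0.6, "user_away": 0.4},
#             {"user_at_door": "greet", "user_away": "wait"})
```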

    Fillers in Spoken Language Understanding: Computational and Psycholinguistic Perspectives

    Disfluencies (i.e. interruptions in the regular flow of speech) are ubiquitous in spoken discourse. Fillers ("uh", "um") are the most frequent kind of disfluency. Yet, to the best of our knowledge, there is no resource that brings together the research perspectives influencing Spoken Language Understanding (SLU) on these speech events. The aim of this article is to synthesise a breadth of perspectives in a holistic way: from the underlying (psycho)linguistic theory, to their annotation and consideration in Automatic Speech Recognition (ASR) and SLU systems, to, lastly, their study from a generation standpoint. This article aims to present these perspectives in an approachable way to the SLU and Conversational AI community, and to discuss, moving forward, what we believe are the trends and challenges in each area. Comment: To appear in TAL Journal.

    On the dynamic adaptation of language models based on dialogue information

    We present an approach to dynamically adapt the language models (LMs) used by a speech recognizer that is part of a spoken dialogue system. We have developed a grammar generation strategy that automatically adapts the LMs using the semantic information that the user provides (represented as dialogue concepts), together with information regarding the intentions of the speaker (inferred by the dialogue manager and represented as dialogue goals). We carry out the adaptation as a linear interpolation between a background LM and one or more of the LMs associated with the dialogue elements (concepts or goals) addressed by the user. The interpolation weights between those models are automatically estimated on each dialogue turn, using measures such as the posterior probabilities of concepts and goals, estimated as part of the inference procedure that determines the actions to be carried out. We propose two approaches to handle the LMs related to concepts and goals: in the first, we estimate an LM for each of them; in the second, we apply several clustering strategies to group together those elements that share common properties, and estimate an LM for each cluster. Our evaluation shows how the system can estimate a dynamic model adapted to each dialogue turn, which improves the performance of speech recognition (up to 14.82% relative improvement) and leads, in turn, to improvements in both the language understanding and dialogue management tasks.
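
    A minimal sketch of the turn-level interpolation described above: the background LM is mixed with the LMs of the concepts and goals addressed in the current turn, with weights derived from their posterior probabilities. The `prob(word, history)` interface and the renormalization scheme are our assumptions; the paper specifies only that the weights are estimated from such posterior measures.

```python
def interpolated_prob(word, history, background_lm, element_lms, posteriors):
    """P(w|h) = lambda_bg * P_bg(w|h) + sum_i lambda_i * P_i(w|h).

    `posteriors` maps each concept/goal addressed in the current turn to its
    posterior probability, as estimated by the dialogue manager's inference.
    """
    total = sum(posteriors.values())
    lambda_bg = max(0.0, 1.0 - total)   # background weight (our assumption)
    z = lambda_bg + total               # renormalize so the weights sum to 1
    p = (lambda_bg / z) * background_lm.prob(word, history)
    for element, posterior in posteriors.items():
        p += (posterior / z) * element_lms[element].prob(word, history)
    return p

# Hypothetical usage, with a dominant "departure_city" concept on this turn:
# interpolated_prob("madrid", context, bg_lm, lms, {"departure_city": 0.7})
```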

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    ์Œ์„ฑ์–ธ์–ด ์ดํ•ด์—์„œ์˜ ์ค‘์˜์„ฑ ํ•ด์†Œ

    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2022. 8. ๊น€๋‚จ์ˆ˜.์–ธ์–ด์˜ ์ค‘์˜์„ฑ์€ ํ•„์—ฐ์ ์ด๋‹ค. ๊ทธ๊ฒƒ์€ ์–ธ์–ด๊ฐ€ ์˜์‚ฌ ์†Œํ†ต์˜ ์ˆ˜๋‹จ์ด์ง€๋งŒ, ๋ชจ๋“  ์‚ฌ๋žŒ์ด ์ƒ๊ฐํ•˜๋Š” ์–ด๋–ค ๊ฐœ๋…์ด ์™„๋ฒฝํžˆ ๋™์ผํ•˜๊ฒŒ ์ „๋‹ฌ๋  ์ˆ˜ ์—†๋Š” ๊ฒƒ์— ๊ธฐ์ธํ•œ๋‹ค. ์ด๋Š” ํ•„์—ฐ์ ์ธ ์š”์†Œ์ด๊ธฐ๋„ ํ•˜์ง€๋งŒ, ์–ธ์–ด ์ดํ•ด์—์„œ ์ค‘์˜์„ฑ์€ ์ข…์ข… ์˜์‚ฌ ์†Œํ†ต์˜ ๋‹จ์ ˆ์ด๋‚˜ ์‹คํŒจ๋ฅผ ๊ฐ€์ ธ์˜ค๊ธฐ๋„ ํ•œ๋‹ค. ์–ธ์–ด์˜ ์ค‘์˜์„ฑ์—๋Š” ๋‹ค์–‘ํ•œ ์ธต์œ„๊ฐ€ ์กด์žฌํ•œ๋‹ค. ํ•˜์ง€๋งŒ, ๋ชจ๋“  ์ƒํ™ฉ์—์„œ ์ค‘์˜์„ฑ์ด ํ•ด์†Œ๋  ํ•„์š”๋Š” ์—†๋‹ค. ํƒœ์Šคํฌ๋งˆ๋‹ค, ๋„๋ฉ”์ธ๋งˆ๋‹ค ๋‹ค๋ฅธ ์–‘์ƒ์˜ ์ค‘์˜์„ฑ์ด ์กด์žฌํ•˜๋ฉฐ, ์ด๋ฅผ ์ž˜ ์ •์˜ํ•˜๊ณ  ํ•ด์†Œ๋  ์ˆ˜ ์žˆ๋Š” ์ค‘์˜์„ฑ์ž„์„ ํŒŒ์•…ํ•œ ํ›„ ์ค‘์˜์ ์ธ ๋ถ€๋ถ„ ๊ฐ„์˜ ๊ฒฝ๊ณ„๋ฅผ ์ž˜ ์ •ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ๋ณธ๊ณ ์—์„œ๋Š” ์Œ์„ฑ ์–ธ์–ด ์ฒ˜๋ฆฌ, ํŠนํžˆ ์˜๋„ ์ดํ•ด์— ์žˆ์–ด ์–ด๋–ค ์–‘์ƒ์˜ ์ค‘์˜์„ฑ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์•Œ์•„๋ณด๊ณ , ์ด๋ฅผ ํ•ด์†Œํ•˜๊ธฐ ์œ„ํ•œ ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. ์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ ๋‹ค์–‘ํ•œ ์–ธ์–ด์—์„œ ๋ฐœ์ƒํ•˜์ง€๋งŒ, ๊ทธ ์ •๋„ ๋ฐ ์–‘์ƒ์€ ์–ธ์–ด์— ๋”ฐ๋ผ์„œ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. ์šฐ๋ฆฌ์˜ ์—ฐ๊ตฌ์—์„œ ์ฃผ๋ชฉํ•˜๋Š” ๋ถ€๋ถ„์€, ์Œ์„ฑ ์–ธ์–ด์— ๋‹ด๊ธด ์ •๋ณด๋Ÿ‰๊ณผ ๋ฌธ์ž ์–ธ์–ด์˜ ์ •๋ณด๋Ÿ‰ ์ฐจ์ด๋กœ ์ธํ•ด ์ค‘์˜์„ฑ์ด ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์šฐ๋“ค์ด๋‹ค. ๋ณธ ์—ฐ๊ตฌ๋Š” ์šด์œจ(prosody)์— ๋”ฐ๋ผ ๋ฌธ์žฅ ํ˜•์‹ ๋ฐ ์˜๋„๊ฐ€ ๋‹ค๋ฅด๊ฒŒ ํ‘œํ˜„๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์€ ํ•œ๊ตญ์–ด๋ฅผ ๋Œ€์ƒ์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค. ํ•œ๊ตญ์–ด์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๊ธฐ๋Šฅ์ด ์žˆ๋Š”(multi-functionalํ•œ) ์ข…๊ฒฐ์–ด๋ฏธ(sentence ender), ๋นˆ๋ฒˆํ•œ ํƒˆ๋ฝ ํ˜„์ƒ(pro-drop), ์˜๋ฌธ์‚ฌ ๊ฐ„์„ญ(wh-intervention) ๋“ฑ์œผ๋กœ ์ธํ•ด, ๊ฐ™์€ ํ…์ŠคํŠธ๊ฐ€ ์—ฌ๋Ÿฌ ์˜๋„๋กœ ์ฝํžˆ๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๊ณค ํ•œ๋‹ค. ์ด๊ฒƒ์ด ์˜๋„ ์ดํ•ด์— ํ˜ผ์„ ์„ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋‹ค๋Š” ๋ฐ์— ์ฐฉ์•ˆํ•˜์—ฌ, ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ์ค‘์˜์„ฑ์„ ๋จผ์ € ์ •์˜ํ•˜๊ณ , ์ค‘์˜์ ์ธ ๋ฌธ์žฅ๋“ค์„ ๊ฐ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ง๋ญ‰์น˜๋ฅผ ๊ตฌ์ถ•ํ•œ๋‹ค. ์˜๋„ ์ดํ•ด๋ฅผ ์œ„ํ•œ ๋ง๋ญ‰์น˜๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ๊ณผ์ •์—์„œ ๋ฌธ์žฅ์˜ ์ง€ํ–ฅ์„ฑ(directivity)๊ณผ ์ˆ˜์‚ฌ์„ฑ(rhetoricalness)์ด ๊ณ ๋ ค๋œ๋‹ค. ์ด๊ฒƒ์€ ์Œ์„ฑ ์–ธ์–ด์˜ ์˜๋„๋ฅผ ์„œ์ˆ , ์งˆ๋ฌธ, ๋ช…๋ น, ์ˆ˜์‚ฌ์˜๋ฌธ๋ฌธ, ๊ทธ๋ฆฌ๊ณ  ์ˆ˜์‚ฌ๋ช…๋ น๋ฌธ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜๊ฒŒ ํ•˜๋Š” ๊ธฐ์ค€์ด ๋œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๊ธฐ๋ก๋œ ์Œ์„ฑ ์–ธ์–ด(spoken language)๋ฅผ ์ถฉ๋ถ„ํžˆ ๋†’์€ ์ผ์น˜๋„(kappa = 0.85)๋กœ ์ฃผ์„ํ•œ ๋ง๋ญ‰์น˜๋ฅผ ์ด์šฉํ•ด, ์Œ์„ฑ์ด ์ฃผ์–ด์ง€์ง€ ์•Š์€ ์ƒํ™ฉ์—์„œ ์ค‘์˜์ ์ธ ํ…์ŠคํŠธ๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ์— ์–ด๋–ค ์ „๋žต ํ˜น์€ ์–ธ์–ด ๋ชจ๋ธ์ด ํšจ๊ณผ์ ์ธ๊ฐ€๋ฅผ ๋ณด์ด๊ณ , ํ•ด๋‹น ํƒœ์Šคํฌ์˜ ํŠน์ง•์„ ์ •์„ฑ์ ์œผ๋กœ ๋ถ„์„ํ•œ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” ํ…์ŠคํŠธ ์ธต์œ„์—์„œ๋งŒ ์ค‘์˜์„ฑ์— ์ ‘๊ทผํ•˜์ง€ ์•Š๊ณ , ์‹ค์ œ๋กœ ์Œ์„ฑ์ด ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ์ค‘์˜์„ฑ ํ•ด์†Œ(disambiguation)๊ฐ€ ๊ฐ€๋Šฅํ•œ์ง€๋ฅผ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•ด, ํ…์ŠคํŠธ๊ฐ€ ์ค‘์˜์ ์ธ ๋ฐœํ™”๋“ค๋งŒ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ธ๊ณต์ ์ธ ์Œ์„ฑ ๋ง๋ญ‰์น˜๋ฅผ ์„ค๊ณ„ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์ง‘์ค‘(attention) ๊ธฐ๋ฐ˜ ์‹ ๊ฒฝ๋ง(neural network) ๋ชจ๋ธ๋“ค์„ ์ด์šฉํ•ด ์ค‘์˜์„ฑ์„ ํ•ด์†Œํ•œ๋‹ค. ์ด ๊ณผ์ •์—์„œ ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ํ†ต์‚ฌ์ /์˜๋ฏธ์  ์ค‘์˜์„ฑ ํ•ด์†Œ๊ฐ€ ์–ด๋– ํ•œ ๊ฒฝ์šฐ์— ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ์ง€ ๊ด€์ฐฐํ•˜๊ณ , ์ธ๊ฐ„์˜ ์–ธ์–ด ์ฒ˜๋ฆฌ์™€ ์–ด๋–ค ์—ฐ๊ด€์ด ์žˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ด€์ ์„ ์ œ์‹œํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋งˆ์ง€๋ง‰์œผ๋กœ, ์œ„์™€ ๊ฐ™์€ ์ ˆ์ฐจ๋กœ ์˜๋„ ์ดํ•ด ๊ณผ์ •์—์„œ์˜ ์ค‘์˜์„ฑ์ด ํ•ด์†Œ๋˜์—ˆ์„ ๊ฒฝ์šฐ, ์ด๋ฅผ ์–ด๋–ป๊ฒŒ ์‚ฐ์—…๊ณ„ ํ˜น์€ ์—ฐ๊ตฌ ๋‹จ์—์„œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€์— ๋Œ€ํ•œ ๊ฐ„๋žตํ•œ ๋กœ๋“œ๋งต์„ ์ œ์‹œํ•œ๋‹ค. 
ํ…์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ ์ค‘์˜์„ฑ ํŒŒ์•…๊ณผ ์Œ์„ฑ ๊ธฐ๋ฐ˜์˜ ์˜๋„ ์ดํ•ด ๋ชจ๋“ˆ์„ ํ†ตํ•ฉํ•œ๋‹ค๋ฉด, ์˜ค๋ฅ˜์˜ ์ „ํŒŒ๋ฅผ ์ค„์ด๋ฉด์„œ๋„ ํšจ์œจ์ ์œผ๋กœ ์ค‘์˜์„ฑ์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” ์‹œ์Šคํ…œ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ ์‹œ์Šคํ…œ์€ ๋Œ€ํ™” ๋งค๋‹ˆ์ €(dialogue manager)์™€ ํ†ตํ•ฉ๋˜์–ด ๊ฐ„๋‹จํ•œ ๋Œ€ํ™”(chit-chat)๊ฐ€ ๊ฐ€๋Šฅํ•œ ๋ชฉ์  ์ง€ํ–ฅ ๋Œ€ํ™” ์‹œ์Šคํ…œ(task-oriented dialogue system)์„ ๊ตฌ์ถ•ํ•  ์ˆ˜๋„ ์žˆ๊ณ , ๋‹จ์ผ ์–ธ์–ด ์กฐ๊ฑด(monolingual condition)์„ ๋„˜์–ด ์Œ์„ฑ ๋ฒˆ์—ญ์—์„œ์˜ ์—๋Ÿฌ๋ฅผ ์ค„์ด๋Š” ๋ฐ์— ํ™œ์šฉ๋  ์ˆ˜๋„ ์žˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋ณธ๊ณ ๋ฅผ ํ†ตํ•ด, ์šด์œจ์— ๋ฏผ๊ฐํ•œ(prosody-sensitive) ์–ธ์–ด์—์„œ ์˜๋„ ์ดํ•ด๋ฅผ ์œ„ํ•œ ์ค‘์˜์„ฑ ํ•ด์†Œ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ์ด๋ฅผ ์‚ฐ์—… ๋ฐ ์—ฐ๊ตฌ ๋‹จ์—์„œ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ด๊ณ ์ž ํ•œ๋‹ค. ๋ณธ ์—ฐ๊ตฌ๊ฐ€ ๋‹ค๋ฅธ ์–ธ์–ด ๋ฐ ๋„๋ฉ”์ธ์—์„œ๋„ ๊ณ ์งˆ์ ์ธ ์ค‘์˜์„ฑ ๋ฌธ์ œ๋ฅผ ํ•ด์†Œํ•˜๋Š” ๋ฐ์— ๋„์›€์ด ๋˜๊ธธ ๋ฐ”๋ผ๋ฉฐ, ์ด๋ฅผ ์œ„ํ•ด ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•˜๋Š” ๋ฐ์— ํ™œ์šฉ๋œ ๋ฆฌ์†Œ์Šค, ๊ฒฐ๊ณผ๋ฌผ ๋ฐ ์ฝ”๋“œ๋“ค์„ ๊ณต์œ ํ•จ์œผ๋กœ์จ ํ•™๊ณ„์˜ ๋ฐœ์ „์— ์ด๋ฐ”์ง€ํ•˜๊ณ ์ž ํ•œ๋‹ค.Ambiguity in the language is inevitable. It is because, albeit language is a means of communication, a particular concept that everyone thinks of cannot be conveyed in a perfectly identical manner. As this is an inevitable factor, ambiguity in language understanding often leads to breakdown or failure of communication. There are various hierarchies of language ambiguity. However, not all ambiguity needs to be resolved. Different aspects of ambiguity exist for each domain and task, and it is crucial to define the boundary after recognizing the ambiguity that can be well-defined and resolved. In this dissertation, we investigate the types of ambiguity that appear in spoken language processing, especially in intention understanding, and conduct research to define and resolve it. Although this phenomenon occurs in various languages, its degree and aspect depend on the language investigated. The factor we focus on is cases where the ambiguity comes from the gap between the amount of information in the spoken language and the text. Here, we study the Korean language, which often shows different sentence structures and intentions depending on the prosody. In the Korean language, a text is often read with multiple intentions due to multi-functional sentence enders, frequent pro-drop, wh-intervention, etc. We first define this type of ambiguity and construct a corpus that helps detect ambiguous sentences, given that such utterances can be problematic for intention understanding. In constructing a corpus for intention understanding, we consider the directivity and rhetoricalness of a sentence. They make up a criterion for classifying the intention of spoken language into a statement, question, command, rhetorical question, and rhetorical command. Using the corpus annotated with sufficiently high agreement on a spoken language corpus, we show that colloquial corpus-based language models are effective in classifying ambiguous text given only textual data, and qualitatively analyze the characteristics of the task. We do not handle ambiguity only at the text level. To find out whether actual disambiguation is possible given a speech input, we design an artificial spoken language corpus composed only of ambiguous sentences, and resolve ambiguity with various attention-based neural network architectures. 
    In this process, we observe that ambiguity resolution is most effective when the textual and acoustic inputs co-attend to each other's features, especially when the audio processing module conveys attention information to the text module in a multi-hop manner (a sketch of this mechanism follows this entry). Finally, assuming that the ambiguity of intention understanding has been resolved by the proposed strategies, we present a brief roadmap of how the results can be utilized at the industry or research level. By integrating text-based ambiguity detection with a speech-based intention understanding module, we can build a system that handles ambiguity efficiently while reducing error propagation. Such a system can be integrated with a dialogue manager to make up a task-oriented dialogue system capable of chit-chat, or it can be used to reduce errors in multilingual settings such as speech translation, beyond merely monolingual conditions. Throughout the dissertation, we aim to show that ambiguity resolution for intention understanding in a prosody-sensitive language is achievable and can be utilized at the industry and research level. We hope that this study helps tackle chronic ambiguity issues in other languages and domains, linking linguistic science and engineering approaches.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Research Goal; 1.3 Outline of the Dissertation
    2 Related Work: 2.1 Spoken Language Understanding; 2.2 Speech Act and Intention (2.2.1 Performatives and statements; 2.2.2 Illocutionary act and speech act; 2.2.3 Formal semantic approaches); 2.3 Ambiguity of Intention Understanding in Korean (2.3.1 Ambiguities in language; 2.3.2 Speech act and intention understanding in Korean)
    3 Ambiguity in Intention Understanding of Spoken Language: 3.1 Intention Understanding and Ambiguity; 3.2 Annotation Protocol (3.2.1 Fragments; 3.2.2 Clear-cut cases; 3.2.3 Intonation-dependent utterances); 3.3 Data Construction (3.3.1 Source scripts; 3.3.2 Agreement; 3.3.3 Augmentation; 3.3.4 Train split); 3.4 Experiments and Results (3.4.1 Models; 3.4.2 Implementation; 3.4.3 Results); 3.5 Findings and Summary (3.5.1 Findings; 3.5.2 Summary)
    4 Disambiguation of Speech Intention: 4.1 Ambiguity Resolution (4.1.1 Prosody and syntax; 4.1.2 Disambiguation with prosody; 4.1.3 Approaches in SLU); 4.2 Dataset Construction (4.2.1 Script generation; 4.2.2 Label tagging; 4.2.3 Recording); 4.3 Experiments and Results (4.3.1 Models; 4.3.2 Results); 4.4 Summary
    5 System Integration and Application: 5.1 System Integration for Intention Identification (5.1.1 Proof of concept; 5.1.2 Preliminary study); 5.2 Application to Spoken Dialogue System (5.2.1 What is 'Free-running'; 5.2.2 Omakase chatbot); 5.3 Beyond Monolingual Approaches (5.3.1 Spoken language translation; 5.3.2 Dataset; 5.3.3 Analysis; 5.3.4 Discussion); 5.4 Summary
    6 Conclusion and Future Work; Bibliography; Abstract (in Korean); Acknowledgment
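
    The following is a minimal PyTorch sketch of the multi-hop audio-to-text attention the abstract describes: acoustic features repeatedly query the text features, and the final attended summary feeds a five-way intention classifier (statement, question, command, rhetorical question, rhetorical command). The dimensions, hop count, and head count are illustrative assumptions, not the dissertation's configuration.

```python
import torch
import torch.nn as nn

class MultiHopCoAttention(nn.Module):
    def __init__(self, dim=256, n_hops=2, n_heads=4, n_intents=5):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_hops))
        self.classifier = nn.Linear(dim, n_intents)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, T_audio, dim) from an acoustic encoder
        # text_feats:  (batch, T_text, dim) from a text encoder
        query = audio_feats
        for attn in self.hops:
            # The audio stream attends over the text features; the attended
            # summary becomes the query of the next hop, conveying attention
            # information to the text side in a multi-hop manner.
            query, _ = attn(query, text_feats, text_feats)
        return self.classifier(query.mean(dim=1))  # logits over 5 intentions
```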

    Developing enhanced conversational agents for social virtual worlds

    In this paper, we present a methodology for the development of embodied conversational agents for social virtual worlds. The agents provide multimodal communication with their users, including speech interaction. Our proposal combines different techniques related to Artificial Intelligence, Natural Language Processing, Affective Computing, and User Modeling. A statistical methodology has been developed to model the system's conversational behavior, which is learned from an initial corpus and improved with the knowledge acquired from successive interactions. In addition, the selection of the next system response is adapted considering information stored in the users' profiles and also the emotional contents detected in the users' utterances. Our proposal has been evaluated with the successful development of an embodied conversational agent placed in the Second Life social virtual world. The avatar includes the different models and interacts with the users who inhabit the virtual world in order to provide academic information. The experimental results show that the agent's conversational behavior adapts successfully to the specific characteristics of users interacting in such environments. Work partially supported by the Spanish CICyT Projects under grants TRA2015-63708-R and TRA2016-78886-C3-1-R.
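
    As a hedged sketch of the statistical selection idea (not the authors' model): next-response counts learned from the initial corpus are conditioned on the dialogue state, a user-profile group, and the detected emotion, updated after every interaction, and backed off to state-only statistics when a context is unseen. The keys and the back-off rule are assumptions for illustration.

```python
from collections import Counter, defaultdict

class ResponseSelector:
    def __init__(self):
        # counts[(state, profile_group, emotion)][response] -> frequency
        self.counts = defaultdict(Counter)

    def update(self, state, profile_group, emotion, response):
        # Improve the corpus-learned model with each successive interaction.
        self.counts[(state, profile_group, emotion)][response] += 1

    def select(self, state, profile_group, emotion):
        candidates = self.counts.get((state, profile_group, emotion), Counter())
        if not candidates:
            # Back off to statistics aggregated over all profiles and emotions.
            candidates = sum((c for (s, _, _), c in self.counts.items()
                              if s == state), Counter())
        return candidates.most_common(1)[0][0] if candidates else None
```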
    • โ€ฆ
    corecore