Human-robot interaction system based on multimodal and adaptive dialogs

Alonso Martín, Fernando

thesis

Authors: Fernando Alonso Martín
Publication date: 1 January 2014

Abstract

International Mention in the doctoral degree (Mención Internacional en el título de doctor).

In recent years, research in Human-Robot Interaction (HRI) has paid increasing attention to interaction between robotic systems and users with no technological training. For these potential users, interaction techniques must not require specific prior knowledge. No technological skill should be assumed of the user; the only interactive skill that can be assumed is the one that lets them interact with other humans. The techniques developed and presented in this work aim, on the one hand, for the system or robot to express itself so that these users can understand it without any extra effort compared with interacting with people. On the other hand, they aim for the system or robot to interpret what these users express without requiring them to communicate differently from how they would with another person. In short, the goal is to imitate the way humans interact.

In this thesis, a natural interaction system called the Robotics Dialog System (RDS) has been developed and tested. It allows the robot and the user to interact through the various available communication channels. The complete system consists of several modules that work in a coordinated and complementary way to achieve the desired natural interaction. RDS lives inside a robotic control architecture and communicates with the other systems that compose it, such as decision making, sequencing, communication, games, sensory perception, and expression.

This thesis contributes to the state of the art at two levels. At the higher level, it presents the human-robot interaction system (RDS) based on multimodal dialogs. At the lower level, each chapter describes the components developed specifically for RDS, each of which contributes to the state of the art in its own field.
Before each contribution, it was necessary to integrate or implement the state-of-the-art techniques available to date. Most of these contributions are backed by publications in scientific journals.

The first field of work, which kept evolving throughout the research, was Natural Language Processing. The most important automatic speech recognition (ASR) systems were analysed and tested in real, everyday situations. Some of them were then integrated into RDS through a system that runs several ASR engines concurrently, with the twofold goal of improving recognition accuracy and providing several complementary input methods. The research then moved on to adapting the interaction to different types of microphones and acoustic environments. Finally, the system was extended to recognise speech in multiple languages and to identify users by their tone of voice.

The next field addressed was natural language generation. The goal was a speech synthesis system with a certain degree of naturalness and intelligibility that is multilingual, offers several voice timbres, and expresses emotions. A modular system with several abstraction layers was built so that several speech synthesis engines could be integrated. To give the system some naturalness and expressive variability, a template-based mechanism was added that allows speech to be synthesised with some lexical variability rather than from fixed, predefined sentences.

The Dialog Management System (DMS) was the next challenge. The existing paradigms were analysed, and a manager based on filling information slots was chosen. The chosen manager was then extended and modified to adapt to the user (by means of user profiles) and to hold some knowledge of the world (by using predefined files). At the same time, the multimodal fusion module was developed.
The goal of this module is to abstract multimodality away from the DMS; in other words, the DMS must be able to use a message regardless of the channel through which it was received. The module results from adapting the theory of communicative acts in human-human interaction to our interaction system. Its main function is to package the information emitted by the RDS sensory modules (following a communicative act detection algorithm developed for this work) and to deliver it to the DMS at each dialog turn.

To enhance multimodality, new input modes were added to the system. The user localization system analyses several inputs, including sound, to identify and locate the users around the robot. Other added modes are the reading of radio-frequency tags and the reading of written text. The emotions of the robot and of the user are also part of the system inputs: the robot's emotion is generated by an external decision-making module, while the user's emotion is perceived by analysing the acoustic features of their voice and, through artificial vision, their facial expressions.

Finally, new expressive (output) modes that make the interaction more natural were developed: non-verbal sounds generated in real time, singing, and engagement gestures such as looking at the user and nodding or shaking the head.

Official Doctoral Program in Electrical, Electronic and Automation Engineering (Programa Oficial de Doctorado en Ingeniería Eléctrica, Electrónica y Automática)

Committee chair: Carlos Balaguer Bernaldo de Quirós. Committee member: Antonio Barrientos Cru