87 research outputs found
A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems
This survey provides a comprehensive review of research on multi-turn
dialogue systems, with a particular focus on multi-turn dialogue systems based
on large language models (LLMs). This paper aims to (a) give a summary of
existing LLMs and approaches for adapting LLMs to downstream tasks; (b)
elaborate recent advances in multi-turn dialogue systems, covering both
LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems,
along with datasets and evaluation metrics; (c) discuss some future emphasis
and recent research problems arising from the development of LLMs and the
increasing demands on multi-turn dialogue systems.
Comment: 35 pages, 10 figures, ACM Computing Survey
Can Large Language Models Be Good Companions? An LLM-Based Eyewear System with Conversational Common Ground
Developing chatbots as personal companions has long been a goal of artificial
intelligence researchers. Recent advances in Large Language Models (LLMs) have
delivered a practical solution for endowing chatbots with anthropomorphic
language capabilities. However, it takes more than LLMs to enable chatbots that
can act as companions. Humans use their understanding of individual
personalities to drive conversations. Chatbots also require this capability to
enable human-like companionship. They should act based on personalized,
real-time, and time-evolving knowledge of their owner. We define such essential
knowledge as the common ground between chatbots and their owners, and
we propose to build a common-ground-aware dialogue system from an LLM-based
module, named OS-1, to enable chatbot companionship. Hosted by
eyewear, OS-1 can sense the visual and audio signals the user receives and
extract real-time contextual semantics. Those semantics are categorized and
recorded to formulate historical contexts from which the user's profile is
distilled and evolves over time, i.e., OS-1 gradually learns about its user.
OS-1 combines knowledge from real-time semantics, historical contexts, and
user-specific profiles to produce a common-ground-aware prompt input into the
LLM module. The LLM's output is converted to audio, spoken to the wearer when
appropriate. We conduct laboratory and in-field studies to assess OS-1's ability
to build common ground between the chatbot and its user. The technical
feasibility and capabilities of the system are also evaluated. OS-1, with its
common-ground awareness, can significantly improve user satisfaction and
potentially lead to downstream tasks such as personal emotional support and
assistance.
Comment: 36 pages, 25 figures, Under review at ACM IMWU
Toward Multi-modal Multi-aspect Deep Alignment and Integration
Multi-modal/-aspect data contains complementary information about the same object of interest,
which has the potential to improve model robustness and is therefore gaining increasing
research attention. There are two typical categories of multi-modal/-aspect problems that require
cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems that deal with data
from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems that
handle data with different aspects represented by the same media form, such as the syntactic and
semantic aspects of a textual sentence. However, most existing approaches to multi-modal/-aspect
problems tackle cross-modal/-aspect alignment and integration only implicitly, through various deep
neural networks optimized for the final task goals, leaving the
potential strategies for improving cross-modal/-aspect alignment and integration under-explored.
This thesis aims to initiate an exploration of strategies and approaches towards multi-modal/-aspect
deep alignment and integration. By looking into the limitations of existing approaches for both
heterogeneous multi-modal problems and homogeneous multi-aspect problems, it proposes novel
strategies and approaches for improving the cross-modal/-aspect alignment and integration and
evaluates them on the most essential representative tasks. For the heterogeneous setting, a
graph-structured representation learning approach that captures cross-modal information is proposed to enforce better
cross-modal alignment and is evaluated on the Language-to-Vision and Vision-and-Language
scenarios. For the homogeneous setting, a bi-directional and deep cross-integration
mechanism is explored to synthesise the multi-level semantics for comprehensive text
understanding, which is validated in the joint multi-aspect natural language understanding context
and its generalised text understanding setting.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
Human-Robot Interaction architecture for interactive and lively social robots
International Mention in the doctoral degree.
Society is experiencing a series of demographic changes that can result in an imbalance between
the active working and non-working age populations. One of the solutions considered to mitigate
this problem is the inclusion of robots in multiple sectors, including the service sector. But for
this to be a viable solution, among other features, robots need to be able to interact with humans
successfully. This thesis seeks to endow a social robot with the abilities required for natural
human-robot interactions. The main objective is to contribute to the body of knowledge in the area
of Human-Robot Interaction with a new, platform-independent, modular approach that focuses on
giving roboticists the tools required to develop applications that involve interactions with humans. In
particular, this thesis focuses on three problems that need to be addressed: (i) modelling interactions
between a robot and a user; (ii) endowing the robot with the expressive capabilities required for
successful communication; and (iii) endowing the robot with a lively appearance.
The approach to dialogue modelling presented in this thesis proposes to model dialogues as a
sequence of atomic interaction units, called Communicative Acts, or CAs. They can be parametrized
at runtime to achieve different communicative goals, and are endowed with mechanisms oriented to
solve some of the uncertainties related to interaction. Two dimensions have been used to identify the
required CAs: initiative (the robot's or the user's), and intention (either to retrieve information or to convey
it). These basic CAs can be combined in a hierarchical manner to create more complex, reusable
structures. This approach simplifies the creation of new interactions, by allowing developers to focus
exclusively on designing the flow of the dialogue, without having to re-implement functionalities
that are common to all dialogues (like error handling, for example).
The expressiveness of the robot is based on the use of a library of predefined multimodal gestures,
or expressions, modelled as state machines. The module managing the expressiveness receives requests
for performing gestures, schedules their execution in order to avoid any possible conflict that might
arise, loads them, and ensures that their execution goes without problems. The proposed approach
is also able to generate expressions at runtime based on a list of unimodal actions (an utterance,
the motion of a limb, etc.). One of the key features of the proposed expressiveness management
approach is the integration of a series of modulation techniques that can be used to modify the
robot’s expressions at runtime. This would allow the robot to adapt them to the particularities of a
given situation (which would also increase the variability of the robot’s expressiveness), and to display
different internal states with the same expressions.
Considering that being recognized as a living being is a requirement for engaging in social
encounters, the perception of a social robot as a living entity is a key requirement to foster
human-robot interactions. In this dissertation, two approaches have been proposed. The first
method generates actions for the different interfaces of the robot at certain intervals. The frequency
and intensity of these actions are defined by a signal that represents the pulse of the robot, which can
be adapted to the context of the interaction or the internal state of the robot. The second method
enhances the robot’s utterances by predicting the appropriate non-verbal expressions that should
accompany them, according to the content of the robot’s message, as well as its communicative
intention. A deep learning model receives the transcription of the robot’s utterances, predicts
which expressions should accompany them, and synchronizes them, so each gesture selected starts at
the appropriate time. The model has been developed using a combination of a Long-Short Term
Memory network-based encoder and a Conditional Random Field for generating a sequence of
gestures that are combined with the robot’s utterance.
All the elements presented above form the core of a modular Human-Robot Interaction
architecture that has been integrated in multiple platforms, and tested under different conditions.
Doctoral Programme in Electrical, Electronic and Automation Engineering, Universidad Carlos III de Madrid.
President: Fernando Torres Medina. Secretary: Concepción Alicia Monje Micharet. Member: Amirabdollahian Farshi
A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications
Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology of monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments.
In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame which circumvents strict attribution of speaker-turns, by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
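The frame-based idea behind TAMA described in this abstract can be illustrated with a rough sketch. It is not the thesis's implementation: the frame length, the step size, the unweighted per-frame mean, and the function names `tama_series` and `pearson` are all assumptions made for illustration. Each speaker's feature samples (for example pitch values with timestamps) are averaged over overlapping frames that share the same time grid, and the two resulting series are then compared with a Pearson correlation.

```python
from math import sqrt
from statistics import mean


def tama_series(samples, frame_len, step, total_dur):
    """Average one speaker's feature over overlapping, fixed-length frames.

    samples: list of (time_sec, value) pairs for a single speaker.
    Frames start at 0, step, 2*step, ... so every speaker processed with the
    same parameters shares the same (time-aligned) frame grid.
    Returns one mean per frame, or None for frames without samples.
    """
    series = []
    start = 0.0
    while start + frame_len <= total_dur:
        vals = [v for t, v in samples if start <= t < start + frame_len]
        series.append(mean(vals) if vals else None)
        start += step
    return series


def pearson(xs, ys):
    """Pearson correlation over frames where both speakers have a value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x, _ in pairs)
    var_y = sum((y - my) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / (sqrt(var_x) * sqrt(var_y))


# Made-up pitch samples (seconds, Hz) for two speakers in a 20-second dialogue.
speaker_a = [(1, 100), (6, 110), (12, 120), (18, 130)]
speaker_b = [(2, 200), (7, 210), (13, 220), (17, 230)]

series_a = tama_series(speaker_a, frame_len=10, step=5, total_dur=20)
series_b = tama_series(speaker_b, frame_len=10, step=5, total_dur=20)
r = pearson(series_a, series_b)
```

In this toy data both speakers raise their pitch at the same rate, so the two frame series move together; with real recordings the correlation would be one of several time-series statistics computed over the aligned frames.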
Students' language in computer-assisted tutoring of mathematical proofs
Truth and proof are central to mathematics. Proving (or disproving) seemingly simple statements often turns out to be one of the hardest mathematical tasks. Yet, doing proofs is rarely taught in the classroom. Studies on cognitive difficulties in learning to do proofs have shown that pupils and students not only often do not understand or cannot apply basic formal reasoning techniques and do not know how to use formal mathematical language, but, at a far more fundamental level, they also do not understand what it means to prove a statement or even do not see the purpose of proof at all. Since insight into the importance of proof and doing proofs as such cannot be learnt other than by practice, learning support through individualised tutoring is in demand.
This volume presents a part of an interdisciplinary project, set at the intersection of pedagogical science, artificial intelligence, and (computational) linguistics, which investigated issues involved in provisioning computer-based tutoring of mathematical proofs through dialogue in natural language. The ultimate goal in this context, addressing the above-mentioned need for learning support, is to build intelligent automated tutoring systems for mathematical proofs. The research presented here has been focused on the language that students use while interacting with such a system: its linguistic properties and computational modelling. Contribution is made at three levels: first, an analysis of language phenomena found in students' input to a (simulated) proof tutoring system is conducted and the variety of students' verbalisations is quantitatively assessed; second, a general computational processing strategy for informal mathematical language and methods of modelling prominent language phenomena are proposed; and third, the prospects for natural language as an input modality for proof tutoring systems are evaluated based on collected corpora.
Design of a Controlled Language for Critical Infrastructures Protection
We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates
from the need to coordinate and categorize the communications on CIP at the European level. These communications can be physically
represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of
traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an
analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus.
JRC.G.6-Security technology assessmen
Designing Embodied Interactive Software Agents for E-Learning: Principles, Components, and Roles
Embodied interactive software agents are complex autonomous, adaptive, and social software systems with a digital embodiment that enables them to act on and react to other entities (users, objects, and other agents) in their environment through bodily actions, which include the use of verbal and non-verbal communicative behaviors in face-to-face interactions with the user. These agents have been developed for various roles in different application domains, in which they perform tasks that have been assigned to them by their developers or delegated to them by their users or by other agents. In computer-assisted learning, embodied interactive pedagogical software agents have the general task to promote human learning by working with students (and other agents) in computer-based learning environments, among them e-learning platforms based on Internet technologies, such as the Virtual Linguistics Campus (www.linguistics-online.com). In these environments, pedagogical agents provide contextualized, qualified, personalized, and timely assistance, cooperation, instruction, motivation, and services for both individual learners and groups of learners.
This thesis develops a comprehensive, multidisciplinary, and user-oriented view of the design of embodied interactive pedagogical software agents, which integrates theoretical and practical insights from various academic and other fields. The research intends to contribute to the scientific understanding of issues, methods, theories, and technologies that are involved in the design, implementation, and evaluation of embodied interactive software agents for different roles in e-learning and other areas. For developers, the thesis provides sixteen basic principles (Added Value, Perceptible Qualities, Balanced Design, Coherence, Consistency, Completeness, Comprehensibility, Individuality, Variability, Communicative Ability, Modularity, Teamwork, Participatory Design, Role Awareness, Cultural Awareness, and Relationship Building) plus a large number of specific guidelines for the design of embodied interactive software agents and their components. Furthermore, it offers critical reviews of theories, concepts, approaches, and technologies from different areas and disciplines that are relevant to agent design. Finally, it discusses three pedagogical agent roles (virtual native speaker, coach, and peer) in the scenario of the linguistic fieldwork classes on the Virtual Linguistics Campus and presents detailed considerations for the design of an agent for one of these roles (the virtual native speaker)