Search CORE

2,555 research outputs found

Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

Author: A Stolcke
A Waibel
J Niehues
K Laskowski
M Schroder
N Srivastava
N Ward
X Glorot
Publication venue
Publication date: 02/06/2017
Field of study

Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide a feedback from the listener to the speaker indicating to the speaker that he is still listened to. BCs can be expressed in different ways, depending on the modality of the interaction, for example as gestures or acoustic cues. In this work, we only considered acoustic cues. We are proposing an approach towards detecting BC opportunities based on acoustic input features like power and pitch. While other works in the field rely on the use of a hand-written rule set or specialized features, we made use of artificial neural networks. They are capable of deriving higher order features from input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline in comparison to our previously proposed setup. We also extended this setup by the use of Long Short-Term Memory (LSTM) networks which have shown to outperform feed-forward based setups on various tasks. Our best system achieved an F1-Score of 0.37 using power and pitch features. Adding linguistic information using word2vec, the score increased to 0.39

arXiv.org e-Print Archive

Crossref

Low-level grounding in a multimodal mobile service robot conversational system using graphical models

Author: Alexander Anil
Drygajlo Andrzej
Prodanov Plamen
Richiardi Jonas
Publication venue
Publication date: 18/06/2018
Field of study

The main task of a service robot with a voice-enabled communication interface is to engage a user in dialogue providing an access to the services it is designed for. In managing such interaction, inferring the user goal (intention) from the request for a service at each dialogue turn is the key issue. In service robot deployment conditions speech recognition limitations with noisy speech input and inexperienced users may jeopardize user goal identification. In this paper, we introduce a grounding state-based model motivated by reducing the risk of communication failure due to incorrect user goal identification. The model exploits the multiple modalities available in the service robot system to provide evidence for reaching grounding states. In order to handle the speech input as sufficiently grounded (correctly understood) by the robot, four proposed states have to be reached. Bayesian networks combining speech and non-speech modalities during user goal identification are used to estimate probability that each grounding state has been reached. These probabilities serve as a base for detecting whether the user is attending to the conversation, as well as for deciding on an alternative input modality (e.g., buttons) when the speech modality is unreliable. The Bayesian networks used in the grounding model are specially designed for modularity and computationally efficient inference. The potential of the proposed model is demonstrated comparing a conversational system for the mobile service robot RoboX employing only speech recognition for user goal identification, and a system equipped with multimodal grounding. The evaluation experiments use component and system level metrics for technical (objective) and user-based (subjective) evaluation with multimodal data collected during the conversations of the robot RoboX with user

RERO DOC Digital Library

Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction

Author: Horaud Radu
Lathuilière Stéphane
Massé Benoit
Mesejo Pablo
Publication venue: 'Elsevier BV'
Publication date: 23/04/2018
Field of study

This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and to adapt its gaze control strategy for human-robot interaction neither with the use of external sensors nor with human supervision. The robot learns to focus its attention onto groups of people from its own audio-visual experiences, independently of the number of people, of their positions and of their physical appearances. In particular, we use a recurrent neural network architecture in combination with Q-learning to find an optimal action-selection policy; we pre-train the network using a simulated environment that mimics realistic scenarios that involve speaking/silent participants, thus avoiding the need of tedious sessions of a robot interacting with people. Our experimental evaluation suggests that the proposed method is robust against parameter estimation, i.e. the parameter values yielded by the method do not have a decisive impact on the performance. The best results are obtained when both audio and visual information is jointly used. Experiments with the Nao robot indicate that our framework is a step forward towards the autonomous learning of socially acceptable gaze behavior.Comment: Paper submitted to Pattern Recognition Letter

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1

Multimodal perception of histological images for persons blind or visually impaired

Author: Zhang Ting
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2014
Field of study

Currently there is no suitable substitute technology to enable blind or visually impaired (BVI) people to interpret visual scientific data commonly generated during lab experimentation in real time, such as performing light microscopy, spectrometry, and observing chemical reactions. This reliance upon visual interpretation of scientific data certainly impedes students and scientists that are BVI from advancing in careers in medicine, biology, chemistry, and other scientific fields. To address this challenge, a real-time multimodal image perception system is developed to transform standard laboratory blood smear images for persons with BVI to perceive, employing a combination of auditory, haptic, and vibrotactile feedbacks. These sensory feedbacks are used to convey visual information through alternative perceptual channels, thus creating a palette of multimodal, sensorial information. A Bayesian network is developed to characterize images through two groups of features of interest: primary and peripheral features. Causal relation links were established between these two groups of features. Then, a method was conceived for optimal matching between primary features and sensory modalities. Experimental results confirmed this real-time approach of higher accuracy in recognizing and analyzing objects within images compared to tactile images

Purdue E-Pubs

Embodied interaction with visualization and spatial navigation in time-sensitive scenarios

Author: Li Yu-Ting
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2014
Field of study

Paraphrasing the theory of embodied cognition, all aspects of our cognition are determined primarily by the contextual information and the means of physical interaction with data and information. In hybrid human-machine systems involving complex decision making, continuously maintaining a high level of attention while employing a deep understanding concerning the task performed as well as its context are essential. Utilizing embodied interaction to interact with machines has the potential to promote thinking and learning according to the theory of embodied cognition proposed by Lakoff. Additionally, the hybrid human-machine system utilizing natural and intuitive communication channels (e.g., gestures, speech, and body stances) should afford an array of cognitive benefits outstripping the more static forms of interaction (e.g., computer keyboard). This research proposes such a computational framework based on a Bayesian approach; this framework infers operator\u27s focus of attention based on the physical expressions of the operators. Specifically, this work aims to assess the effect of embodied interaction on attention during the solution of complex, time-sensitive, spatial navigational problems. Toward the goal of assessing the level of operator\u27s attention, we present a method linking the operator\u27s interaction utility, inference, and reasoning. The level of attention was inferred through networks coined Bayesian Attentional Networks (BANs). BANs are structures describing cause-effect relationships between operator\u27s attention, physical actions and decision-making. The proposed framework also generated a representative BAN, called the Consensus (Majority) Model (CMM); the CMM consists of an iteratively derived and agreed graph among candidate BANs obtained by experts and by the automatic learning process. Finally, the best combinations of interaction modalities and feedback were determined by the use of particular utility functions. This methodology was applied to a spatial navigational scenario; wherein, the operators interacted with dynamic images through a series of decision making processes. Real-world experiments were conducted to assess the framework\u27s ability to infer the operator\u27s levels of attention. Users were instructed to complete a series of spatial-navigational tasks using an assigned pairing of an interaction modality out of five categories (vision-based gesture, glove-based gesture, speech, feet, or body balance) and a feedback modality out of two (visual-based or auditory-based). Experimental results have confirmed that physical expressions are a determining factor in the quality of the solutions in a spatial navigational problem. Moreover, it was found that the combination of foot gestures with visual feedback resulted in the best task performance (p\u3c .001). Results have also shown that embodied interaction-based multimodal interface decreased execution errors that occurred in the cyber-physical scenarios (p \u3c .001). Therefore we conclude that appropriate use of interaction and feedback modalities allows the operators maintain their focus of attention, reduce errors, and enhance task performance in solving the decision making problems

Purdue E-Pubs

Multimedia Retrieval

Author
Publication venue: Springer
Publication date: 01/01/2007
Field of study

University of Twente Research Information

Discovery of Words: towards a Computational Model of Language Acquisition

Author: Bosch L.F.M. ten
Boves L.W.J.
Hamme H. Van
Mihelic F.
Zibert J.
Publication venue: 'IntechOpen'
Publication date: 01/01/2008
Field of study

IntechOpen

Crossref