2,555 research outputs found

    Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

    Full text link
    Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide a feedback from the listener to the speaker indicating to the speaker that he is still listened to. BCs can be expressed in different ways, depending on the modality of the interaction, for example as gestures or acoustic cues. In this work, we only considered acoustic cues. We are proposing an approach towards detecting BC opportunities based on acoustic input features like power and pitch. While other works in the field rely on the use of a hand-written rule set or specialized features, we made use of artificial neural networks. They are capable of deriving higher order features from input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline in comparison to our previously proposed setup. We also extended this setup by the use of Long Short-Term Memory (LSTM) networks which have shown to outperform feed-forward based setups on various tasks. Our best system achieved an F1-Score of 0.37 using power and pitch features. Adding linguistic information using word2vec, the score increased to 0.39

    Low-level grounding in a multimodal mobile service robot conversational system using graphical models

    Get PDF
    The main task of a service robot with a voice-enabled communication interface is to engage a user in dialogue providing an access to the services it is designed for. In managing such interaction, inferring the user goal (intention) from the request for a service at each dialogue turn is the key issue. In service robot deployment conditions speech recognition limitations with noisy speech input and inexperienced users may jeopardize user goal identification. In this paper, we introduce a grounding state-based model motivated by reducing the risk of communication failure due to incorrect user goal identification. The model exploits the multiple modalities available in the service robot system to provide evidence for reaching grounding states. In order to handle the speech input as sufficiently grounded (correctly understood) by the robot, four proposed states have to be reached. Bayesian networks combining speech and non-speech modalities during user goal identification are used to estimate probability that each grounding state has been reached. These probabilities serve as a base for detecting whether the user is attending to the conversation, as well as for deciding on an alternative input modality (e.g., buttons) when the speech modality is unreliable. The Bayesian networks used in the grounding model are specially designed for modularity and computationally efficient inference. The potential of the proposed model is demonstrated comparing a conversational system for the mobile service robot RoboX employing only speech recognition for user goal identification, and a system equipped with multimodal grounding. The evaluation experiments use component and system level metrics for technical (objective) and user-based (subjective) evaluation with multimodal data collected during the conversations of the robot RoboX with user

    Neural Network Based Reinforcement Learning for Audio-Visual Gaze Control in Human-Robot Interaction

    Get PDF
    This paper introduces a novel neural network-based reinforcement learning approach for robot gaze control. Our approach enables a robot to learn and to adapt its gaze control strategy for human-robot interaction neither with the use of external sensors nor with human supervision. The robot learns to focus its attention onto groups of people from its own audio-visual experiences, independently of the number of people, of their positions and of their physical appearances. In particular, we use a recurrent neural network architecture in combination with Q-learning to find an optimal action-selection policy; we pre-train the network using a simulated environment that mimics realistic scenarios that involve speaking/silent participants, thus avoiding the need of tedious sessions of a robot interacting with people. Our experimental evaluation suggests that the proposed method is robust against parameter estimation, i.e. the parameter values yielded by the method do not have a decisive impact on the performance. The best results are obtained when both audio and visual information is jointly used. Experiments with the Nao robot indicate that our framework is a step forward towards the autonomous learning of socially acceptable gaze behavior.Comment: Paper submitted to Pattern Recognition Letter

    Multimodal perception of histological images for persons blind or visually impaired

    Get PDF
    Currently there is no suitable substitute technology to enable blind or visually impaired (BVI) people to interpret visual scientific data commonly generated during lab experimentation in real time, such as performing light microscopy, spectrometry, and observing chemical reactions. This reliance upon visual interpretation of scientific data certainly impedes students and scientists that are BVI from advancing in careers in medicine, biology, chemistry, and other scientific fields. To address this challenge, a real-time multimodal image perception system is developed to transform standard laboratory blood smear images for persons with BVI to perceive, employing a combination of auditory, haptic, and vibrotactile feedbacks. These sensory feedbacks are used to convey visual information through alternative perceptual channels, thus creating a palette of multimodal, sensorial information. A Bayesian network is developed to characterize images through two groups of features of interest: primary and peripheral features. Causal relation links were established between these two groups of features. Then, a method was conceived for optimal matching between primary features and sensory modalities. Experimental results confirmed this real-time approach of higher accuracy in recognizing and analyzing objects within images compared to tactile images

    Embodied interaction with visualization and spatial navigation in time-sensitive scenarios

    Get PDF
    Paraphrasing the theory of embodied cognition, all aspects of our cognition are determined primarily by the contextual information and the means of physical interaction with data and information. In hybrid human-machine systems involving complex decision making, continuously maintaining a high level of attention while employing a deep understanding concerning the task performed as well as its context are essential. Utilizing embodied interaction to interact with machines has the potential to promote thinking and learning according to the theory of embodied cognition proposed by Lakoff. Additionally, the hybrid human-machine system utilizing natural and intuitive communication channels (e.g., gestures, speech, and body stances) should afford an array of cognitive benefits outstripping the more static forms of interaction (e.g., computer keyboard). This research proposes such a computational framework based on a Bayesian approach; this framework infers operator\u27s focus of attention based on the physical expressions of the operators. Specifically, this work aims to assess the effect of embodied interaction on attention during the solution of complex, time-sensitive, spatial navigational problems. Toward the goal of assessing the level of operator\u27s attention, we present a method linking the operator\u27s interaction utility, inference, and reasoning. The level of attention was inferred through networks coined Bayesian Attentional Networks (BANs). BANs are structures describing cause-effect relationships between operator\u27s attention, physical actions and decision-making. The proposed framework also generated a representative BAN, called the Consensus (Majority) Model (CMM); the CMM consists of an iteratively derived and agreed graph among candidate BANs obtained by experts and by the automatic learning process. Finally, the best combinations of interaction modalities and feedback were determined by the use of particular utility functions. This methodology was applied to a spatial navigational scenario; wherein, the operators interacted with dynamic images through a series of decision making processes. Real-world experiments were conducted to assess the framework\u27s ability to infer the operator\u27s levels of attention. Users were instructed to complete a series of spatial-navigational tasks using an assigned pairing of an interaction modality out of five categories (vision-based gesture, glove-based gesture, speech, feet, or body balance) and a feedback modality out of two (visual-based or auditory-based). Experimental results have confirmed that physical expressions are a determining factor in the quality of the solutions in a spatial navigational problem. Moreover, it was found that the combination of foot gestures with visual feedback resulted in the best task performance (p\u3c .001). Results have also shown that embodied interaction-based multimodal interface decreased execution errors that occurred in the cyber-physical scenarios (p \u3c .001). Therefore we conclude that appropriate use of interaction and feedback modalities allows the operators maintain their focus of attention, reduce errors, and enhance task performance in solving the decision making problems

    Multimedia Retrieval

    Get PDF
    • …
    corecore