21,437 research outputs found
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding human social interactions and developing a robot that can
smoothly communicate with human users in the long term, requires an
understanding of the dynamics of symbol systems and is crucially important. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and a
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory--motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.Comment: submitted to Advanced Robotic
MATT: Multimodal Attention Level Estimation for e-learning Platforms
This work presents a new multimodal system for remote attention level
estimation based on multimodal face analysis. Our multimodal approach uses
different parameters and signals obtained from the behavior and physiological
processes that have been related to modeling cognitive load such as faces
gestures (e.g., blink rate, facial actions units) and user actions (e.g., head
pose, distance to the camera). The multimodal system uses the following modules
based on Convolutional Neural Networks (CNNs): Eye blink detection, head pose
estimation, facial landmark detection, and facial expression features. First,
we individually evaluate the proposed modules in the task of estimating the
student's attention level captured during online e-learning sessions. For that
we trained binary classifiers (high or low attention) based on Support Vector
Machines (SVM) for each module. Secondly, we find out to what extent multimodal
score level fusion improves the attention level estimation. The mEBAL database
is used in the experimental framework, a public multi-modal database for
attention level estimation obtained in an e-learning environment that contains
data from 38 users while conducting several e-learning tasks of variable
difficulty (creating changes in student cognitive loads).Comment: Preprint of the paper presented to the Workshop on Artificial
Intelligence for Education (AI4EDU) of AAAI 202
How can a multimodal approach to primate communication help us understand the evolution of communication?
Scientists studying the communication of non-human animals are often aiming to better understand the evolution of human communication, including human language. Some scientists take a phylogenetic perspective, where the goal is to trace the evolutionary history of communicative traits, while others take a functional perspective, where the goal is to understand the selection pressures underpinning specific traits. Both perspectives are necessary to fully understand the evolution of communication, but it is important to understand how the two perspectives differ and what they can and cannot tell us. Here, we suggest that integrating phylogenetic and functional questions can be fruitful in better understanding the evolution of communication. We also suggest that adopting a multimodal approach to communication might help to integrate phylogenetic and functional questions, and provide an interesting avenue for research into language evolution
Early Turn-taking Prediction with Spiking Neural Networks for Human Robot Collaboration
Turn-taking is essential to the structure of human teamwork. Humans are
typically aware of team members' intention to keep or relinquish their turn
before a turn switch, where the responsibility of working on a shared task is
shifted. Future co-robots are also expected to provide such competence. To that
end, this paper proposes the Cognitive Turn-taking Model (CTTM), which
leverages cognitive models (i.e., Spiking Neural Network) to achieve early
turn-taking prediction. The CTTM framework can process multimodal human
communication cues (both implicit and explicit) and predict human turn-taking
intentions in an early stage. The proposed framework is tested on a simulated
surgical procedure, where a robotic scrub nurse predicts the surgeon's
turn-taking intention. It was found that the proposed CTTM framework
outperforms the state-of-the-art turn-taking prediction algorithms by a large
margin. It also outperforms humans when presented with partial observations of
communication cues (i.e., less than 40% of full actions). This early prediction
capability enables robots to initiate turn-taking actions at an early stage,
which facilitates collaboration and increases overall efficiency.Comment: Submitted to IEEE International Conference on Robotics and Automation
(ICRA) 201
An information assistant system for the prevention of tunnel vision in crisis management
In the crisis management environment, tunnel vision is a set of bias in decision makers’ cognitive process which often leads to incorrect understanding of the real crisis situation, biased perception of information, and improper decisions. The tunnel vision phenomenon is a consequence of both the challenges in the task and the natural limitation in a human being’s cognitive process. An information assistant system is proposed with the purpose of preventing tunnel vision. The system serves as a platform for monitoring the on-going crisis event. All information goes through the system before arrives at the user. The system enhances the data quality, reduces the data quantity and presents the crisis information in a manner that prevents or repairs the user’s cognitive overload. While working with such a system, the users (crisis managers) are expected to be more likely to stay aware of the actual situation, stay open minded to possibilities, and make proper decisions
Machine Understanding of Human Behavior
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric of our everyday living spaces and projecting the human user into the foreground. If this prediction is to come true, then next generation computing, which we will call human computing, should be about anticipatory user interfaces that should be human-centered, built for humans based on human models. They should transcend the traditional keyboard and mouse to include natural, human-like interactive functions including understanding and emulating certain human behaviors such as affective and social signaling. This article discusses a number of components of human behavior, how they might be integrated into computers, and how far we are from realizing the front end of human computing, that is, how far are we from enabling computers to understand human behavior
Robust Modeling of Epistemic Mental States
This work identifies and advances some research challenges in the analysis of
facial features and their temporal dynamics with epistemic mental states in
dyadic conversations. Epistemic states are: Agreement, Concentration,
Thoughtful, Certain, and Interest. In this paper, we perform a number of
statistical analyses and simulations to identify the relationship between
facial features and epistemic states. Non-linear relations are found to be more
prevalent, while temporal features derived from original facial features have
demonstrated a strong correlation with intensity changes. Then, we propose a
novel prediction framework that takes facial features and their nonlinear
relation scores as input and predict different epistemic states in videos. The
prediction of epistemic states is boosted when the classification of emotion
changing regions such as rising, falling, or steady-state are incorporated with
the temporal features. The proposed predictive models can predict the epistemic
states with significantly improved accuracy: correlation coefficient (CoERR)
for Agreement is 0.827, for Concentration 0.901, for Thoughtful 0.794, for
Certain 0.854, and for Interest 0.913.Comment: Accepted for Publication in Multimedia Tools and Application, Special
Issue: Socio-Affective Technologie
- …