468 research outputs found

    Detecting Interlocutor Confusion in Situated Human-Avatar Dialogue: A Pilot Study

    To enhance engagement with conversational systems, our long-term research goal is to monitor the confusion state of a user and adapt dialogue policies in response. To this end, in this paper we present our initial research centred on a user-avatar dialogue scenario that we have developed to study the manifestation of confusion and, in the long term, its mitigation. We present a new definition of confusion that is tailored to the requirements of intelligent conversational system development for task-oriented dialogue. We also present the details of our Wizard-of-Oz-based data collection scenario, wherein users interacted with a conversational avatar and were presented with stimuli that were in some cases designed to invoke a confused state in the user. Post-study analysis of this data is also presented. Here, three pre-trained deep learning models were deployed to estimate base emotion, head pose and eye gaze. Despite a small pilot study group, our analysis demonstrates a significant relationship between these indicators and confusion states. We see this as a useful step forward in the automated analysis of the pragmatics of dialogue.

    Confirmation Report: Modelling Interlocutor Confusion in Situated Human Robot Interaction

    Human-Robot Interaction (HRI) is an important but challenging field focused on improving the interaction between humans and robots so as to make it more intelligent and effective. However, building natural conversational HRI is an interdisciplinary challenge for scholars, engineers, and designers. It is generally assumed that the pinnacle of human-robot interaction will be fluid, naturalistic conversational interaction that in important ways mimics how humans interact with each other. This is challenging at a number of levels, and in particular there are considerable difficulties when it comes to naturally monitoring and responding to the user’s mental state. On the topic of mental states, one area that has received little attention to date is monitoring the user for possible confusion states. Confusion is a non-trivial mental state which can be seen as having at least two substates. These two confusion states can be thought of as being associated with either positive or negative emotions. In the former, when people are productively confused, they are motivated to solve any current difficulties. Meanwhile, people who are in unproductive confusion may lose their engagement and motivation to overcome those difficulties, which in turn may even lead them to abandon the current conversation. While there has been some research on confusion monitoring and detection, it has been limited, with most work focused on evaluating confusion states in online learning tasks. The central hypothesis of this research is that the monitoring and detection of confusion states in users is essential to fluid task-centric HRI, and that it should be possible to detect such confusion and adjust policies to mitigate it. In this report, I expand on this hypothesis and set out several research questions. 
I also provide a comprehensive literature review, outline the work done to date towards my research hypothesis, and set out plans for future experimental work.

    EEG-based brain-computer interfaces using motor-imagery: techniques and challenges.

    Electroencephalography (EEG)-based brain-computer interfaces (BCIs), particularly those using motor-imagery (MI) data, have the potential to become groundbreaking technologies in both clinical and entertainment settings. MI data is generated when a subject imagines the movement of a limb. This paper reviews state-of-the-art signal processing techniques for MI EEG-based BCIs, with a particular focus on the feature extraction, feature selection and classification techniques used. It also summarizes the main applications of EEG-based BCIs, particularly those based on MI data, and finally presents a detailed discussion of the most prevalent challenges impeding the development and commercialization of EEG-based BCIs.

    Lindsey the Tour Guide Robot: Adaptive Long-Term Autonomy in Social Environments

    This project proposes a framework for online adaptation of robot behaviours deployed autonomously in social settings, with the goal of increasing users' overall engagement during interactions. One of the most critical aspects to address for robots deployed in “the real world” is the necessity of interacting with people, whether intentionally or not. Interacting with people requires a wide range of capabilities, from perceiving different people's intentions and emotional states to generating appropriate behaviours for the specific context of the interaction. Moreover, it requires that robots learn and adapt from experience while interacting with their users. In this project, a mobile robot is embedded in a long-term study in a public museum. The robot has been deployed for more than a year, to date, as an autonomous tour guide for the museum's visitors; its tasks are guiding people to the various exhibits and giving a description of each item. The long-term scenario allows studying how people interact with a robot in an unconstrained setting and gives the opportunity to improve the current state of the art in robot autonomy in social settings. The initial data collection shows that users' engagement during the robotised tours steeply declines after the initial moments of the interaction. The first main contribution of this project is to investigate whether it is possible to automatically assess users' engagement from the robot's point of view during the interactions. A dataset of robot ego-centric videos was collected and manually annotated by independent coders with continuous engagement values. From it, an end-to-end regression model was trained to predict engagement from the robot's point of view using a single camera. Experimental evaluation shows that the model accurately estimates the engagement level of people during an interaction, even in diverse environments and with different robots. 
Once the robot can detect the engagement state of users during the interactions, it can potentially plan tangential behaviours to influence the users' attentional state itself. The second contribution of this work is an online reinforcement learning algorithm that allows the robot to adapt its behaviour from the feedback obtained during the interactions. The feedback is obtained from users' engagement values estimated from the robot's head camera. In the experimental evaluation, the robot delivers the usual tours to the users, with the difference that the choice of some actions is left to the adaptive learning algorithm. Results show that after a few months of exploration, the robot successfully learns a policy that leads people to stay in the interaction for longer.

    A framework for emotion and sentiment predicting supported in ensembles

    Humans are prepared to comprehend each other’s emotions through subtle body movements or facial expressions; using those expressions, individuals change how they deliver messages when communicating with each other. Machines, user interfaces, or robots need to acquire this ability in order to change the interaction from the traditional “human-computer interaction” to a “human-machine cooperation”, where the machine provides the “right” information and functionality, at the “right” time, and in the “right” way. This dissertation presents a framework for emotion classification based on facial, speech, and text emotion prediction sources, supported by an ensemble of open-source code retrieved from off-the-shelf methods. The main contribution is integrating outputs from different sources and methods into a single prediction, consistent with the emotions presented by the system’s user. For each source, an initial aggregation of primary classifiers was implemented. For facial emotion classification, the aggregation achieved an accuracy above 73% on both the FER2013 and RAF-DB datasets. For speech emotion classification, four datasets were used, namely RAVDESS, TESS, CREMA-D, and SAVEE; for a combination of three of these datasets, the aggregation of primary classifiers achieved an accuracy above 86%. The text emotion aggregation of primary classifiers was tested with one dataset, EMOTIONLINES, on which the classification of emotions achieved an accuracy above 53%. Finally, the integration of all the methods in a single framework allows us to develop an emotion multi-source aggregator (EMsA), which aggregates the results of the primary emotion classifications from different sources, such as facial, speech, and text. We describe the EMsA and its results on the RAVDESS dataset, where it achieved 81.99% accuracy when using a combination of faces and speech. 
Finally, we present an initial approach for sentiment classification.
Humans are prepared to understand one another’s emotions through subtle body movements or facial expressions; that is, the way these movements and expressions are conveyed changes how messages are delivered when humans communicate with each other. Machines, user interfaces, or robots need to harness this ability in order to shift the interaction from traditional “human-computer interaction” to “human-machine cooperation”, where the machine provides the “right” information and functionality, at the “right” time and in the “right” way. This dissertation presents a framework (an ensemble of models) for emotion classification based on multiple sources, namely facial, speech, and text emotion prediction. The base classifiers are built on open-source code associated with methods available in the literature (primary classifiers). The main contribution is integrating different sources and different methods (the primary classifiers) into a single prediction consistent with the emotions shown by the system’s user. In this context, it should be noted that the state-of-the-art review of the different ways of classifying human emotions identified body emotion recognition (excluding the face). However, no published open-source code was found for primary classifiers that could be used within the scope of this dissertation. For speech and text emotion recognition, there were also difficulties in finding primary classifiers that met the necessary requirements, especially for text, where many models exist but cover numerous emotions other than the six basic emotions considered (sadness, fear, surprise, disgust, anger, and joy). For text, it was also found that more models exist for sentiment prediction than for emotion prediction. 
Separately for each source, i.e., for each component analysed (face, speech, and text), a Python framework was developed that implements a primary aggregator with n primary classifiers (in this dissertation, n = 3). To run the tests and obtain the results of each primary aggregator, a specific dataset is used and its data are sent to the aggregator: an image for the facial aggregator, an audio clip for the speech aggregator, and a sentence for the text aggregator. Each dataset used was split into training, validation, and test files. When the framework finishes processing the input it receives, it generates the corresponding results, namely: the file name or input identifier, the results of the first, second, and third primary classifiers, and the dataset ground truth. The primary classifiers' results are then sent to the final classifier of that primary aggregator, for which four classifiers were tested: (a) voting, which for n = 3 consists of comparing the emotion predicted by each primary classifier, i.e., if two primary classifiers predict the same emotion, that emotion is the voting result, and if all classifiers give different results, no result is chosen; (b) Random Forest; (c) AdaBoost; and (d) MLP (multilayer perceptron). Once the framework for each primary aggregator was complete, a super-aggregator was developed on the same principle as the primary aggregators, but instead of aggregating the results of only 3 primary classifiers it receives n × 3 primary classifier results (n for face, n for speech, and n for text). 
Regarding the results of the aggregators used for each source (face, speech, and text), facial emotion classification achieved an accuracy above 73% on the FER2013 and RAF-DB datasets. For speech emotion classification, four datasets were used, namely RAVDESS, TESS, CREMA-D, and SAVEE, and the best accuracy obtained was above 86% when a combination of 3 of the 4 datasets was used. For text emotion classification, testing was done with the EMOTIONLINES dataset, and the best result obtained was 53% accuracy. Integrating all the primary classifiers into a single framework made it possible to develop the emotion multi-source aggregator (EMsA), in which the final emotion classification is obtained, as already mentioned, by aggregating the primary emotion classifiers from different sources. For EMsA, results are presented using the RAVDESS dataset, where an accuracy of 81.99% was achieved when EMsA used a combination of faces and speech. It was not possible to test EMsA on a dataset recognised in the literature that simultaneously contains text, face, and speech information. Finally, an initial approach for sentiment classification was presented.
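
The voting rule described in this abstract (with n = 3, the emotion predicted by at least two primary classifiers wins, and no result is chosen when all three disagree) can be sketched in Python as follows. This is a minimal illustration, not the dissertation's code: the function name `majority_vote`, the plain string labels, and the generalisation to a strict majority for other values of n are assumptions made here.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of classifiers, or None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not predictions:
        return None
    # Find the most frequently predicted label and how often it occurs.
    label, count = Counter(predictions).most_common(1)[0]
    # Strict majority: for n = 3 this means at least two classifiers agree.
    return label if count > len(predictions) / 2 else None


if __name__ == "__main__":
    print(majority_vote(["joy", "joy", "anger"]))   # two of three agree
    print(majority_vote(["joy", "fear", "anger"]))  # all disagree, no result
```

Returning `None` when no majority exists mirrors the abstract's statement that no result is chosen when all primary classifiers disagree; the learned final classifiers the abstract lists (Random Forest, AdaBoost, MLP) are not covered by this sketch.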

    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals like glances, utterances, facial expressions and body language are central to human communicative situations and have been shown to be connected to important high-level constructs, such as emotions, turn-taking, rapport, or leadership. A prerequisite for the creation of social machines that are able to support humans in e.g. education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting and anticipating nonverbal behaviour in social interactions. First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection by exploiting the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the problem of calibration drift that occurs in daily-life usage of mobile eye trackers. Second, we improve the interpretation of social signals in terms of higher level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. 
Furthermore, we are the first to study low rapport detection in group interactions and to investigate a cross-dataset evaluation setting for the emergent leadership detection task. Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to more seamlessly share attention with humans, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.
Gaze, facial expressions, body language, and prosody play a central role in human communication as nonverbal signals. Numerous studies have linked them to important concepts such as emotions, turn-taking, leadership, and the quality of the relationship between two people. For machines to support people effectively in their daily social lives, automatic methods for sensing, interpreting, and anticipating nonverbal behaviour are needed. Although previous research has produced encouraging results in controlled studies, automatic analysis of nonverbal behaviour in less controlled situations remains a challenge. Moreover, there are hardly any studies on the anticipation of nonverbal behaviour in social situations. The goal of this thesis is to bring the vision of automatically understanding social situations a step closer to reality. This thesis makes important contributions to the automatic sensing of human gaze behaviour in everyday situations. 
Although many social interactions take place in groups, unsupervised methods for eye contact detection have so far existed only for dyadic interactions. We present a new approach to eye contact detection in groups that requires no manual annotations, by exploiting the statistical relationship between gaze and speaking behaviour. Daily activities are a challenge for mobile eye-tracking devices, because shifts of these devices can degrade their calibration. In this thesis, we use user behaviour on mobile devices to correct the effect of such shifts. Beyond sensing, this thesis also improves the interpretation of social signals. We publish the first dataset and the first method for emotion recognition in dyadic interactions without the use of specialised equipment. We also present the first study on automatic detection of low rapport in group interactions, and carry out the first cross-dataset evaluation for the detection of emergent leadership. To conclude the thesis, we present the first approaches to anticipating gaze behaviour in social interactions. Gaze behaviour has the special property that it serves both as a social signal and to direct visual perception. The ability to anticipate gaze behaviour therefore allows machines both to integrate more seamlessly into social interactions and to warn people when they are at risk of overlooking important aspects of the environment. We present methods for anticipating gaze behaviour in the context of interaction with mobile devices during daily activities, as well as during dyadic interactions via video calls.

    Applications of Affective Computing in Human-Robot Interaction: state-of-art and challenges for manufacturing

    The introduction of collaborative robots aims to make production more flexible, promoting greater interaction between humans and robots, including from a physical point of view. However, working closely with a robot may create stressful situations for the operator, which can negatively affect task performance. In Human-Robot Interaction (HRI), robots are expected to be socially intelligent, i.e., capable of understanding and reacting appropriately to human social and affective cues. This ability can be achieved by implementing affective computing, which concerns the development of systems able to recognize, interpret, process, and simulate human affects. Social intelligence is essential for robots to establish a natural interaction with people in several contexts, including the manufacturing sector with the emergence of Industry 5.0. In order to take full advantage of human-robot collaboration, the robotic system should be able to perceive the psycho-emotional and mental state of the operator through different sensing modalities (e.g., facial expressions, body language, voice, or physiological signals) and to adapt its behaviour accordingly. The development of socially intelligent collaborative robots in the manufacturing sector can lead to a symbiotic human-robot collaboration, raising several research challenges that still need to be addressed. The goals of this paper are the following: (i) providing an overview of affective computing implementation in HRI; (ii) analyzing the state of the art on this topic in different application contexts (e.g., healthcare, service applications, and manufacturing); (iii) highlighting research challenges for the manufacturing sector.

    Annotated Bibliography: Anticipation
