
    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals like glances, utterances, facial expressions, and body language are central to human communicative situations and have been shown to be connected to important high-level constructs such as emotions, turn-taking, rapport, or leadership. A prerequisite for the creation of social machines that are able to support humans in, e.g., education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting, and anticipating nonverbal behaviour in social interactions. First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant, as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection that exploits the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the calibration drift that occurs in daily-life usage of mobile eye trackers. Second, we improve the interpretation of social signals in terms of higher-level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. Furthermore, we are the first to study low rapport detection in group interactions and the first to investigate a cross-dataset evaluation setting for the emergent leadership detection task. Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to more seamlessly share attention with humans, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.
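
    A minimal sketch of the weak-supervision idea behind the multi-person eye contact method above, under the assumption that listeners tend to look at the active speaker, so speaking turns can stand in for manual gaze annotations. The names gaze_feats and speaking are hypothetical; the thesis's actual features and model are not reproduced here.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def train_eye_contact_detector(gaze_feats, speaking):
            """gaze_feats: (T, D) per-frame gaze/head features of one listener;
            speaking: (T,) 1 on frames where the watched person is the active
            speaker. Speaking turns act as noisy positive labels for 'gaze
            directed at that person', so no manual eye contact annotation
            is needed."""
            clf = LogisticRegression(max_iter=1000)
            clf.fit(gaze_feats, speaking.astype(int))
            return clf  # predict_proba gives per-frame eye contact scores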

    Decoding attentional load in visual perception: a signal processing approach

    Previous research has established that visual perception tasks high in attentional load (or ‘perceptual load’, defined operationally to include either a larger number of items or a greater perceptual processing demand) result in reduced perceptual sensitivity and cortical response for visual stimuli outside the focus of attention. However, there are three challenges facing the load theory of attention today. The first is to describe a neural mechanism by which load-induced perceptual deficits are explained; the second is to clarify the concept of perceptual load and develop a method for estimating the load induced by a visual task a priori, without recourse to measures of secondary perceptual effects; and the third is to extend the study of attentional load to natural, real-world visual tasks. In this thesis we employ signal processing and machine learning approaches to address these challenges. In Chapters 3 and 4 it is shown that high perceptual load degrades the perception of orientation by modulating the tuning curves of neural populations in early visual cortex. The combination of tuning curve modulations reported is unique to perceptual load, inducing broadened tuning as well as reductions in tuning amplitude and overall neural activity, and so provides a novel low-level mechanism for behaviourally relevant failures of vision such as inattentional blindness. In Chapter 5, a predictive model of perceptual load during the task of driving is produced. The high variation in perceptual demands during real-world driving allows the construction of a direct fine-scale mapping between high-resolution natural imagery, captured from a driver's point of view, and induced perceptual load. The model therefore constitutes the first system able to produce a priori estimates of load directly from the visual characteristics of a natural task, extending research into the antecedents of perceptual load beyond the realm of austere laboratory displays. Taken together, the findings of this thesis represent major theoretical advances into both the causes and effects of high perceptual load.
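
    The reported tuning-curve modulation can be illustrated with a toy Gaussian population model; the parameter shifts below are purely illustrative and are not the fitted values from Chapters 3 and 4.

        import numpy as np

        def tuning_curve(theta, pref, amp, width, baseline):
            # Gaussian orientation tuning around the preferred angle
            return baseline + amp * np.exp(-0.5 * ((theta - pref) / width) ** 2)

        theta = np.linspace(-90, 90, 181)  # orientation (degrees)
        low_load = tuning_curve(theta, 0, amp=1.0, width=20, baseline=0.2)
        # High load: broadened tuning, lower amplitude, lower overall activity
        high_load = tuning_curve(theta, 0, amp=0.7, width=30, baseline=0.1)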

    Real-Time Detection System of Driver Distraction Using Machine Learning


    Study on recognition of facial expressions of affect

    Facial expression recognition is a particularly interesting field of computer vision since it brings innumerable benefits to our society, benefits that translate into a large number of applications in subjects such as neuroscience, psychology, or computer science. The relevance of the topic is reflected in the vast literature already produced, which describes notable progress. However, the development of new approaches still faces multiple challenges, including head-pose variations, illumination variations, identity bias, occlusions, and registration errors. One focus in this field is to achieve comparable results when moving from a controlled environment to a more naturalistic scenario. Although facial expression recognition has been addressed in many different projects, attention should be drawn to the design of an interface that addresses patient engagement in healthcare, given the rising tendency to engage patients in their own healthcare. Some open questions still need to be answered before a significant impact on healthcare can be made.

    Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

    Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizers, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. The error analysis showed that the frame-based systems could present some problems when used directly to solve a video-based task despite the domain adaptation, which opens a new line of research into ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, by combining these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset in a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users’ emotional state and that their combination enables improvement of system performance.
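
    The late-fusion step can be sketched as a weighted average of the two modalities' class posteriors; the weight w below is a placeholder, as the paper's exact fusion configuration is not restated here.

        import numpy as np

        def late_fusion(p_speech, p_face, w=0.5):
            """p_speech, p_face: (8,) class posteriors over the eight RAVDESS
            emotions from the speech and facial recognizers."""
            p = w * p_speech + (1.0 - w) * p_face  # weighted late fusion
            return int(np.argmax(p))               # predicted emotion index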

    Detecting Biological Motion for Human-Robot Interaction: A Link between Perception and Action

    One of the fundamental skills supporting safe and comfortable interaction between humans is their capability to intuitively understand each other's actions and intentions. At the basis of this ability is a special-purpose visual processing that the human brain has developed to comprehend human motion. Among the first "building blocks" enabling the bootstrapping of such visual processing is the ability to detect movements performed by biological agents in the scene, a skill mastered by human babies in the first days of their life. In this paper, we present a computational model based on the assumption that such a visual ability must rely on local low-level visual motion features, which are independent of shape, such as the configuration of the body, and of perspective. Moreover, we implement it on the humanoid robot iCub, embedding it into a software architecture that also leverages the regularities of biological motion to control robot attention and oculomotor behaviors. In essence, we put forth a model in which the regularities of biological motion link perception and action, enabling a robotic agent to follow a human-inspired sensory-motor behavior. We posit that this choice facilitates mutual understanding and goal prediction during collaboration, increasing the pleasantness and safety of the interaction.
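
    As a rough illustration of the premise that biological motion can be detected from local low-level motion features independent of shape, dense optical-flow statistics could feed a binary motion classifier. This is a hedged sketch, not the iCub architecture; the descriptor below is a hypothetical choice.

        import cv2
        import numpy as np

        def motion_descriptor(prev_gray, gray):
            # Dense optical flow between two consecutive grayscale frames
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
            # Shape-independent local statistics: speed/direction histograms
            return np.concatenate([np.histogram(mag, bins=16)[0],
                                   np.histogram(ang, bins=16)[0]]).astype(float)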

    Improving Visual Embeddings using Attention and Geometry Constraints

    Learning a non-linear function to embed raw data (i.e., images, video, or language) into a discriminative feature embedding space is considered a fundamental problem in the learning community. In such embedding spaces, data with similar semantic meaning are clustered, while data with dissimilar semantic meaning are separated. A number of practical applications can benefit from a good feature embedding, e.g., machine translation, classification/recognition, retrieval, and any-shot learning. In this thesis, we aim to improve visual embeddings using attention and geometry constraints. In the first part of the thesis, we develop two neural attention modules, which can automatically localize the informative regions within the feature map, thereby generating a discriminative feature representation for the image. An Attention in Attention (AiA) mechanism is first proposed to align the feature map along the deep network, by modeling the interaction of inner and outer attention modules. Intuitively, the AiA mechanism can be understood as one attention inside another, with the inner one determining where to focus for the outer attention module. Further, we employ explicit non-linear mappings in Reproducing Kernel Hilbert Spaces (RKHSs) to generate attention values, giving the channel descriptor of the feature map the representational power of second-order polynomial and Gaussian kernels. In addition, the Channel Recurrent Attention (CRA) module is proposed to build a global receptive field over the feature map. Existing attention mechanisms focus on either the channel pattern or the spatial pattern of the feature map, and thus cannot make full use of its information. The CRA module jointly learns the channel and spatial patterns of the feature map and produces an attention value for every element of the input feature map. This is achieved by feeding the spatial vectors to a recurrent neural network (RNN) sequentially, such that the RNN can create a global view of the feature map. In the second part, we investigate the superiority of geometry constraints for embedding learning. We first study the geometry of the set as an embedding for a video clip. Usually, the video embedding is optimized using a triplet loss in which the distance is calculated between clip features, so that frame features cannot be optimized directly. To this end, we model the video clip as a set and employ the distance between sets in the triplet loss. Tailored to this set-aware triplet loss, a new set distance metric is also proposed to measure the hard frames in a triplet. Optimizing over the set-aware triplet loss leads to a compact clip feature embedding, improving the discriminativeness of the video representation. Beyond flat Euclidean embedding spaces, we further study curved spaces, i.e., hyperbolic spaces, as image embedding spaces. In contrast to Euclidean embeddings, hyperbolic embeddings can encode the data's hierarchical structure, as the volume of hyperbolic space increases exponentially. However, performing basic operations for comparison in hyperbolic spaces is complex and time-consuming; for example, the similarity measure is not well defined in hyperbolic spaces. To mitigate this issue, we introduce positive definite (pd) kernels for hyperbolic embeddings. Specifically, we propose four pd kernels in hyperbolic spaces in conjunction with a theoretical analysis: the hyperbolic tangent kernel, the hyperbolic RBF kernel, the hyperbolic Laplace kernel, and the hyperbolic binomial kernel. We demonstrate the effectiveness of the proposed methods on image- and video-based person re-identification tasks. We also evaluate the generalization of the hyperbolic kernels on few-shot learning, zero-shot learning, and knowledge distillation tasks.
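
    A hedged PyTorch sketch of the Channel Recurrent Attention idea described above: the spatial vectors of the feature map are fed sequentially to an RNN so that it forms a global view, and an attention value is emitted for every element. Layer sizes are illustrative, not the thesis's configuration.

        import torch
        import torch.nn as nn

        class ChannelRecurrentAttention(nn.Module):
            def __init__(self, channels, hidden=128):
                super().__init__()
                self.rnn = nn.LSTM(channels, hidden, batch_first=True)
                self.proj = nn.Linear(hidden, channels)

            def forward(self, x):                     # x: (B, C, H, W)
                b, c, h, w = x.shape
                seq = x.flatten(2).transpose(1, 2)    # (B, H*W, C) spatial vectors
                out, _ = self.rnn(seq)                # global view over locations
                attn = torch.sigmoid(self.proj(out))  # (B, H*W, C) attention values
                attn = attn.transpose(1, 2).reshape(b, c, h, w)
                return x * attn                       # per-element reweighting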

    Head pose estimation and attentive behavior detection

    Master of Engineering thesis.