2,711 research outputs found
Affective Facial Expression Processing via Simulation: A Probabilistic Model
Understanding the mental state of other people is an important skill for
intelligent agents and robots to operate within social environments. However,
the mental processes involved in `mind-reading' are complex. One explanation of
such processes is Simulation Theory - it is supported by a large body of
neuropsychological research. Yet, determining the best computational model or
theory to use in simulation-style emotion detection, is far from being
understood.
In this work, we use Simulation Theory and neuroscience findings on
Mirror-Neuron Systems as the basis for a novel computational model, as a way to
handle affective facial expressions. The model is based on a probabilistic
mapping of observations from multiple identities onto a single fixed identity
(`internal transcoding of external stimuli'), and then onto a latent space
(`phenomenological response'). Together with the proposed architecture we
present some promising preliminary resultsComment: Annual International Conference on Biologically Inspired Cognitive
Architectures - BICA 201
Recommended from our members
Learning human activities and poses with interconnected data sources
Understanding human actions and poses in images or videos is a challenging problem in computer vision. There are different topics related to this problem such as action recognition, pose estimation, human-object interaction, and activity detection. Knowledge of actions and poses could benefit many applications, including video search, surveillance, auto-tagging, event detection, and human-computer interfaces. To understand humans' actions and poses, we need to address several challenges. First, humans are able to perform an enormous amount of poses. For example, simply to move forward, we can do crawling, walking, running, and sprinting. These poses all look different and require examples to cover these variations. Second, the appearance of a person's pose changes when looking from different viewing angles. The learned action model needs to cover the variations from different views. Third, many actions involve interactions between people and other objects, so we need to consider the appearance change corresponding to that object as well. Fourth, collecting such data for learning is difficult and expensive. Last, even if we can learn a good model for an action, to localize when and where the action happens in a long video remains a difficult problem due to the large search space. My key idea to alleviate these obstacles in learning humans' actions and poses is to discover the underlying patterns that connect the information from different data sources. Why will there be underlying patterns? The intuition is that all people share the same articulated physical structure. Though we can change our pose, there are common regulations that limit how our pose can be and how it can move over time. Therefore, all types of human data will follow these rules and they can serve as prior knowledge or regularization in our learning framework. If we can exploit these tendencies, we are able to extract additional information from data and use them to improve learning of humans' actions and poses. In particular, we are able to find patterns for how our pose could vary over time, how our appearance looks in a specific view, how our pose is when we are interacting with objects with certain properties, and how part of our body configuration is shared across different poses. If we could learn these patterns, they can be used to interconnect and extrapolate the knowledge between different data sources. To this end, I propose several new ways to connect human activity data. First, I show how to connect snapshot images and videos by exploring the patterns of how our pose could change over time. Building on this idea, I explore how to connect humans' poses across multiple views by discovering the correlations between different poses and the latent factors that affect the viewpoint variations. In addition, I consider if there are also patterns connecting our poses and nearby objects when we are interacting with them. Furthermore, I explore how we can utilize the predicted interaction as a cue to better address existing recognition problems including image re-targeting and image description generation. Finally, after learning models effectively incorporating these patterns, I propose a robust approach to efficiently localize when and where a complex action happens in a video sequence. The variants of my proposed approaches offer a good trade-off between computational cost and detection accuracy. My thesis exploits various types of underlying patterns in human data. The discovered structure is used to enhance the understanding of humans' actions and poses. By my proposed methods, we are able to 1) learn an action with very few snapshots by connecting them to a pool of label-free videos, 2) infer the pose for some views even without any examples by connecting the latent factors between different views, 3) predict the location of an object that a person is interacting with independent of the type and appearance of that object, then use the inferred interaction as a cue to improve recognition, and 4) localize an action in a complex long video. These approaches improve existing frameworks for understanding humans' actions and poses without extra data collection cost and broaden the problems that we can tackle.Computer Science
Improving Robot Perception Skills Using a Fast Image-Labelling Method with Minimal Human Intervention
[EN] Featured Application Natural interface to enhance human-robot interactions. The aim is to improve robot perception skills. Robot perception skills contribute to natural interfaces that enhance human-robot interactions. This can be notably improved by using convolutional neural networks. To train a convolutional neural network, the labelling process is the crucial first stage, in which image objects are marked with rectangles or masks. There are many image-labelling tools, but all require human interaction to achieve good results. Manual image labelling with rectangles or masks is labor-intensive and unappealing work, which can take months to complete, making the labelling task tedious and lengthy. This paper proposes a fast method to create labelled images with minimal human intervention, which is tested with a robot perception task. Images of objects taken with specific backgrounds are quickly and accurately labelled with rectangles or masks. In a second step, detected objects can be synthesized with different backgrounds to improve the training capabilities of the image set. Experimental results show the effectiveness of this method with an example of human-robot interaction using hand fingers. This labelling method generates a database to train convolutional networks to detect hand fingers easily with minimal labelling work. This labelling method can be applied to new image sets or used to add new samples to existing labelled image sets of any application. This proposed method improves the labelling process noticeably and reduces the time required to start the training process of a convolutional neural network model.The Universitat Politecnica de Valencia has financed the open access fees of this paper with the project number 20200676 (Microinspeccion de superficies).Ricolfe Viala, C.; Blanes Campos, C. (2022). Improving Robot Perception Skills Using a Fast Image-Labelling Method with Minimal Human Intervention. Applied Sciences. 12(3):1-14. https://doi.org/10.3390/app1203155711412
Recommended from our members
Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video
We present a system to recover the 3D shape and motion of a wide variety of
quadrupeds from video. The system comprises a machine learning front-end which
predicts candidate 2D joint positions, a discrete optimization which finds
kinematically plausible joint correspondences, and an energy minimization stage
which fits a detailed 3D model to the image. In order to overcome the limited
availability of motion capture training data from animals, and the difficulty
of generating realistic synthetic training images, the system is designed to
work on silhouette data. The joint candidate predictor is trained on
synthetically generated silhouette images, and at test time, deep learning
methods or standard video segmentation tools are used to extract silhouettes
from real data. The system is tested on animal videos from several species, and
shows accurate reconstructions of 3D shape and pose.GlaxoSmithKlin
Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video
We present a system to recover the 3D shape and motion of a wide variety of
quadrupeds from video. The system comprises a machine learning front-end which
predicts candidate 2D joint positions, a discrete optimization which finds
kinematically plausible joint correspondences, and an energy minimization stage
which fits a detailed 3D model to the image. In order to overcome the limited
availability of motion capture training data from animals, and the difficulty
of generating realistic synthetic training images, the system is designed to
work on silhouette data. The joint candidate predictor is trained on
synthetically generated silhouette images, and at test time, deep learning
methods or standard video segmentation tools are used to extract silhouettes
from real data. The system is tested on animal videos from several species, and
shows accurate reconstructions of 3D shape and pose.GlaxoSmithKlin
Advancing Multi-Modal Deep Learning: Towards Language-Grounded Visual Understanding
Using deep learning, computer vision now rivals people at object recognition and detection, opening doors to tackle new challenges in image understanding. Among these challenges, understanding and reasoning about language grounded visual content is of fundamental importance to advancing artificial intelligence. Recently, multiple datasets and algorithms have been created as proxy tasks towards this goal, with visual question answering (VQA) being the most widely studied. In VQA, an algorithm needs to produce an answer to a natural language question about an image. However, our survey of datasets and algorithms for VQA uncovered several sources of dataset bias and sub-optimal evaluation metrics that allowed algorithms to perform well by merely exploiting superficial statistical patterns. In this dissertation, we describe new algorithms and datasets that address these issues. We developed two new datasets and evaluation metrics that enable a more accurate measurement of abilities of a VQA model, and also expand VQA to include new abilities, such as reading text, handling out-of-vocabulary words, and understanding data-visualization. We also created new algorithms for VQA that have helped advance the state-of-the-art for VQA, including an algorithm that surpasses humans on two different chart question answering datasets about bar-charts, line-graphs and pie charts. Finally, we provide a holistic overview of several yet-unsolved challenges in not only VQA but vision and language research at large. Despite enormous progress, we find that a robust understanding and integration of vision and language is still an elusive goal, and much of the progress may be misleading due to dataset bias, superficial correlations and flaws in standard evaluation metrics. We carefully study and categorize these issues for several vision and language tasks and outline several possible paths towards development of safe, robust and trustworthy AI for language-grounded visual understanding
- …