“It's Not What You Say, But How You Say it”: A Reciprocal Temporo-frontal Network for Affective Prosody
Humans communicate emotion vocally by modulating acoustic cues such as pitch, intensity and voice quality. Research has documented how the relative presence or absence of such cues alters the likelihood of perceiving an emotion, but the neural underpinnings of acoustic cue-dependent emotion perception remain obscure. Using functional magnetic resonance imaging in 20 subjects, we examined a reciprocal circuit consisting of superior temporal cortex, amygdala and inferior frontal gyrus that may underlie affective prosodic comprehension. Results showed that increased saliency of emotion-specific acoustic cues was associated with increased activation in superior temporal cortex [planum temporale (PT), posterior superior temporal gyrus (pSTG), and posterior middle temporal gyrus (pMTG)] and amygdala, whereas decreased saliency of acoustic cues was associated with increased inferior frontal activity and temporo-frontal connectivity. These results suggest that sensory-integrative processing is facilitated when the acoustic signal is rich in affective information, yielding increased activation in temporal cortex and amygdala. Conversely, when the acoustic signal is ambiguous, greater evaluative processes are recruited, increasing activation in inferior frontal gyrus (IFG) and IFG-STG connectivity. Auditory regions may thus integrate acoustic information with amygdala input to form emotion-specific representations, which are evaluated within inferior frontal regions.
Semi-Supervised Speech Emotion Recognition with Ladder Networks
Speech emotion recognition (SER) systems find applications in various fields
such as healthcare, education, and security and defense. A major drawback of
these systems is their lack of generalization across different conditions. This
problem can be solved by training models on large amounts of labeled data from
the target domain, which is expensive and time-consuming. Another approach is
to increase the generalization of the models. An effective way to achieve this
goal is by regularizing the models through multitask learning (MTL), where
auxiliary tasks are learned along with the primary task. These methods often
require labeled data for the auxiliary tasks (gender, speaker identity, age or
other emotional descriptors), which is expensive to collect for emotion
recognition. This study proposes the use of ladder networks for emotion
recognition, which utilize an unsupervised auxiliary task. The primary task is
a regression problem to predict emotional attributes. The auxiliary task is the
reconstruction of intermediate feature representations using a denoising
autoencoder. This auxiliary task does not require labels so it is possible to
train the framework in a semi-supervised fashion with abundant unlabeled data
from the target domain. This study shows that the proposed approach creates a
powerful framework for SER, achieving better performance than fully
supervised single-task learning (STL) and MTL baselines. The approach is
implemented with several acoustic features, showing that ladder networks
generalize significantly better in cross-corpus settings. Compared to the STL
baselines, the proposed approach achieves relative gains in concordance
correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus
evaluations, and between 16.1% and 74.1% for cross-corpus evaluations,
highlighting the power of the architecture.
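The abstract above reports results as relative gains in CCC. As a minimal sketch of what that metric measures, the snippet below computes the standard concordance correlation coefficient (Lin's definition, with population variances) and a relative gain over a baseline. The function names `concordance_cc` and `relative_gain` are hypothetical and the inputs are toy values; the authors' actual evaluation code is not given in the abstract and may differ in detail.

```python
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences.

    Standard definition: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2),
    using population (divide-by-n) variance and covariance.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = fmean(y_true)
    mean_p = fmean(y_pred)
    var_t = fmean([(t - mean_t) ** 2 for t in y_true])
    var_p = fmean([(p - mean_p) ** 2 for p in y_pred])
    cov = fmean([(t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)])
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def relative_gain(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Toy values for illustration only (not data from the paper).
labels = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated, but offset by 1
print(concordance_cc(labels, labels))   # perfect agreement
print(concordance_cc(labels, shifted))  # penalized for the offset
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between predictions and labels, which is presumably why it suits regression of emotional attributes.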
Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements
Emotion evoked by an advertisement plays a key role in influencing brand
recall and eventual consumer choices. Automatic ad affect recognition has
several useful applications. However, the use of content-based feature
representations does not give insights into how affect is modulated by aspects
such as the ad scene setting, salient object attributes and their interactions.
Neither do such approaches inform us on how humans prioritize visual
information for ad understanding. Our work addresses these lacunae by
decomposing video content into detected objects, coarse scene structure, object
statistics and actively attended objects identified via eye-gaze. We measure
the importance of each of these information channels by systematically
incorporating related information into ad affect prediction models. Contrary to
the popular notion that ad affect hinges on the narrative and the clever use of
linguistic and social cues, we find that actively attended objects and the
coarse scene structure better encode affective information as compared to
individual scene objects or conspicuous background elements.
Comment: Accepted for publication in the Proceedings of 20th ACM International
Conference on Multimodal Interaction, Boulder, CO, US
STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits
We present a novel classifier network called STEP, to classify perceived
human emotion from gaits, based on a Spatial Temporal Graph Convolutional
Network (ST-GCN) architecture. Given an RGB video of an individual walking, our
formulation implicitly exploits the gait features to classify the emotional
state of the human into one of four emotions: happy, sad, angry, or neutral. We
use hundreds of annotated real-world gait videos and augment them with
thousands of annotated synthetic gaits generated using a novel generative
network called STEP-Gen, built on an ST-GCN based Conditional Variational
Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the
CVAE formulation of STEP-Gen to generate realistic gaits and improve the
classification accuracy of STEP. We also release a novel dataset (E-Gait),
which consists of human gaits annotated with perceived emotions along
with thousands of synthetic gaits. In practice, STEP can learn the affective
features and exhibits classification accuracy of 89% on E-Gait, which is
14-30% more accurate than prior methods.