M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues
We present M3ER, a learning-based method for emotion recognition from
multiple input modalities. Our approach combines cues from multiple
co-occurring modalities (such as face, text, and speech) and is more
robust than other methods to sensor noise in any of the individual modalities.
M3ER uses a novel, data-driven multiplicative fusion method to combine the
modalities, which learns to emphasize the more reliable cues and suppress others
on a per-sample basis. By introducing a check step that uses Canonical
Correlation Analysis to differentiate between ineffective and effective
modalities, M3ER is robust to sensor noise. M3ER also generates proxy features
in place of the ineffectual modalities. We demonstrate the efficacy of our
network through experiments on two benchmark datasets, IEMOCAP and
CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on
CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.
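The per-sample weighting idea described above can be illustrated with a minimal sketch. This is not the authors' exact formulation; the feature vectors and reliability scores are hypothetical stand-ins for what M3ER would compute per sample:

```python
import numpy as np

def multiplicative_fusion(features, reliability):
    """Combine per-modality feature vectors, weighting each modality by a
    reliability score so that noisy modalities are down-weighted.
    `features`: list of 1-D arrays (one per modality, same length).
    `reliability`: one raw score per modality (hypothetical inputs; in
    M3ER these would be learned per sample)."""
    w = np.exp(reliability - np.max(reliability))
    w = w / w.sum()                      # softmax over modalities
    stacked = np.stack(features)         # (num_modalities, dim)
    return (w[:, None] * stacked).sum(axis=0)

# Hypothetical per-modality embeddings for one sample
face   = np.array([0.9, 0.1, 0.3])
text   = np.array([0.2, 0.8, 0.5])
speech = np.array([0.4, 0.4, 0.4])      # e.g. a noisy microphone channel
fused = multiplicative_fusion([face, text, speech],
                              reliability=np.array([2.0, 1.0, -1.0]))
```

Because the weights are normalized per sample, a modality flagged as unreliable (here, speech) contributes little to the fused vector without being discarded outright.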
Experience-driven formation of parts-based representations in a model of layered visual memory
Growing neuropsychological and neurophysiological evidence suggests that the
visual cortex uses parts-based representations to encode, store and retrieve
relevant objects. In such a scheme, objects are represented as a set of
spatially distributed local features, or parts, arranged in stereotypical
fashion. To encode the local appearance and to represent the relations between
the constituent parts, there has to be an appropriate memory structure formed
by previous experience with visual objects. Here, we propose a model of how a
hierarchical memory structure supporting efficient storage and rapid recall of
parts-based representations can be established by an experience-driven process
of self-organization. The process is based on the collaboration of slow
bidirectional synaptic plasticity and homeostatic unit activity regulation,
both running on top of fast activity dynamics with winner-take-all
character modulated by an oscillatory rhythm. These neural mechanisms lay down
the basis for cooperation and competition between the distributed units and
their synaptic connections. Choosing human face recognition as a test task, we
show that, under the condition of open-ended, unsupervised incremental
learning, the system is able to form memory traces for individual faces in a
parts-based fashion. On a lower memory layer the synaptic structure is
developed to represent local facial features and their interrelations, while
the identities of different persons are captured explicitly on a higher layer.
An additional property of the resulting representations is the sparseness of
both the activity during the recall and the synaptic patterns comprising the
memory traces.
Comment: 34 pages, 12 figures, 1 table; published in Frontiers in Computational Neuroscience (Special Issue on Complex Systems Science and Brain Dynamics), http://www.frontiersin.org/neuroscience/computationalneuroscience/paper/10.3389/neuro.10/015.2009
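The winner-take-all competition at the core of the fast dynamics described above can be sketched in a few lines. This is a crude stand-in, not the paper's model; the weights and inputs are hypothetical, and the oscillatory modulation is reduced to a fixed sparseness parameter `k`:

```python
import numpy as np

def winner_take_all(inputs, weights, k=1):
    """One step of fast winner-take-all dynamics: each unit computes a
    weighted sum of its inputs, and only the k strongest units remain
    active, yielding a sparse response."""
    activation = weights @ inputs
    winners = np.argsort(activation)[-k:]   # indices of the top-k units
    out = np.zeros_like(activation)
    out[winners] = activation[winners]
    return out

rng = np.random.default_rng(0)
W = rng.random((5, 8))          # 5 competing units, 8 input features
x = rng.random(8)               # hypothetical input pattern
y = winner_take_all(x, W, k=1)  # sparse response: one active unit
```

Sparse responses of this kind are what make the recalled activity and the stored synaptic patterns sparse, as the abstract notes.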
Comparator Networks
The objective of this work is set-based verification, e.g. to decide if two
sets of images of a face are of the same person or not. The traditional
approach to this problem is to learn to generate a feature vector per image,
aggregate them into one vector to represent the set, and then compute the
cosine similarity between sets. Instead, we design a neural network
architecture that can directly learn set-wise verification. Our contributions
are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of
sets (each may contain a variable number of images) as inputs, and compute a
similarity between the pair; this involves attending to multiple discriminative
local regions (landmarks), and comparing local descriptors between pairs of
faces; (ii) To encourage high-quality representations for each set, internal
competition is introduced for recalibration based on the landmark score; (iii)
Inspired by image retrieval, a novel hard sample mining regime is proposed to
control the sampling process, such that the DCN is complementary to the
standard image classification models. Evaluations on the IARPA Janus face
recognition benchmarks show that the comparator networks outperform the
previous state-of-the-art results by a large margin.
Comment: To appear in ECCV 201
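The traditional baseline the abstract contrasts against (aggregate per-image features, then compare sets by cosine similarity) is easy to sketch. The embeddings below are hypothetical; a real system would obtain them from a face-recognition network:

```python
import numpy as np

def set_similarity(set_a, set_b):
    """Baseline set-based verification: average the per-image feature
    vectors of each set into one vector, then compare the two sets
    with cosine similarity. Inputs are (num_images, dim) arrays."""
    a = set_a.mean(axis=0)
    b = set_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
person1 = rng.normal(size=(4, 16))   # hypothetical embeddings, 4 images
person2 = rng.normal(size=(3, 16))   # hypothetical embeddings, 3 images
score = set_similarity(person1, person2)
```

The DCN's point is that this mean-pooling step discards which local regions were discriminative; comparing landmark-level descriptors directly avoids that loss.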
How to do things without words
Clark and Chalmers (1998) defend the hypothesis of an "Extended Mind", maintaining that beliefs and other paradigmatic mental states can be implemented outside the central nervous system or body. Aspects of the problem of "language acquisition" are considered in the light of the extended mind hypothesis. Rather than "language" as typically understood, the object of study is something called "utterance-activity", a term of art intended to refer to the full range of kinetic and prosodic features of the on-line behaviour of interacting humans. It is argued that utterance-activity is plausibly regarded as jointly controlled by the embodied activity of interacting people, and that it contributes to the control of their behaviour. By means of specific examples it is suggested that this complex joint control facilitates easier learning of at least some features of language. This in turn suggests a striking form of the extended mind, in which infants' cognitive powers are augmented by those of the people with whom they interact.
Real-Time Purchase Prediction Using Retail Video Analytics
The proliferation of video data in retail marketing brings opportunities for researchers to study customer behavior using rich video information. Our study demonstrates how to understand multiple dimensions of customer behavior using video analytics on a scalable basis. We obtained unique video footage from in-store cameras, covering approximately 20,000 customers and over 6,000 recorded payments. We extracted features on the demographic, appearance, emotional, and contextual dimensions of customer behavior from the video with state-of-the-art computer vision techniques, and proposed a novel framework using machine learning and deep learning models to predict consumer purchase decisions. Results showed that our framework makes accurate predictions, which indicates the importance of incorporating emotional response into prediction. Our findings reveal multi-dimensional drivers of purchase decisions and provide an implementable video analytics tool for marketers. They also show the possibility of adding personalized recommendations, which would potentially integrate our framework into the omnichannel landscape.
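The prediction step described above can be reduced to a minimal sketch: a logistic score over per-customer features. The feature names and weights are hypothetical illustrations, not the paper's learned model:

```python
import numpy as np

def purchase_probability(features, weights, bias):
    """Logistic score over per-customer features extracted from video
    (e.g. demographics, appearance, emotion, context). A minimal
    stand-in for the learned models described above."""
    z = np.dot(weights, features) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical normalized features: [age, dwell_time, smile, crowding]
x = np.array([0.3, 0.7, 0.9, 0.2])
w = np.array([0.5, 1.2, 2.0, -0.8])   # hypothetical learned weights
p = purchase_probability(x, w, bias=-1.0)
```

A large positive weight on the emotion feature (here, `smile`) is one way the importance of emotional response would show up in such a model.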