Learning Temporal Alignment Uncertainty for Efficient Event Detection
In this paper we tackle the problem of efficient video event detection. We
argue that linear detection functions should be preferred in this regard due to
their scalability and efficiency during estimation and evaluation. A popular
approach is to represent a sequence using a bag of words (BOW)
representation due to its (i) fixed dimensionality irrespective of the
sequence length, and (ii) ability to compactly model the statistics in the
sequence. A drawback to the BOW representation, however, is the intrinsic
destruction of the temporal ordering information. In this paper we propose a
new representation that leverages the uncertainty in relative temporal
alignments between pairs of sequences while not destroying temporal ordering.
Our representation, like BOW, is of a fixed dimensionality, making it easy to
integrate with a linear detection function. Extensive experiments on CK+,
6DMG, and UvA-NEMO databases show significant performance improvements across
both isolated and continuous event detection tasks.
Comment: Appeared in DICTA 2015, 8 page
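The trade-off the abstract describes for BOW can be illustrated with a minimal sketch. It is not the paper's method or its proposed representation; the vocabulary, the toy "smile" sequence, and the helper name `bow_histogram` are all hypothetical. It only shows that a normalised histogram has a fixed length regardless of sequence length, and that it is unchanged when the temporal order is reversed.

```python
from collections import Counter

def bow_histogram(sequence, vocabulary):
    """Normalised bag-of-words histogram over a fixed vocabulary (illustrative helper)."""
    if not sequence:
        return [0.0] * len(vocabulary)
    counts = Counter(sequence)
    total = len(sequence)
    return [counts[word] / total for word in vocabulary]

# Hypothetical quantised frame labels for a facial event.
vocabulary = ["neutral", "onset", "apex", "offset"]
smile = ["neutral", "onset", "onset", "apex", "offset"]
reversed_smile = list(reversed(smile))   # same frames, opposite temporal order
long_smile = smile + smile               # twice as many frames

h_smile = bow_histogram(smile, vocabulary)
h_reversed = bow_histogram(reversed_smile, vocabulary)
h_long = bow_histogram(long_smile, vocabulary)

# Fixed dimensionality: always len(vocabulary), whatever the sequence length.
print(len(h_smile), len(h_long))
# Ordering is lost: the reversed sequence maps to the same vector.
print(h_smile == h_reversed)
```

Because the two orderings are indistinguishable after pooling, any linear detector applied to this vector cannot separate them, which is the drawback the abstract sets out to address.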
MicroExpNet: An Extremely Small and Fast Model For Expression Recognition From Face Images
This paper is aimed at creating extremely small and fast convolutional neural
networks (CNN) for the problem of facial expression recognition (FER) from
frontal face images. To this end, we employed the popular knowledge
distillation (KD) method and identified two major shortcomings with its use: 1)
a fine-grained grid search is needed for tuning the temperature hyperparameter
and 2) to find the optimal size-accuracy balance, one needs to search for the
final network size (or the compression rate). On the other hand, KD proved
to be useful for model compression for the FER problem, and we discovered that
its effect becomes more and more significant as the model size decreases. In
addition, we hypothesized that translation invariance achieved using
max-pooling layers would not be useful for the FER problem as the expressions
are sensitive to small, pixel-wise changes around the eye and the mouth.
However, we have found an intriguing improvement on generalization when
max-pooling is used. We conducted experiments on two widely-used FER datasets,
CK+ and Oulu-CASIA. Our smallest model (MicroExpNet), obtained using knowledge
distillation, is less than 1MB in size and works at 1851 frames per second on
an Intel i7 CPU. Despite being less accurate than the state-of-the-art,
MicroExpNet still provides significant insights for designing a
microarchitecture for the FER problem.
Comment: International Conference on Image Processing Theory, Tools and
Applications (IPTA) 2019 camera-ready version. Code is available at:
https://github.com/cuguilke/microexpne
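To make the role of the temperature hyperparameter concrete, here is a minimal sketch of a commonly used distillation objective (a softened teacher/student cross-entropy scaled by T squared, plus a hard-label term). This is an assumption about the general technique, not the loss used in MicroExpNet; the logits, the weighting `alpha`, and the temperature grid are hypothetical values chosen only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions.
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # Ordinary cross-entropy against the ground-truth label at T = 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard

# Hypothetical logits for a three-class expression problem.
teacher_logits = [3.0, 1.0, 0.2]
student_logits = [1.0, 0.5, 0.2]

# A coarse temperature grid: higher T flattens the teacher distribution.
for temperature in (1.0, 2.0, 4.0, 8.0):
    peak_prob = max(softmax(teacher_logits, temperature))
    loss = distillation_loss(student_logits, teacher_logits, label=0,
                             temperature=temperature)
    print(f"T={temperature}: teacher peak={peak_prob:.3f}, loss={loss:.3f}")
```

The loop shows why the temperature has to be searched: it controls how much of the teacher's information about non-target classes reaches the student, and the best value is not known in advance.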
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g. onset and offset phases for "smile", running and jumping for
"highjump"). The proposed model is inspired by recent work on Multiple
Instance Learning and latent SVM/HCRF -- it extends such frameworks to
approximately model the ordinal aspect of the videos. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video-based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations, and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method.
Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Facial Expression Recognition (FER) is a crucial task in affective computing,
but its conventional focus on the seven basic emotions limits its applicability
to the complex and expanding emotional spectrum. To address the issue of new
and unseen emotions present in dynamic in-the-wild FER, we propose a novel
vision-language model that utilises sample-level text descriptions (i.e.
captions of the context, expressions or emotional cues) as natural language
supervision, aiming to enhance the learning of rich latent representations for
zero-shot classification. To test this, we evaluate the model trained on
sample-level descriptions using zero-shot classification on four
popular dynamic FER datasets. Our findings show that this approach yields
significant improvements when compared to baseline methods. Specifically, for
zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted
Average Recall and 5% in terms of Unweighted Average Recall on several
datasets. Furthermore, we evaluate the representations obtained from the
network trained using sample-level descriptions on the downstream task of
mental health symptom estimation, achieving performance comparable or superior
to state-of-the-art methods and strong agreement with human experts. Namely, we
achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia
symptom severity estimation, which is comparable to human experts' agreement.
The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
Comment: 10 pages, 3 figure
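The two metrics quoted in the abstract can diverge sharply on imbalanced data. The sketch below uses the usual definitions, assumed here rather than taken from the paper: Weighted Average Recall is per-class recall weighted by class support (equivalent to overall accuracy), and Unweighted Average Recall is the plain mean of per-class recalls. The labels and the helper name `recall_metrics` are hypothetical.

```python
from collections import defaultdict

def recall_metrics(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        support[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    # Recall of each class that occurs in the ground truth.
    per_class = {label: correct[label] / support[label] for label in support}
    # UAR: every class counts equally, however rare it is.
    uar = sum(per_class.values()) / len(per_class)
    # WAR: classes weighted by support, i.e. the fraction of correct samples.
    war = sum(correct.values()) / len(y_true)
    return war, uar

# Hypothetical imbalanced test set: a model that always predicts "happy".
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 10

war, uar = recall_metrics(y_true, y_pred)
print(war, uar)  # 0.8 0.5
```

A classifier that ignores the rare class still scores well on WAR but is penalised on UAR, which is why both figures are typically reported together.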