Seeking Salient Facial Regions for Cross-Database Micro-Expression Recognition
Cross-Database Micro-Expression Recognition (CDMER) aims to develop
Micro-Expression Recognition (MER) methods with strong domain adaptability,
i.e., the ability to recognize the Micro-Expressions (MEs) of different
subjects captured by different imaging devices in different scenes. The
development of CDMER faces two key problems: 1) the severe feature
distribution gap between the source and target databases, and 2) the feature
representation bottleneck posed by MEs, which are local and subtle facial
expressions. To solve these problems, this paper proposes a novel Transfer
Group Sparse Regression method, namely TGSR, which aims to 1) optimize the
measurement of, and better alleviate, the difference between the source and
target databases, and 2) highlight the valid facial regions to enhance the
extracted features. It does so by selecting group features from the raw face
features, where each facial region is associated with one group of raw face
features, i.e., salient facial region selection. Compared with previous
transfer group sparse methods, our proposed TGSR can select the salient
facial regions, which alleviates the aforementioned problems for better
performance while also reducing the computational cost. We use two public ME
databases, i.e., CASME II and SMIC, to evaluate our proposed TGSR method.
Experimental results show that our proposed TGSR learns discriminative and
interpretable regions and outperforms most state-of-the-art
subspace-learning-based domain-adaptive methods for CDMER.
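To make the region-selection mechanism concrete, here is a minimal sketch of generic group-sparse (group-lasso) regression solved by proximal gradient descent: feature groups, one per hypothetical facial region, whose weight blocks shrink to zero together are discarded, and the surviving groups mark the salient regions. This is not the full TGSR objective, which additionally aligns the source and target databases; the group layout, hyperparameters, and toy data are illustrative only.

```python
import numpy as np

def group_sparse_regression(X, Y, groups, lam=1.0, lr=1e-3, n_iter=1000):
    """Group-lasso sketch: minimize ||XW - Y||^2 + lam * sum_g ||W_g||_2
    by proximal gradient descent. Groups whose weight blocks shrink to
    zero are pruned, i.e., their facial regions are deemed non-salient."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        W -= lr * (X.T @ (X @ W - Y))          # gradient step on the data term
        for g in groups:                        # block-wise soft-thresholding
            norm = np.linalg.norm(W[g])
            W[g] = 0.0 if norm <= lam * lr else W[g] * (1 - lam * lr / norm)
    return W

# Toy demo: 12 hypothetical facial regions with 10 raw features each;
# only regions 0 and 1 actually drive the labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 120))
B = np.zeros((120, 5))
B[:20] = rng.standard_normal((20, 5))
Y = X @ B + 0.1 * rng.standard_normal((100, 5))
groups = [np.arange(i * 10, (i + 1) * 10) for i in range(12)]
W = group_sparse_regression(X, Y, groups, lam=50.0)
salient = [i for i, g in enumerate(groups) if np.linalg.norm(W[g]) > 1e-6]
print("selected region groups:", salient)      # should recover regions 0 and 1
```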
Learning Local to Global Feature Aggregation for Speech Emotion Recognition
Transformers have recently emerged in speech emotion recognition (SER).
However, their uniform patch division not only damages frequency information
but also ignores the local emotion correlations across frames, which are key
cues for representing emotion. To handle this issue, we propose Local to
Global Feature Aggregation learning (LGFA) for SER, which aggregates
long-term emotion correlations at different scales, both within frames and
within segments, together with the entire frequency information, to enhance
the emotion discrimination of utterance-level
speech features. For this purpose, we nest a Frame Transformer inside a
Segment Transformer. First, the Frame Transformer is designed to mine local
emotion correlations between frames and produce frame embeddings. Then, the
frame embeddings and their corresponding segment features are aggregated as
complements at different levels and fed into the Segment Transformer to learn
utterance-level global emotion features. Experimental results show that the
performance of LGFA is superior to that of state-of-the-art methods.
Comment: This paper has been accepted at INTERSPEECH 202
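As a rough PyTorch illustration of the nested design, the sketch below runs a Frame Transformer over the frames inside each segment, pools its output into frame embeddings, fuses them with segment features, and feeds the result to a Segment Transformer for utterance-level classification. The layer sizes, fusion by addition, mean pooling, and the four-class head are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LGFASketch(nn.Module):
    """Frame Transformer nested inside a Segment Transformer (sketch)."""
    def __init__(self, dim=128, heads=4, n_classes=4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.frame_tf = nn.TransformerEncoder(make(), num_layers=2)
        self.segment_tf = nn.TransformerEncoder(make(), num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        # x: (batch, n_segments, n_frames, dim) frame-level features.
        b, s, f, d = x.shape
        frames = self.frame_tf(x.reshape(b * s, f, d))   # local frame correlations
        frame_emb = frames.mean(dim=1).reshape(b, s, d)  # pooled frame embeddings
        seg_feat = x.mean(dim=2)                         # segment-level features
        fused = self.segment_tf(frame_emb + seg_feat)    # global aggregation
        return self.head(fused.mean(dim=1))              # utterance-level logits

logits = LGFASketch()(torch.randn(2, 6, 10, 128))
print(logits.shape)  # torch.Size([2, 4])
```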
SDFE-LV: A Large-Scale, Multi-Source, and Unconstrained Database for Spotting Dynamic Facial Expressions in Long Videos
In this paper, we present a large-scale, multi-source, and unconstrained
database called SDFE-LV for spotting the onset and offset frames of a complete
dynamic facial expression from long videos, a task known as dynamic facial
expression spotting (DFES) that serves as a vital preliminary step for many
facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long
facial expression analysis tasks. Specifically, SDFE-LV consists of 1,191 long
videos, each of which contains one or more complete dynamic facial expressions.
Moreover, each complete dynamic facial expression in its corresponding long
video was independently labeled five times by 10 well-trained annotators.
To the best of our knowledge, SDFE-LV is the first unconstrained large-scale
database for the DFES task whose long videos are collected from multiple
real-world or near-real-world media sources, e.g., TV interviews,
documentaries, movies, and we-media short videos. DFES on the SDFE-LV
database will therefore encounter numerous practical difficulties, such as
head pose changes, occlusions, and illumination variations. We also provide a
comprehensive benchmark evaluation from different angles using many recent
state-of-the-art deep spotting methods, so that researchers interested in
DFES can quickly and easily get started. Finally, through in-depth discussion
of the experimental evaluation results, we point out several meaningful
directions for dealing with DFES tasks and hope that DFES can be further
advanced in the future. In addition, SDFE-LV will be freely released for
academic use only as soon as possible.
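The abstract does not state the benchmark metric; a common convention in expression-spotting evaluations (assumed here, not taken from the paper) counts a predicted (onset, offset) interval as a true positive when its IoU with a ground-truth interval reaches 0.5, and reports F1 over the matches. A minimal sketch under that assumption:

```python
def interval_iou(pred, gt):
    """IoU of two (onset, offset) frame intervals, endpoints inclusive."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union

def spotting_f1(preds, gts, thresh=0.5):
    """Greedy one-to-one matching of predictions to ground-truth intervals."""
    matched, used = 0, set()
    for gt in gts:
        for i, p in enumerate(preds):
            if i not in used and interval_iou(p, gt) >= thresh:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if matched else 0.0

# One spotted expression matches; one is missed -> P = R = F1 = 0.5.
print(spotting_f1(preds=[(10, 40), (80, 95)], gts=[(12, 38), (200, 230)]))
```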
Super-Resolution by Predicting Offsets: An Ultra-Efficient Super-Resolution Network for Rasterized Images
Rendering high-resolution (HR) graphics brings substantial computational
costs. Efficient graphics super-resolution (SR) methods may achieve HR
rendering with limited computing resources and have attracted extensive
research interest from industry and the research community. We present a new method for
real-time SR for computer graphics, namely Super-Resolution by Predicting
Offsets (SRPO). Our algorithm divides the image into two parts for
processing, i.e., sharp edges and flat areas. For edges, unlike previous SR
methods that take anti-aliased images as inputs, our proposed SRPO exploits
the characteristics of rasterized images and conducts SR directly on them. To
compensate for the residual between the HR and low-resolution (LR) rasterized
images, we train an ultra-efficient network to predict offset maps that move
the appropriate surrounding pixels to new positions. For flat areas, we find
that simple interpolation methods can already generate reasonable output.
Finally, we use a guided fusion operation to integrate the sharp edges
generated by the network with the flat areas produced by interpolation to
obtain the final SR image. The proposed network contains only 8,434
parameters and can be accelerated by network quantization. Extensive
experiments show that the proposed SRPO achieves superior visual effects at a
smaller computational cost than existing state-of-the-art methods.
Comment: This article has been accepted by ECCV202
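As a rough sketch of the offset-prediction idea, the snippet below uses a tiny convolutional network to predict per-pixel (dx, dy) offsets and warps the upsampled image with grid_sample, keeping plain interpolation in flat areas. The network size, the tanh offset bound, the gradient-based edge mask, and the fusion rule are all stand-in assumptions, not SRPO's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSRSketch(nn.Module):
    """SR by predicting offsets (sketch): sharpen edges by moving nearby
    pixels to new positions instead of hallucinating new pixel values."""
    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(                  # deliberately tiny predictor
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1))        # per-pixel (dx, dy) offsets

    def forward(self, lr):
        # Flat areas: plain upsampling is already a reasonable output.
        up = F.interpolate(lr, scale_factor=self.scale, mode='bilinear',
                           align_corners=True)
        b, _, h, w = up.shape
        offset = torch.tanh(self.net(up)).permute(0, 2, 3, 1)  # bounded +/-1 px
        ys = torch.linspace(-1, 1, h, device=up.device)
        xs = torch.linspace(-1, 1, w, device=up.device)
        gy, gx = torch.meshgrid(ys, xs, indexing='ij')
        base = torch.stack((gx, gy), dim=-1).expand(b, h, w, 2)
        norm = torch.tensor([2.0 / (w - 1), 2.0 / (h - 1)], device=up.device)
        warped = F.grid_sample(up, base + offset * norm, mode='nearest',
                               padding_mode='border', align_corners=True)
        # Crude guided fusion: warped pixels at edges, interpolation elsewhere.
        edge = (up - F.avg_pool2d(up, 3, 1, 1)).abs().mean(1, keepdim=True)
        mask = (edge > 0.05).float()
        return mask * warped + (1 - mask) * up

sr = OffsetSRSketch()(torch.randn(1, 3, 32, 32))
print(sr.shape)  # torch.Size([1, 3, 64, 64])
```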
Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
In this paper, we propose a new unsupervised domain adaptation (DA) method
called layer-adapted implicit distribution alignment networks (LIDAN) to
address the challenge of cross-corpus speech emotion recognition (SER). LIDAN
extends our previous ICASSP work, deep implicit distribution alignment networks
(DIDAN), whose key contribution lies in the introduction of a novel
regularization term called implicit distribution alignment (IDA). This term
allows DIDAN trained on source (training) speech samples to remain applicable
to predicting emotion labels for target (testing) speech samples, regardless of
corpus variance in cross-corpus SER. To further enhance this method, we extend
IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adapted
extension consists of three modified IDA terms that consider emotion labels
at different levels of granularity. These terms are strategically arranged
within different fully connected layers in LIDAN, matching the increase in
emotion-discriminative ability with layer depth. This
arrangement enables LIDAN to more effectively learn emotion-discriminative and
corpus-invariant features for SER across various corpora compared to DIDAN. It
is also worth mentioning that, unlike most existing methods that rely on
estimating statistical moments to describe pre-assumed explicit
distributions, both IDA and LIDA take a different approach: they utilize the
idea of target sample reconstruction to directly bridge the feature
distribution gap without making assumptions about the distribution type. As a
result, DIDAN and LIDAN
can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we
conducted extensive cross-corpus SER experiments on EmoDB, eNTERFACE, and CASIA
corpora. The experimental results demonstrate that LIDAN surpasses recent
state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER
tasks.
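To make the target-sample-reconstruction idea concrete, here is a hypothetical sketch of an IDA-style regularizer: each target feature is reconstructed as a ridge-regularized linear combination of source features, and the reconstruction error is added to the emotion classification loss. The closed-form ridge solution, the toy two-layer backbone, and the loss weight are assumptions; the paper's exact IDA and LIDA terms may differ.

```python
import torch
import torch.nn as nn

def ida_loss(src_feat, tgt_feat, l2=1e-2):
    """Implicit-alignment sketch: reconstruct target features from source
    features (ridge regression) and penalize the reconstruction error.
    No statistical moments or assumed distribution family are involved."""
    S, T = src_feat, tgt_feat                       # (n_s, d), (n_t, d)
    gram = S @ S.t() + l2 * torch.eye(S.size(0), device=S.device)
    C = torch.linalg.solve(gram, S @ T.t()).t()     # (n_t, n_s) coefficients
    return ((T - C @ S) ** 2).mean()

class EmotionNet(nn.Module):
    """Toy two-layer backbone; a layer-adapted variant would attach an
    IDA-style term at several depths (per-layer granularity is assumed)."""
    def __init__(self, d_in=64, d_hid=32, n_cls=4):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hid), nn.Linear(d_hid, n_cls)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return h, self.fc2(h)

net = EmotionNet()
xs, ys = torch.randn(16, 64), torch.randint(0, 4, (16,))  # labeled source corpus
xt = torch.randn(16, 64)                                  # unlabeled target corpus
hs, logits = net(xs)
ht, _ = net(xt)
loss = nn.functional.cross_entropy(logits, ys) + 0.1 * ida_loss(hs, ht)
loss.backward()
```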