ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning
In this paper, we introduce ARBEx, a novel attentive feature extraction framework driven by a Vision Transformer with reliability balancing, designed to cope with poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We reinforce several data pre-processing and refinement methods along with a window-based cross-attention ViT to get the most out of the data. We also employ learnable anchor points in the embedding space with label distributions and a multi-head self-attention mechanism to optimize performance against weak predictions via reliability balancing, a strategy that leverages anchor points, attention scores, and confidence values to enhance the resilience of label predictions. To ensure correct label classification and improve the model's discriminative power, we introduce an anchor loss that encourages large margins between anchor points. Additionally, the multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. This approach provides critical elements for improving the reliability of predictions and has a substantial positive effect on final prediction performance. Our adaptive model can be integrated with any deep neural network to address challenges in various recognition tasks. Extensive experiments conducted in a variety of contexts show that our strategy outperforms current state-of-the-art methodologies.
Comment: 10 pages, 7 figures. Code: https://github.com/takihasan/ARBE
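As a concrete illustration of the anchor-point idea described in this abstract, here is a minimal PyTorch sketch of a margin-based anchor loss: learnable per-class anchors in the embedding space, a pull term toward the labeled class's anchor, and a hinge that pushes distinct anchors apart. The class name, initialization, and exact formulation are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorLoss(nn.Module):
    """Sketch of a margin-based anchor loss: learnable per-class anchor
    points, a pull term toward the labeled class's anchor, and a hinge
    that keeps distinct anchors at least `margin` apart."""

    def __init__(self, num_classes: int, embed_dim: int, margin: float = 10.0):
        super().__init__()
        # One learnable anchor per expression class (hypothetical init).
        self.anchors = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.margin = margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Pull each embedding toward the anchor of its labeled class.
        pull = F.mse_loss(embeddings, self.anchors[labels])
        # Push distinct anchors apart: hinge on all off-diagonal pairwise distances.
        dists = torch.cdist(self.anchors, self.anchors)            # (C, C)
        off_diag = ~torch.eye(dists.size(0), dtype=torch.bool, device=dists.device)
        push = F.relu(self.margin - dists[off_diag]).mean()
        return pull + push
```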
Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining
Humans usually convey emotions, voluntarily or involuntarily, through facial expressions. Automatically recognizing the basic expression (such as happiness, sadness, or neutral) from a facial image, i.e., facial expression recognition (FER), is extremely challenging and attracts much research interest. Large-scale datasets and powerful inference models have been proposed to address the problem. Though considerable progress has been made, most state-of-the-art methods, which employ convolutional neural networks (CNNs) or elaborately modified Vision Transformers (ViTs), depend heavily on upstream supervised pretraining. Transformers are displacing CNNs in more and more computer vision tasks, but they usually need much more data to train, since they have fewer inductive biases than CNNs. To explore whether a vanilla ViT without extra training samples from upstream tasks can achieve competitive accuracy, we use a plain ViT with MAE pretraining to perform the FER task. Specifically, we first pretrain the original ViT as a Masked Autoencoder (MAE) on a large facial expression dataset without expression labels. Then, we fine-tune the ViT on popular facial expression datasets with expression labels. The presented method is quite competitive, with 90.22% on RAF-DB and 61.73% on AffectNet, and can serve as a simple yet strong ViT-based baseline for FER studies.
Comment: 3 pages
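To make the two-stage recipe concrete, below is a minimal PyTorch sketch of MAE-style random patch masking, the core operation of the pretraining stage; the encoder sees only the visible tokens while a light decoder reconstructs the masked patches. This is a simplified illustration under assumed tensor shapes, not the reference MAE implementation.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking: keep a random subset of patch tokens.
    patches: (B, N, D) sequence of embedded image patches."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)   # random score per token
    ids_keep = noise.argsort(dim=1)[:, :num_keep]     # indices of visible tokens
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

# Stage 1 (pretraining): encode only `visible`, reconstruct the masked
# patches with a lightweight decoder, and minimize a pixel reconstruction
# loss on an unlabeled facial expression dataset.
# Stage 2 (fine-tuning): discard the decoder, attach a linear classification
# head to the ViT encoder, and train on labeled FER data (RAF-DB, AffectNet).
```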
ReSup: Reliable Label Noise Suppression for Facial Expression Recognition
Because of the ambiguous and subjective nature of the facial expression recognition (FER) task, label noise is widespread in FER datasets. To handle this problem, current FER methods often directly predict during training whether the label of an input image is noisy, aiming to reduce the contribution of noisy data. However, we argue that this kind of method suffers from the low reliability of the noise decision itself: clean data that is mistakenly discarded is not utilized sufficiently, while noisy data that is mistakenly kept disturbs the learning process. In this paper, we propose a more reliable noise-label suppression method called ReSup (Reliable label noise Suppression for FER). First, instead of directly predicting noisy or not, ReSup makes the noise decision by modeling the distributions of noisy and clean labels simultaneously, according to the disagreement between the prediction and the target. Specifically, to achieve optimal distribution modeling, ReSup models the similarity distribution of all samples. To further enhance the reliability of the noise decisions, ReSup uses two networks to jointly achieve noise suppression. Specifically, ReSup exploits the property that two networks are less likely to make the same mistakes, having the networks swap decisions and tending to trust decisions with high agreement. Extensive experiments on three popular benchmarks show that the proposed method significantly outperforms state-of-the-art noisy-label FER methods by 3.01% on the FERPlus benchmark.
Code: https://github.com/purpleleaves007/FERDenois
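The decision-swapping idea resembles co-teaching-style noise handling. Here is a hedged PyTorch sketch of that flavor: each network's per-sample loss is weighted by the other network's clean estimate, scaled by how much the two decisions agree. The `clean_weight` scoring and the weighting scheme are stand-in assumptions, not ReSup's actual similarity-distribution model.

```python
import torch
import torch.nn.functional as F

def clean_weight(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Agreement between prediction and target, used as a per-sample "clean"
    # score; a crude stand-in for ReSup's similarity-distribution modeling.
    return F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)

def coupled_noise_suppression(logits_a, logits_b, targets):
    """Two networks swap noise decisions: each network's per-sample loss is
    weighted by the OTHER network's clean score, scaled by how much the two
    decisions agree. The weighting scheme here is illustrative only."""
    w_a, w_b = clean_weight(logits_a, targets), clean_weight(logits_b, targets)
    agreement = 1.0 - (w_a - w_b).abs()               # high when decisions match
    loss_a = F.cross_entropy(logits_a, targets, reduction="none")
    loss_b = F.cross_entropy(logits_b, targets, reduction="none")
    return (w_b * agreement * loss_a).mean() + (w_a * agreement * loss_b).mean()
```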
Leave No Stone Unturned: Mine Extra Knowledge for Imbalanced Facial Expression Recognition
Facial expression data is characterized by a significant imbalance, with most
collected data showing happy or neutral expressions and fewer instances of fear
or disgust. This imbalance poses challenges to facial expression recognition
(FER) models, hindering their ability to fully understand various human
emotional states. Existing FER methods typically report overall accuracy on
highly imbalanced test sets but exhibit low performance in terms of the mean
accuracy across all expression classes. In this paper, our aim is to address
the imbalanced FER problem. Existing methods primarily focus on learning
knowledge of minor classes solely from minor-class samples. However, we propose
a novel approach to extract extra knowledge related to the minor classes from
both major and minor class samples. Our motivation stems from the belief that
FER resembles a distribution learning task, wherein a sample may contain
information about multiple classes. For instance, a sample from the major class
surprise might also contain useful features of the minor class fear. Inspired
by that, we propose a novel method that leverages re-balanced attention maps to
regularize the model, enabling it to extract transformation-invariant
information about the minor classes from all training samples. Additionally, we
introduce re-balanced smooth labels to regulate the cross-entropy loss, guiding
the model to pay more attention to the minor classes by utilizing the extra
information regarding the label distribution of the imbalanced training data.
Extensive experiments on different datasets and backbones show that the two
proposed modules work together to regularize the model and achieve
state-of-the-art performance under the imbalanced FER task. Code is available
at https://github.com/zyh-uaiaaaa.
Comment: Accepted by NeurIPS202
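As a concrete sketch of the re-balanced smooth labels idea, the snippet below spreads the label-smoothing mass in inverse proportion to class frequency, so minor classes receive a larger share of the soft target. Function names and the exact weighting are illustrative assumptions; the paper's formulation may differ.

```python
import torch
import torch.nn.functional as F

def rebalanced_smooth_labels(targets: torch.Tensor,
                             class_counts: torch.Tensor,
                             smoothing: float = 0.1) -> torch.Tensor:
    """Spread the smoothing mass in inverse proportion to class frequency,
    so minor classes receive a larger share of the soft target."""
    inv_freq = 1.0 / class_counts.float()
    inv_freq = inv_freq / inv_freq.sum()              # normalize to a distribution
    one_hot = F.one_hot(targets, num_classes=len(class_counts)).float()
    return (1.0 - smoothing) * one_hot + smoothing * inv_freq

# Usage with a standard soft-target cross-entropy:
#   soft = rebalanced_smooth_labels(y, counts)
#   loss = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```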
Robust Facial Expression Recognition with Convolutional Visual Transformers
Facial Expression Recognition (FER) in the wild is extremely challenging due to occlusions, varying head poses, face deformation, and motion blur under unconstrained conditions. Although substantial progress has been made in automatic FER in the past few decades, previous studies were mainly designed for lab-controlled FER. Real-world occlusions, varying head poses, and other issues increase the difficulty of FER on account of information-deficient regions and complex backgrounds. Unlike previous pure CNN-based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective. Therefore, we propose Convolutional Visual Transformers to tackle FER in the wild in two main steps. First, we propose an attentional selective fusion (ASF) module to leverage the feature maps generated by two-branch CNNs. The ASF captures discriminative information by fusing multiple features with global-local attention. The fused feature maps are then flattened and projected into sequences of visual words. Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention. The proposed method is evaluated on three public in-the-wild facial expression datasets (RAF-DB, FERPlus, and AffectNet). Under the same settings, extensive experiments demonstrate that our method shows superior performance over other methods, setting a new state of the art on RAF-DB with 88.14%, FERPlus with 88.81%, and AffectNet with 61.85%. We also conduct cross-dataset evaluation on CK+ to show the generalization capability of the proposed method.
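The CNN-to-visual-words pipeline can be sketched compactly in PyTorch: fuse two branch feature maps with a learned gate (a simplified stand-in for ASF, not its actual global-local attention), flatten the fused map into a token sequence, and model it with self-attention. Channel sizes, depths, and the gating form are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchFusionTransformer(nn.Module):
    """Sketch: gate-fuse two CNN branch feature maps, flatten to a sequence
    of 'visual words', then apply global self-attention over the tokens."""

    def __init__(self, channels: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        # Simplified fusion gate standing in for attentional selective fusion.
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_a, feat_b], dim=1))   # (B, C, H, W) gate
        fused = g * feat_a + (1 - g) * feat_b               # fused feature map
        tokens = fused.flatten(2).transpose(1, 2)           # (B, H*W, C) visual words
        return self.encoder(tokens)                         # global self-attention
```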
Learn From All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition
Noisy-label Facial Expression Recognition (FER) is more challenging than traditional noisy-label classification tasks due to inter-class similarity and annotation ambiguity. Recent works mainly tackle this problem by filtering out large-loss samples. In this paper, we explore dealing with noisy labels from a new feature-learning perspective. We find that FER models memorize noisy samples by focusing on a subset of features that can be considered related to the noisy labels, instead of learning from the whole feature set that leads to the latent truth. Inspired by this, we propose a novel Erasing Attention Consistency (EAC) method to automatically suppress noisy samples during training. Specifically, we first utilize the flip semantic consistency of facial images to design an imbalanced framework. We then randomly erase input images and use flip attention consistency to prevent the model from focusing on only a part of the features. EAC significantly outperforms state-of-the-art noisy-label FER methods and generalizes well to other tasks with large numbers of classes, such as CIFAR-100 and Tiny-ImageNet. The code is available at https://github.com/zyh-uaiaaaa/Erasing-Attention-Consistency.
Comment: ECCV202
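The flip-consistency part of EAC reduces to a simple loss: the attention map of a horizontally flipped face should be the mirror of the original attention map, and samples memorized through spurious part-features tend to violate this. A minimal PyTorch sketch of that term, under assumed map shapes and an MSE penalty rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def flip_attention_consistency(attn: torch.Tensor,
                               attn_flipped: torch.Tensor) -> torch.Tensor:
    """Consistency loss between the attention map of the original image and
    the un-flipped attention map of its horizontal mirror.
    attn, attn_flipped: (B, H, W) attention maps from the two views."""
    return F.mse_loss(attn, torch.flip(attn_flipped, dims=[-1]))
```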