Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition
Facial Expression Recognition (FER) in the wild is an extremely challenging
task. Recently, some Vision Transformers (ViT) have been explored for FER, but
most of them perform worse than Convolutional Neural Networks (CNNs). This is
mainly because the newly proposed modules, lacking inductive bias, are hard to
train to convergence from scratch and tend to focus on occluded and noisy
areas. TransFER, a representative transformer-based method for FER, alleviates
this with multi-branch attention dropping but incurs excessive computation. In
contrast, we present two attentive pooling (AP) modules that pool noisy
features directly. The AP modules comprise Attentive Patch Pooling (APP) and
Attentive Token Pooling (ATP). They guide the model to emphasize the most
discriminative features while reducing the impact of less relevant ones. APP
is employed to select the most informative patches from CNN features, while
ATP discards unimportant tokens in the ViT. Simple to implement and free of
learnable parameters, APP and ATP reduce the computational cost while boosting
performance by pursuing only the most discriminative features. Qualitative
results demonstrate the motivation and effectiveness of our attentive pooling
modules, and quantitative results on six in-the-wild datasets surpass other
state-of-the-art methods.
Comment: Codes will be public on https://github.com/youqingxiaozhua/APVi
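The abstract does not detail how the attention scores are computed, but the parameter-free token selection behind ATP can be illustrated compactly. Below is a minimal, hypothetical PyTorch sketch that assumes the class token's attention weights over the patch tokens are available as a ranking signal; the function name and keep_ratio are illustrative placeholders, not the authors' implementation.

```python
import torch

def attentive_token_pooling(tokens, cls_attn, keep_ratio=0.5):
    """Parameter-free pooling: keep only the tokens that receive the
    highest class-token attention, discarding the rest.
    (Illustrative sketch, not the paper's implementation.)

    tokens:   (B, N, C) patch tokens from a ViT block
    cls_attn: (B, N) attention of the class token over the N patches
    """
    B, N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    # Indices of the k most-attended (most discriminative) tokens.
    idx = cls_attn.topk(k, dim=1).indices          # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, C)      # (B, k, C)
    return tokens.gather(1, idx)                   # (B, k, C)

# Pool 196 patch tokens down to 98 before the next transformer block.
tokens = torch.randn(2, 196, 768)
cls_attn = torch.rand(2, 196)
print(attentive_token_pooling(tokens, cls_attn).shape)  # (2, 98, 768)
```

Dropping tokens this way shortens the sequence for every subsequent block, which is where the computational saving claimed in the abstract would come from.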
NCL++: Nested Collaborative Learning for Long-Tailed Visual Recognition
Long-tailed visual recognition has received increasing attention in recent
years. Due to the extremely imbalanced data distribution in long-tailed
learning, the learning process exhibits great uncertainty. For example, the
predictions of different experts on the same image can vary remarkably despite
identical training settings. To alleviate this uncertainty, we propose Nested
Collaborative Learning (NCL++), which tackles the long-tailed learning problem
through collaborative learning. Specifically, the collaborative learning
consists of two parts: inter-expert collaborative learning (InterCL) and
intra-expert collaborative learning (IntraCL). InterCL trains multiple experts
collaboratively and concurrently, aiming to transfer knowledge among the
different experts. IntraCL is similar to InterCL, but conducts the
collaborative learning on multiple augmented copies of the same image within a
single expert. To achieve this collaborative learning in the long-tailed
setting, we propose balanced online distillation, which enforces consistent
predictions among the different experts and augmented copies and thereby
reduces learning uncertainty. Moreover, to improve fine-grained discrimination
among confusing categories, we further propose Hard Category Mining (HCM),
which selects the negative categories with high predicted scores as hard
categories. The collaborative learning is then formulated in a nested way:
learning is conducted not only over all categories from a full perspective,
but also over the hard categories from a partial perspective. Extensive
experiments demonstrate the superiority of our method, which outperforms the
state of the art with both a single model and an ensemble. The code will be
publicly released.
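The abstract leaves the loss functions unspecified, but its two ingredients, online distillation between experts and Hard Category Mining, admit a compact sketch. The following PyTorch snippet is a hypothetical interpretation, not the paper's code: the class-balancing of the distillation term is omitted, and the names, temperature T, and num_hard are assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical interpretation of NCL++'s ingredients, not the paper's code.

def hard_category_mining(logits, target, num_hard=32):
    """Select the negative categories with the highest predicted scores
    (the hard categories), plus the ground-truth category."""
    masked = logits.clone()
    masked.scatter_(1, target.unsqueeze(1), float('-inf'))   # drop the GT class
    hard = masked.topk(num_hard, dim=1).indices              # (B, num_hard)
    return torch.cat([target.unsqueeze(1), hard], dim=1)     # (B, 1 + num_hard)

def distill_kl(student_logits, teacher_logits, T=2.0):
    """Online distillation: pull one expert's predictions toward another
    expert's (detached) predictions to reduce uncertainty."""
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * T * T

def nested_loss(logits_a, logits_b, target, num_hard=32):
    """Nested formulation: consistency over all categories (full view)
    plus consistency over the mined hard categories (partial view)."""
    full = distill_kl(logits_a, logits_b)
    idx = hard_category_mining(logits_a, target, num_hard)
    partial = distill_kl(logits_a.gather(1, idx), logits_b.gather(1, idx))
    return full + partial
```

The same two terms would apply between augmented copies within one expert (IntraCL) as between experts (InterCL), which is what makes the formulation collaborative at both levels.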
Rethinking the Open-Loop Evaluation of End-to-End Autonomous Driving in nuScenes
Modern autonomous driving systems are typically divided into three main
tasks: perception, prediction, and planning. The planning task involves
predicting the trajectory of the ego vehicle based on inputs from both internal
intention and the external environment, and manipulating the vehicle
accordingly. Most existing works evaluate their performance on the nuScenes
dataset using the L2 error and collision rate between the predicted
trajectories and the ground truth. In this paper, we reevaluate these existing
evaluation metrics and explore whether they accurately measure the superiority
of different methods. Specifically, we design an MLP-based method that takes
raw sensor data (e.g., past trajectory, velocity, etc.) as input and directly
outputs the future trajectory of the ego vehicle, without using any perception
or prediction information such as camera images or LiDAR. Our simple method
achieves end-to-end planning performance on the nuScenes dataset comparable to
that of perception-based methods, even reducing the average L2 error by about
20%. Meanwhile, the perception-based methods retain an advantage in terms of
collision rate. We further conduct an in-depth analysis and provide new
insights into the factors that are critical for success on the nuScenes
planning task. Our observations also indicate that the current open-loop
evaluation scheme for end-to-end autonomous driving in nuScenes needs to be
rethought. Codes are available at https://github.com/E2E-AD/AD-MLP.
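Since the paper's point is how far a perception-free baseline can go, a toy version of such a planner is easy to write down. The module below is a hypothetical PyTorch sketch rather than the released AD-MLP code; the state dimension, horizon, and hidden width are made-up placeholders.

```python
import torch
import torch.nn as nn

class EgoMLPPlanner(nn.Module):
    """Hypothetical MLP planner in the spirit of the paper's baseline:
    maps ego states (past trajectory, velocity, etc.) directly to a
    future trajectory, using no camera or LiDAR input at all."""

    def __init__(self, state_dim=21, horizon=6, hidden=512):
        super().__init__()
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * 2),   # (x, y) per future step
        )

    def forward(self, ego_state):             # ego_state: (B, state_dim)
        return self.mlp(ego_state).view(-1, self.horizon, 2)

planner = EgoMLPPlanner()
print(planner(torch.randn(4, 21)).shape)  # (4, 6, 2) future waypoints
```

Trained with an L2 loss against logged ego trajectories, a model of roughly this shape is all the abstract requires to match perception-based planners under the open-loop metrics, which is exactly the paper's cautionary point.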
Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation
In this paper, we study the problem of end-to-end multi-person pose
estimation. State-of-the-art solutions adopt a DETR-like framework and mainly
develop complex decoders, e.g., regarding pose estimation as keypoint box
detection and combining it with human detection in ED-Pose, or hierarchically
predicting with a pose decoder and a joint (keypoint) decoder in PETR. We
present a
simple yet effective transformer approach, named Group Pose. We simply regard
K-keypoint pose estimation as predicting a set of keypoint positions, each
from a keypoint query, while representing each pose with an instance query for
scoring the pose predictions. Motivated by the intuition that the interaction
among across-instance queries of different types is not directly helpful, we
make a simple modification to the decoder self-attention: we replace the
single self-attention over all queries with two subsequent group
self-attentions, (i) within-instance self-attention, each over the keypoint
queries and the one instance query of a single instance, and (ii) same-type
across-instance self-attention, each over queries of the same type. The
resulting decoder removes the interactions among across-instance queries of
different types, easing optimization and thus improving performance.
Experimental results on MS COCO and CrowdPose show that our approach, without
human box supervision, is superior to previous methods with complex decoders,
and is even slightly better than ED-Pose, which uses human box supervision.
Code is available.
Comment: Accepted by ICCV 2023
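The decoder change, restricting self-attention to within-instance and same-type groups, is naturally expressed as attention masks. The sketch below assumes the N * (K + 1) queries are laid out instance by instance, K keypoint queries followed by one instance query each; it illustrates the masking idea only and is not the authors' implementation.

```python
import torch

def group_attention_masks(num_instances, num_keypoints):
    """Boolean masks (True = attention blocked) for the two group
    self-attentions over N * (K + 1) queries.
    (Illustrative layout assumption, not the released code.)"""
    N, K = num_instances, num_keypoints
    q = torch.arange(N * (K + 1))
    inst = q // (K + 1)    # which instance each query belongs to
    qtype = q % (K + 1)    # query type: 0..K-1 keypoints, K = instance query

    # (i) within-instance: only queries of the same instance interact.
    within_mask = inst[:, None] != inst[None, :]
    # (ii) same-type across-instance: only same-type queries interact.
    same_type_mask = qtype[:, None] != qtype[None, :]
    return within_mask, same_type_mask

within, same_type = group_attention_masks(num_instances=2, num_keypoints=3)
print(within.shape)  # torch.Size([8, 8]) for 2 * (3 + 1) queries
```

Either mask can then be passed as the attn_mask of a standard self-attention layer, so the two group self-attentions reuse the ordinary attention implementation unchanged.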
Defending Black-box Skeleton-based Human Activity Classifiers
Deep learning has been regarded as the 'go-to' solution for many tasks today,
but its intrinsic vulnerability to malicious attacks has become a major
concern. The vulnerability is affected by a variety of factors including
models, tasks, data, and attackers. Consequently, methods such as Adversarial
Training and Randomized Smoothing have been proposed to tackle the problem in a
wide range of applications. In this paper, we investigate skeleton-based Human
Activity Recognition, which operates on an important type of time-series data
but is under-explored in defense against attacks. Our method features (1) a new
Bayesian Energy-based formulation of robust discriminative classifiers, (2) a
new parameterization of the adversarial sample manifold of actions, and (3) a
new post-train Bayesian treatment on both the adversarial samples and the
classifier. We name our framework Bayesian Energy-based Adversarial Training or
BEAT. BEAT is straightforward yet elegant: it turns vulnerable black-box
classifiers into robust ones without sacrificing accuracy. It demonstrates
surprising and universal effectiveness across a wide range of action
classifiers and datasets, under various attacks.
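The abstract does not give BEAT's exact formulation, but energy-based formulations of discriminative classifiers typically build on the same standard reading, which the hypothetical sketch below states in PyTorch: the energy of a labeled sample is the negative logit, so the softmax over logits recovers p(y|x), and the marginal (free) energy is the negative log-sum-exp of the logits. This is a generic illustration of that reading, not BEAT itself.

```python
import torch

# Generic energy-based reading of a classifier, not BEAT's formulation.

def joint_energy(logits, y):
    """E(x, y) = -logit_y(x); softmax over logits then recovers p(y|x)."""
    return -logits.gather(1, y.unsqueeze(1)).squeeze(1)

def free_energy(logits):
    """E(x) = -log sum_y exp(logit_y(x)): low for samples the classifier
    finds natural, high for out-of-manifold (e.g., adversarial) inputs."""
    return -torch.logsumexp(logits, dim=1)

logits = torch.randn(4, 60)               # e.g., 60 action classes
y = torch.randint(0, 60, (4,))
print(joint_energy(logits, y).shape, free_energy(logits).shape)
```

Under this reading, adversarial samples show up as high-energy, low-probability points, which is plausibly the kind of structure the paper's adversarial-manifold parameterization and post-train Bayesian treatment exploit.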