Encoding Feature Maps of CNNs for Action Recognition
CVPR International Workshop and Competition on Action Recognition with a Large Number of Classes
We describe our approach for action classification in the THUMOS Challenge 2015. Our approach is based on two types of features: improved dense trajectories and CNN features. For trajectory features, we extract HOG, HOF, MBHx, and MBHy descriptors and apply Fisher vector encoding. For CNN features, we utilize a recent deep CNN model, VGG19, to capture appearance features and use VLAD encoding to encode/pool the convolutional feature maps, which performs better than average pooling of feature maps and fully-connected activation features. After concatenating them, we train a linear SVM classifier for each class in a one-vs-all scheme.
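To make the VLAD pooling step concrete, the following is a minimal sketch, assuming convolutional feature maps have already been extracted (e.g., from a VGG19 conv layer); the codebook size, the use of k-means, and the normalization steps are illustrative assumptions, not the authors' exact settings.

```python
# A minimal sketch of VLAD encoding over convolutional feature maps.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, centers):
    """Encode local descriptors (N x D) against a codebook (K x D) with VLAD."""
    K, D = centers.shape
    # Hard-assign each descriptor to its nearest codebook center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assignments = dists.argmin(axis=1)
    vlad = np.zeros((K, D))
    for k in range(K):
        if np.any(assignments == k):
            # Accumulate residuals between descriptors and their assigned center.
            vlad[k] = (descriptors[assignments == k] - centers[k]).sum(axis=0)
    # Signed square-root (power) and global L2 normalization, as commonly used.
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Usage: treat each spatial location of a conv feature map as one local descriptor.
feat_map = np.random.rand(512, 14, 14)            # C x H x W stand-in for a VGG19 conv map
descs = feat_map.reshape(512, -1).T               # (H*W) x C local descriptors
centers = KMeans(n_clusters=64, n_init=10).fit(descs).cluster_centers_
code = vlad_encode(descs, centers)                # fixed-length representation
```

A linear SVM in a one-vs-all scheme would then be trained on the concatenation of such a code with the Fisher-vector trajectory features.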
Self-Ensembling for 3D Point Cloud Domain Adaptation
Recently, 3D point cloud learning has been a hot topic in computer vision and
autonomous driving. Because it is difficult to manually annotate a high-quality,
large-scale 3D point cloud dataset, unsupervised domain adaptation (UDA), which
aims to transfer knowledge learned on a labeled source domain to an unlabeled
target domain, is popular in 3D point cloud learning.
However, with a naively trained model, the generalization and reconstruction
errors caused by domain shift are inevitable and substantially hinder the model
from learning good representations. To address these issues, we propose an
end-to-end self-ensembling network (SEN) for 3D point cloud domain adaptation
tasks. Our SEN draws on the advantages of Mean Teacher
and semi-supervised learning, and introduces a soft classification loss and a
consistency loss, aiming to achieve consistent generalization and accurate
reconstruction. In SEN, a student network is trained collaboratively with
supervised and self-supervised learning, while a teacher network enforces
temporal consistency to learn useful representations and ensure the quality of
point cloud reconstruction. Extensive experiments on several 3D
point cloud UDA benchmarks show that our SEN outperforms the state-of-the-art
methods on both classification and segmentation tasks. Moreover, further
analysis demonstrates that our SEN also achieves better reconstruction results.
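As a rough illustration of the Mean-Teacher-style training described above, here is a minimal sketch assuming a generic point-cloud classifier; the Gaussian `jitter` augmentation, the EMA decay, and the loss weight are assumptions for illustration, not the paper's exact SEN components.

```python
# A minimal Mean-Teacher-style sketch of the student/teacher consistency idea.
import torch
import torch.nn.functional as F

def jitter(points, sigma=0.01):
    """Simple Gaussian perturbation as a stand-in for the paper's augmentations."""
    return points + sigma * torch.randn_like(points)

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Temporal ensembling: exponential moving average of student weights into the teacher."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def train_step(student, teacher, optimizer, src_pts, src_labels, tgt_pts, cons_weight=1.0):
    student.train()
    # Supervised classification loss on labeled source-domain point clouds.
    sup_loss = F.cross_entropy(student(src_pts), src_labels)
    # Consistency loss: the student should match the temporally averaged teacher
    # on (differently perturbed) unlabeled target-domain point clouds.
    with torch.no_grad():
        teacher_prob = F.softmax(teacher(jitter(tgt_pts)), dim=-1)
    student_prob = F.softmax(student(jitter(tgt_pts)), dim=-1)
    cons_loss = F.mse_loss(student_prob, teacher_prob)
    loss = sup_loss + cons_weight * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)   # the teacher is never updated by gradients
    return loss.item()
```

The teacher would typically start as a frozen copy of the student and is updated only through the EMA, which is what provides the temporal consistency referred to in the abstract.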
MIMIC: Mask Image Pre-training with Mix Contrastive Fine-tuning for Facial Expression Recognition
Cutting-edge research in facial expression recognition (FER) currently favors
a convolutional neural network (CNN) backbone that is pre-trained in a
supervised manner on face recognition datasets for feature extraction.
However, due to the vast scale of face recognition datasets and the high cost
associated with collecting facial labels, this pre-training paradigm incurs
significant expenses. Towards this end, we propose to pre-train vision
Transformers (ViTs) through a self-supervised approach on a mid-scale general
image dataset. In addition, the divergence between general datasets and FER
datasets is more pronounced than the domain disparity between face datasets and
FER datasets. Therefore, we propose a contrastive
fine-tuning approach to effectively mitigate this domain disparity.
Specifically, we introduce a novel FER training paradigm named Mask Image
pre-training with MIx Contrastive fine-tuning (MIMIC). In the initial phase, we
pre-train the ViT via masked image reconstruction on general images.
Subsequently, in the fine-tuning stage, we introduce a mix-supervised
contrastive learning process, which provides the model with a more extensive
range of positive samples via a mixing strategy. Through extensive experiments
conducted on three benchmark datasets, we demonstrate that our MIMIC
outperforms the previous training paradigm, showing its capability to learn
better representations. Remarkably, the results indicate that the vanilla ViT
can achieve impressive performance without the need for intricate,
auxiliary-designed modules. Moreover, when scaling up the model size, MIMIC
exhibits no performance saturation and is superior to the current
state-of-the-art methods.
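A hedged sketch of the "mix, then contrast" fine-tuning idea follows: images are mixed to enlarge the pool of positives, and a supervised contrastive loss weights each pair by the overlap of its (soft) labels. The function names, the Beta mixing distribution, and the temperature are assumptions; this is not the released MIMIC implementation.

```python
# Illustrative sketch of mix-supervised contrastive fine-tuning.
import torch
import torch.nn.functional as F

def mixup(images, labels_onehot, alpha=0.8):
    """Convex-combine a batch with a shuffled copy of itself (images and one-hot labels)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_x = lam * images + (1 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y

def mix_supervised_contrastive(features, soft_labels, temperature=0.1):
    """Contrastive loss whose positive weight for a pair is the overlap of the soft labels."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature                        # pairwise cosine similarities
    off_diag = ~torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(~off_diag, float("-inf")), dim=1, keepdim=True)
    pos_w = (soft_labels @ soft_labels.t()) * off_diag   # soft positives from the mixing strategy
    # Weighted mean of log-probabilities over the (soft) positive pairs of each anchor.
    loss = -(pos_w * log_prob).sum(1) / pos_w.sum(1).clamp(min=1e-8)
    return loss.mean()
```

In the fine-tuning stage, such a loss would be applied to features from the masked-image-pre-trained ViT, presumably alongside a standard classification objective on the mixed labels.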
AU-Supervised Convolutional Vision Transformers for Synthetic Facial Expression Recognition
The paper describes our proposed methodology for the six basic expression
classification track of the Affective Behavior Analysis in-the-wild (ABAW)
Competition 2022. In the Learning from Synthetic Data (LSD) task, facial expression
recognition (FER) methods aim to learn the representation of expression from
the artificially generated data and generalise to real data. Because of the
ambiguity of the synthetic data and the objectivity of facial Action Units
(AUs), we resort to AU information for performance boosting and make the
following contributions. First, to adapt the model to synthetic scenarios, we
use knowledge from models pre-trained on large-scale face recognition data. Second,
we propose a conceptually new framework, termed AU-Supervised Convolutional
Vision Transformers (AU-CVT), which clearly improves the performance of FER by
jointly training on auxiliary datasets with AU or pseudo-AU labels. Our AU-CVT
achieved an F1 score of and an accuracy of on the validation set. The
source code of our work is publicly available online:
https://github.com/msy1412/ABAW
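A minimal multi-task sketch of the joint supervision described above: one shared backbone feeds an expression head trained with cross-entropy and an AU head trained with multi-label binary cross-entropy on AU (or pseudo-AU) labels. The head layout, the number of AUs, and the loss weight are assumptions, not the released AU-CVT implementation.

```python
# Illustrative joint expression/AU supervision on top of a shared backbone feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExprAUHeads(nn.Module):
    def __init__(self, feat_dim=512, num_expr=6, num_au=12):
        super().__init__()
        self.expr_head = nn.Linear(feat_dim, num_expr)   # six basic expressions
        self.au_head = nn.Linear(feat_dim, num_au)       # multi-label AU logits

    def forward(self, feats):
        return self.expr_head(feats), self.au_head(feats)

def joint_loss(expr_logits, au_logits, expr_labels, au_labels, au_weight=0.5):
    """Expression cross-entropy plus AU binary cross-entropy; au_weight is an assumed hyper-parameter."""
    expr_loss = F.cross_entropy(expr_logits, expr_labels)
    au_loss = F.binary_cross_entropy_with_logits(au_logits, au_labels.float())
    return expr_loss + au_weight * au_loss
```

In joint training, samples that carry only AU or pseudo-AU labels could contribute just the second term, which is one way auxiliary datasets can be trained alongside the synthetic expression data.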
- …