3,393 research outputs found
Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in
Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors
mounted at different locations to monitor the driver and the vehicle's interior
scene and employ decision-level fusion to integrate these heterogeneous data.
However, this fusion method may not fully utilize the complementarity of
different data sources and may overlook their relative importance. To address
these limitations, we propose a novel multiview multimodal driver monitoring
system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative
fusion strategies (Sum, Conv, SE, and AFF). We also present a novel
GPU-friendly supervised contrastive learning framework SuMoCo to learn better
representations. Furthermore, we refine the test split of the DAD dataset with
fine-grained labels, enabling multi-class recognition of drivers' activities. Experiments on
this enhanced database demonstrate that 1) the proposed MHSA-based fusion
method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and
2) training MHSA with patch masking can improve its robustness against
modality/view collapses. The code and annotations are publicly available.
Comment: 9 pages (1 for reference); accepted by the 6th Multimodal Learning and Applications Workshop (MULA) at CVPR 202
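The abstract gives no implementation details; the following is a minimal sketch of feature-level fusion with masked multi-head self-attention, assuming each camera view or modality has already been encoded into patch tokens of a shared dimension. The module name `FusionMHSA`, the tensor shapes, and the token-dropping masking scheme are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the paper's code): feature-level fusion of
# multiview/multimodal patch tokens with masked multi-head self-attention.
import torch
import torch.nn as nn

class FusionMHSA(nn.Module):  # hypothetical module name
    def __init__(self, dim=256, num_heads=8, mask_ratio=0.3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mask_ratio = mask_ratio

    def forward(self, view_tokens):
        # view_tokens: list of (B, N_i, dim) patch-token sequences, one per view/modality
        x = torch.cat(view_tokens, dim=1)                      # (B, N_total, dim)
        key_padding_mask = None
        if self.training and self.mask_ratio > 0:
            # randomly hide a fraction of patch tokens to simulate missing views/modalities
            key_padding_mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        fused, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        return self.norm(fused.mean(dim=1))                    # pooled fused embedding (B, dim)

# Example: fuse patch tokens from two camera views and one depth stream
tokens = [torch.randn(4, 16, 256) for _ in range(3)]
embedding = FusionMHSA()(tokens)                               # (4, 256)
```

Randomly dropping patch tokens during training is one way to mimic missing views or modalities, which is the kind of collapse the masked-MHSA training described above is meant to withstand.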
MHSA-Net: Multi-Head Self-Attention Network for Occluded Person Re-Identification
This paper presents a novel person re-identification model, named Multi-Head
Self-Attention Network (MHSA-Net), to prune unimportant information and capture
key local information from person images. MHSA-Net contains two main novel
components: Multi-Head Self-Attention Branch (MHSAB) and Attention Competition
Mechanism (ACM). The MHSAB adaptively captures key local person information,
and then produces effective diversity embeddings of an image for the person
matching. The ACM further helps filter out attention noise and non-key
information. Through extensive ablation studies, we verified that the
Multi-Head Self-Attention Branch and Attention Competition Mechanism both
contribute to the performance improvement of the MHSA-Net. Our MHSA-Net
achieves state-of-the-art performance especially on images with occlusions. We
have released our models (and will release the source code after the paper is accepted) at https://github.com/hongchenphd/MHSA-Net.
Comment: Submitted to IEEE Transactions on Image Processing (TIP)
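As a rough illustration of the branch idea (not the released MHSA-Net code), the sketch below runs multi-head self-attention over flattened backbone feature maps and derives one embedding per head, so that different heads can describe different local regions of a person image. The function name `mhsa_branch` and the per-head attention pooling are assumptions.

```python
# Illustrative sketch: a multi-head self-attention branch over CNN feature maps
# that yields one embedding per head (diverse local descriptors).
import torch
import torch.nn as nn

def mhsa_branch(feat_map, attn):
    # feat_map: (B, C, H, W) backbone features; attn: nn.MultiheadAttention with batch_first=True
    B, C, H, W = feat_map.shape
    tokens = feat_map.flatten(2).transpose(1, 2)             # (B, H*W, C)
    refined, weights = attn(tokens, tokens, tokens, average_attn_weights=False)
    # weights: (B, num_heads, H*W, H*W); pool tokens per head using that head's attention mass
    head_masks = weights.mean(dim=2)                          # (B, num_heads, H*W)
    head_masks = head_masks / head_masks.sum(-1, keepdim=True)
    embeddings = torch.einsum('bhn,bnc->bhc', head_masks, refined)  # (B, num_heads, C)
    return embeddings

attn = nn.MultiheadAttention(embed_dim=512, num_heads=4, batch_first=True)
emb = mhsa_branch(torch.randn(2, 512, 16, 8), attn)          # (2, 4, 512)
```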
Manifold-Preserving Transformers are Effective for Short-Long Range Encoding
Multi-head self-attention-based Transformers have shown promise in different
learning tasks. Although these models exhibit significant improvements in
understanding short-term and long-term contexts from sequences, encoders of
Transformers and their variants fail to preserve layer-wise contextual
information. Transformers usually project tokens onto sparse manifolds and fail
to preserve mathematical equivalence among the token representations. In this
work, we propose TransJect, an encoder model that guarantees a theoretical
bound for layer-wise distance preservation between a pair of tokens. We propose
a simple alternative to dot-product attention to ensure Lipschitz continuity.
This allows TransJect to learn injective mappings to transform token
representations to different manifolds with similar topology and preserve
Euclidean distance between every pair of tokens in subsequent layers.
Evaluations across multiple benchmark short- and long-sequence classification
tasks show maximum improvements of 6.8% and 5.9%, respectively, over the
variants of Transformers. Additionally, TransJect displays 79% better
performance than Transformer on the language modeling task. We further
highlight the shortcomings of multi-head self-attention from the statistical
physics viewpoint. Although multi-head self-attention was conceived to learn
different levels of abstraction within the network, our empirical analyses
suggest that different attention heads learn in a random, disordered manner. In
contrast, TransJect adopts a mixture of experts for regularization; these
experts are more orderly and balanced and learn different sparse
representations from the input sequences. TransJect exhibits very low entropy
and can be efficiently scaled to larger depths.
Comment: 17 pages, 7 figures, 5 tables, Findings of the Association for Computational Linguistics: EMNLP202
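The abstract does not spell out the attention replacement, so the sketch below only contrasts standard scaled dot-product attention with a distance-based score, one common route to better-behaved (Lipschitz-friendlier) attention maps; it should not be read as TransJect's actual formulation.

```python
# Illustrative contrast (not TransJect itself): dot-product attention versus an
# L2-distance-based score whose attention decays with token distance.
import torch

def dot_product_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def l2_distance_attention(q, k, v):
    # scores decay with squared Euclidean distance between query and key tokens
    scores = -torch.cdist(q, k).pow(2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)
out_dot = dot_product_attention(q, k, v)
out_l2 = l2_distance_attention(q, k, v)
```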
Class token and knowledge distillation for multi-head self-attention speaker verification systems
This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNNs) using Multi-head Self-Attention (MSA) mechanisms and memory layers. First, we propose the use of a learnable vector, called the class token, to replace global average pooling for extracting the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input, which is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two further approaches. First, we develop a sampling estimation of the class token: the class token is obtained by sampling from a list of several trainable vectors. This strategy introduces uncertainty that helps the model generalize better than a single initialization, as shown in the experiments. Second, we add a distilled representation token, combined with the class token, for training a teacher-student pair of networks following the Knowledge Distillation (KD) philosophy. The distillation token is trained to mimic the predictions of the teacher network, while the class token replicates the true label. All strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using average pooling to extract the embeddings.
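A minimal sketch of the class-token mechanism described above, assuming frame-level acoustic features as input; the sampling of the class token from several trainable vectors and the distillation token are omitted, and the module name `ClassTokenEncoder` is an illustrative assumption rather than the paper's architecture.

```python
# Illustrative sketch: prepend a learnable class token to frame-level features,
# run MSA encoder layers, and read the class token's output state as the
# utterance embedding (distillation token and token sampling omitted).
import torch
import torch.nn as nn

class ClassTokenEncoder(nn.Module):  # hypothetical name
    def __init__(self, dim=256, num_heads=4, num_layers=4, num_classes=100):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):
        # frames: (B, T, dim) frame-level acoustic features
        cls = self.cls_token.expand(frames.size(0), -1, -1)   # (B, 1, dim)
        x = self.encoder(torch.cat([cls, frames], dim=1))     # class token attends to all frames
        embedding = x[:, 0]                                   # class-token output state
        return embedding, self.head(embedding)

emb, logits = ClassTokenEncoder()(torch.randn(8, 200, 256))   # (8, 256), (8, 100)
```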
- …