Pose-disentangled Contrastive Learning for Self-supervised Facial Representation
Self-supervised facial representation has recently attracted increasing
attention due to its ability to perform face understanding without relying
heavily on large-scale annotated datasets. However, current contrastive-based
self-supervised learning still performs unsatisfactorily for learning facial
representation. More specifically, existing contrastive learning (CL) tends to
learn pose-invariant features that cannot depict the pose details of faces,
compromising the learning performance. To overcome this limitation of CL, we
propose a novel Pose-disentangled Contrastive Learning (PCL) method for
general self-supervised facial representation. Our
PCL first devises a pose-disentangled decoder (PDD) with a delicately designed
orthogonalizing regulation, which disentangles the pose-related features from
the face-aware features; thus, pose-related and other pose-unrelated facial
information are processed in separate subnetworks and do not affect each
other's training. Furthermore, we introduce a pose-related
contrastive learning scheme that learns pose-related information based on data
augmentation of the same image, which delivers more effective face-aware
representations for various downstream tasks. We conducted a comprehensive
linear evaluation on three challenging downstream facial understanding tasks,
i.e., facial expression recognition, face recognition, and AU detection.
Experimental results demonstrate that our method outperforms cutting-edge
contrastive and other self-supervised learning methods by a large margin.
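The orthogonalizing regulation described above can be illustrated with a minimal sketch. One common way to encourage disentanglement between two embeddings (an illustrative assumption here, not necessarily PCL's exact formulation) is to penalize the squared cosine similarity between the pose-related and face-aware feature vectors, which is zero exactly when the two are orthogonal:

```python
import numpy as np

def orthogonality_penalty(pose_feat, face_feat, eps=1e-8):
    """Penalize overlap between pose and face embeddings.

    Hypothetical stand-in for an orthogonalizing regulation: the squared
    cosine similarity between the two feature vectors. It is 0 when the
    features are orthogonal (fully disentangled) and approaches 1 when
    they are collinear (fully entangled).
    """
    cos = np.dot(pose_feat, face_feat) / (
        np.linalg.norm(pose_feat) * np.linalg.norm(face_feat) + eps)
    return cos ** 2

# Orthogonal features incur no penalty ...
print(orthogonality_penalty(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
# ... while identical features are maximally penalized (~1.0).
print(orthogonality_penalty(np.array([1.0, 0.0]), np.array([1.0, 0.0])))
```

In a training loop, such a term would be added to the contrastive loss so that the gradient pushes the two subnetworks toward non-overlapping feature subspaces.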
MAMAF-Net: Motion-Aware and Multi-Attention Fusion Network for Stroke Diagnosis
Stroke is a major cause of mortality and disability worldwide, and one in four
people is at risk of experiencing it in their lifetime. The pre-hospital
stroke assessment plays a vital role in identifying stroke patients accurately
to accelerate further examination and treatment in hospitals. Accordingly, the
National Institutes of Health Stroke Scale (NIHSS), Cincinnati Pre-hospital
Stroke Scale (CPSS) and Face Arm Speed Time (F.A.S.T.) are globally known tests
for stroke assessment. However, the validity of these tests is questionable in the
absence of neurologists. Therefore, in this study, we propose a motion-aware
and multi-attention fusion network (MAMAF-Net) that can detect stroke from
multimodal examination videos. Unlike other studies on stroke detection from
video analysis, ours is the first to propose an end-to-end
solution from multiple video recordings of each subject with a dataset
encapsulating stroke, transient ischemic attack (TIA), and healthy controls.
The proposed MAMAF-Net consists of motion-aware modules to sense the mobility
of patients, attention modules to fuse the multi-input video data, and 3D
convolutional layers to perform diagnosis from the attention-based extracted
features. Experimental results over the collected StrokeDATA dataset show that
the proposed MAMAF-Net successfully detects stroke with 93.62% sensitivity and
a 95.33% AUC score.
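The attention-based fusion of multiple video inputs can be sketched generically. The scalar-score softmax weighting below is an illustrative simplification, not MAMAF-Net's exact architecture; the `query` vector and feature dimensions are hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(view_feats, query):
    """Fuse per-video feature vectors with scalar attention weights.

    Sketch of multi-input attention fusion: each view's feature vector is
    scored against a (here, hypothetical) learned query vector, the scores
    are normalized with softmax, and the fused representation is the
    weighted sum of the view features.
    """
    scores = np.array([np.dot(f, query) for f in view_feats])
    weights = softmax(scores)
    fused = np.sum(weights[:, None] * np.stack(view_feats), axis=0)
    return fused, weights

# Three per-video features; the query favors the first dimension.
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
query = np.array([1.0, 0.0])
fused, w = attention_fuse(feats, query)
print(w)  # weights sum to 1; views aligned with the query dominate
```

In the actual network, the fused feature would then feed the 3D convolutional layers that produce the diagnosis.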
Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation
Despite the tremendous achievements of deep convolutional neural networks
(CNNs) in many computer vision tasks, understanding how they actually work
remains a significant challenge. In this paper, we propose a novel two-step
understanding method, namely Salient Relevance (SR) map, which aims to shed
light on how deep CNNs recognize images and learn features from areas, referred
to as attention areas, therein. Our proposed method starts out with a
layer-wise relevance propagation (LRP) step which estimates a pixel-wise
relevance map over the input image. Next, we construct a context-aware
saliency map, the SR map, from the LRP-generated map; it predicts areas close
to the foci of attention rather than the isolated pixels that LRP reveals. In
the human visual system, region-level information matters more for recognition
than pixel-level information. Consequently, our proposed approach closely
simulates human
recognition. Experimental results using the ILSVRC2012 validation dataset in
conjunction with two well-established deep CNN models, AlexNet and VGG-16,
clearly demonstrate that our proposed approach concisely identifies not only
key pixels but also attention areas that contribute to the underlying neural
network's comprehension of the given images. As such, our proposed SR map
constitutes a convenient visual interface which unveils the visual attention of
the network and reveals which type of objects the model has learned to
recognize after training. The source code is available at
https://github.com/Hey1Li/Salient-Relevance-Propagation.

Comment: 35 pages, 15 figures
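The LRP step can be illustrated on a single linear layer. The epsilon-rule below is a standard simplification of LRP, not necessarily the exact variant the paper uses: each input's share of an output neuron's relevance is proportional to its contribution to that neuron's pre-activation.

```python
import numpy as np

def lrp_epsilon(x, W, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of one linear layer.

    Simplified LRP epsilon-rule: the contribution of input i to output
    neuron j is z_ij = x_i * W_ij; input i receives a share of neuron
    j's relevance proportional to z_ij / sum_i z_ij. The small eps term
    stabilizes near-zero denominators.
    """
    z = x[:, None] * W                                   # contributions z_ij
    col_sums = z.sum(axis=0)                             # pre-activations
    denom = col_sums + eps * np.sign(col_sums)
    return (z / denom * relevance_out).sum(axis=1)

# Toy layer: each input feeds exactly one output neuron.
x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
R_out = np.array([0.5, 0.5])
R_in = lrp_epsilon(x, W, R_out)
print(R_in)  # relevance is conserved: the input relevances sum to ~1.0
```

Applying such a rule layer by layer, from the output back to the input, yields the pixel-wise relevance map from which the SR map is then constructed.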