Loss Guided Activation for Action Recognition in Still Images
A significant problem of deep-learning-based human action recognition is
that it can easily be misled by the presence of irrelevant objects or
backgrounds. Existing methods commonly address this problem by employing
bounding boxes on the target humans as part of the input, in both the training
and testing stages; the bounding boxes enable these methods to ignore
irrelevant contexts and extract only human features. However, we consider this
solution inefficient, since bounding boxes may not always be available. Hence,
instead of using a person bounding box as an input, we introduce a human-mask
loss that automatically guides the activations of the feature maps toward the
target human performing the action, and hence suppresses the activations of
misleading contexts. We propose a
multi-task deep learning method that jointly predicts the human action class
and the human-location heatmap. Extensive experiments demonstrate that our
approach is more robust than the baseline methods in the presence of
irrelevant, misleading contexts. Our method achieves 94.06% and 40.65% mAP on
the Stanford40 and MPII datasets, respectively, which are 3.14% and 12.6%
relative improvements over the best results reported in the literature, thus
setting new state-of-the-art results. Additionally, unlike some existing
methods, we eliminate the need for a person bounding box as an input during
testing.
Comment: Accepted to appear in ACCV 2018
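To make the multi-task idea concrete, below is a minimal PyTorch sketch of a shared backbone with two heads: one predicting the action class and one predicting a human-location heatmap supervised by a human-mask loss. All module names, shapes, and the loss weighting `lam` are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Stand-in backbone; the paper would use a much deeper CNN.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(64, num_classes)   # action-class head
        self.heatmap_head = nn.Conv2d(64, 1, 1)        # human-location head

    def forward(self, x):
        feats = self.backbone(x)                        # (B, 64, H, W)
        logits = self.classifier(feats.mean(dim=(2, 3)))
        heatmap = self.heatmap_head(feats)              # (B, 1, H, W)
        return logits, heatmap

def multitask_loss(logits, heatmap, labels, human_mask, lam=1.0):
    # Classification loss plus a human-mask loss that pushes feature
    # activations toward the person performing the action.
    cls_loss = F.cross_entropy(logits, labels)
    mask_loss = F.binary_cross_entropy_with_logits(heatmap, human_mask)
    return cls_loss + lam * mask_loss

# Usage with dummy data (human mask is only needed at training time):
model = ActionNet(num_classes=40)
x = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 40, (2,))
mask = torch.rand(2, 1, 32, 32)
loss = multitask_loss(*model(x), labels, mask)
loss.backward()
```

At test time only the classification head is consulted, which is consistent with the claim that no person bounding box is required as input during testing.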
Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection
Multi-label image classification is a fundamental but challenging task in
general visual understanding. Existing methods have found that region-level
cues (e.g., features from RoIs) can facilitate multi-label classification.
Nevertheless, such methods usually require laborious object-level annotations
(i.e., object labels and bounding boxes) to learn the object-level visual
features effectively. In this paper, we propose a novel and efficient deep
framework that boosts multi-label classification by distilling knowledge from
a weakly-supervised detection task requiring no bounding-box annotations.
Specifically, given the image-level annotations, (1) we first develop a
weakly-supervised detection (WSD) model, and then (2) construct an end-to-end
multi-label image classification framework augmented by a knowledge
distillation module, in which the WSD model guides the classification model
through its class-level predictions for the whole image and its object-level
visual features for object RoIs. The WSD model serves as the teacher and the
classification model as the student. After this cross-task knowledge
distillation, the performance of the classification model is significantly
improved while its efficiency is maintained, since the WSD model can be safely
discarded in the test phase. Extensive experiments on two large-scale datasets
(MS-COCO and NUS-WIDE) show that our framework surpasses state-of-the-art
methods in both accuracy and efficiency.
Comment: Accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 tables
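The following is a minimal sketch of the class-level distillation step described above: a frozen WSD teacher provides soft image-level class predictions, and the multi-label student is trained to match them alongside the ground-truth labels. The temperature, loss weighting `alpha`, and the omission of the feature-level term are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    # Supervised multi-label loss on the ground-truth labels.
    hard_loss = F.binary_cross_entropy_with_logits(student_logits, targets)
    # Soft loss: match the teacher's tempered class-level predictions.
    t_soft = torch.sigmoid(teacher_logits / temperature)
    s_soft = student_logits / temperature
    soft_loss = F.binary_cross_entropy_with_logits(s_soft, t_soft)
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Usage with dummy predictions over 80 classes (e.g., MS-COCO):
student = torch.randn(4, 80, requires_grad=True)
teacher = torch.randn(4, 80)            # teacher is frozen: no gradient
targets = torch.randint(0, 2, (4, 80)).float()
loss = distillation_loss(student, teacher, targets)
loss.backward()
# At test time only the student runs; the WSD teacher is discarded,
# which is how the framework keeps classification-level efficiency.
```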
Multi-layered Semantic Representation Network for Multi-label Image Classification
Multi-label image classification (MLIC) is a fundamental and practical task
that aims to assign multiple possible labels to an image. In recent years,
many deep convolutional neural network (CNN) based approaches have been
proposed that model label correlations to discover label semantics and learn
semantic representations of images. This paper advances this research
direction by improving both the modeling of label correlations and the
learning of semantic representations. On the one hand, besides the local
semantics of each label, we propose to further explore global semantics shared
by multiple labels. On the other hand, existing approaches mainly learn
semantic representations at the last convolutional layer of a CNN, yet it has
been noted that representations from different layers of a CNN capture
features at different levels or scales and have different discriminative
abilities. We
thus propose to learn semantic representations at multiple convolutional
layers. To this end, this paper designs a Multi-layered Semantic
Representation Network (MSRN) that discovers both local and global label
semantics by modeling label correlations and utilizes these semantics to guide
semantic-representation learning at multiple layers through an attention
mechanism. Extensive experiments on four benchmark datasets, including VOC
2007, COCO, NUS-WIDE, and Apparel, show the competitive performance of the
proposed MSRN against state-of-the-art models.
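Below is a minimal PyTorch sketch of the core mechanism described above: learned label embeddings (the label semantics) attend over the spatial features of multiple convolutional layers to produce per-label semantic representations. The embedding size, the choice of layers, the shared projection, and the averaging fusion are all assumptions for illustration, not MSRN's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGuidedAttention(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int, emb_dim: int = 256):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, emb_dim)  # label semantics
        self.proj = nn.Linear(feat_dim, emb_dim)            # project features

    def forward(self, feats):
        # feats: (B, C, H, W) feature map from one convolutional layer.
        f = self.proj(feats.flatten(2).transpose(1, 2))     # (B, HW, emb)
        # Attention score of each label over each spatial location.
        scores = torch.einsum('bpe,le->blp', f, self.label_emb.weight)
        attn = F.softmax(scores, dim=-1)                    # (B, L, HW)
        # Per-label semantic representation from this layer.
        return torch.einsum('blp,bpe->ble', attn, f)        # (B, L, emb)

# Apply the same label semantics at two layers and fuse by averaging:
attend = LabelGuidedAttention(feat_dim=64, num_labels=20)
low = torch.randn(2, 64, 28, 28)      # earlier layer (finer scale)
high = torch.randn(2, 64, 7, 7)       # last layer (coarser scale)
fused = (attend(low) + attend(high)) / 2   # (2, 20, 256) per-label features
```

The per-label vectors in `fused` would then feed per-label classifiers; attending at both a fine and a coarse layer reflects the paper's observation that different layers have different discriminative abilities.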