Utilising Visual Attention Cues for Vehicle Detection and Tracking
Advanced Driver-Assistance Systems (ADAS) have been attracting attention from
many researchers. Vision-based sensors are the closest way to emulate human
driver visual behavior while driving. In this paper, we explore possible ways
to use visual attention (saliency) for object detection and tracking. We
investigate: 1) how visual attention maps, such as a subjectness attention
(saliency) map and an objectness attention map, can facilitate
region proposal generation in a 2-stage object detector; 2) how a visual
attention map can be used for tracking multiple objects. We propose a neural
network that simultaneously detects objects and generates objectness and
subjectness maps to save computational power. We further exploit the visual
attention map during tracking using a sequential Monte Carlo probability
hypothesis density (PHD) filter. The experiments are conducted on KITTI and
DETRAC datasets. The use of visual attention and hierarchical features has
shown a considerable improvement of 8% in object detection, which in turn
increased tracking performance by 4% on the KITTI dataset.
Comment: Accepted in ICPR202
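The abstract above does not say how the attention maps enter region proposal generation. As an illustration only, the sketch below shows one simple way an attention map could influence proposals: each candidate box's detector score is blended with the mean attention inside the box, and the boxes are re-ranked. The function names, the blending weight `alpha`, and the toy data are assumptions made for this example, not the authors' method.

```python
def mean_attention(att_map, box):
    """Mean attention inside box = (x0, y0, x1, y1), with x1 and y1 exclusive."""
    x0, y0, x1, y1 = box
    values = [att_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values) if values else 0.0


def rerank_proposals(att_map, proposals, alpha=0.5):
    """Blend each proposal score with the mean attention in its box.

    proposals: list of (box, score) pairs.
    alpha: hypothetical blending weight (0 ignores attention, 1 ignores the score).
    Returns the proposals sorted by blended score, highest first.
    """
    rescored = [
        (box, (1.0 - alpha) * score + alpha * mean_attention(att_map, box))
        for box, score in proposals
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy 4x4 attention map with a salient 2x2 patch in the top-left corner.
att_map = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# The box over the non-salient region starts with the higher detector score.
proposals = [((2, 2, 4, 4), 0.6), ((0, 0, 2, 2), 0.5)]
ranked = rerank_proposals(att_map, proposals)
```

In this toy case the box covering the salient patch moves to the top of the ranking even though its raw detector score was lower.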
Transferring Cross-domain Knowledge for Video Sign Language Recognition
Word-level sign language recognition (WSLR) is a fundamental task in sign
language interpretation. It requires models to recognize isolated sign words
from videos. However, annotating WSLR data requires expert knowledge, which
limits WSLR dataset acquisition. In contrast, there are abundant
subtitled sign news videos on the internet. Since these videos have no
word-level annotation and exhibit a large domain gap from isolated signs, they
cannot be directly used for training WSLR models. We observe that despite the
existence of a large domain gap, isolated and news signs share the same visual
concepts, such as hand gestures and body movements. Motivated by this
observation, we propose a novel method that learns domain-invariant visual
concepts and enriches WSLR models by transferring knowledge from subtitled news
signs to them. To this end, we extract news signs using a base WSLR model, and
then design a classifier jointly trained on news and isolated signs to coarsely
align these two domain features. In order to learn domain-invariant features
within each class and suppress domain-specific features, our method further
resorts to an external memory to store the class centroids of the aligned news
signs. We then design a temporal attention based on the learnt descriptor to
improve recognition performance. Experimental results on standard WSLR datasets
show that our method outperforms previous state-of-the-art methods
significantly. We also demonstrate the effectiveness of our method on
automatically localizing signs from sign news, achieving 28.1 for AP@0.5.
Comment: CVPR2020 (oral) preprint
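The two mechanisms named in the abstract above, a memory of class centroids and a temporal attention driven by a learnt descriptor, can be illustrated with a minimal sketch. It assumes the memory keeps a running mean feature per class, and that attention is a softmax over dot products between per-frame features and that descriptor. The class `CentroidMemory`, the function `temporal_attention`, and the toy vectors are hypothetical; the paper's actual formulation may differ.

```python
import math


class CentroidMemory:
    """Keeps a running mean feature vector (centroid) per class label."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, label, feature):
        """Add one feature vector to the running sum for its class."""
        if label not in self.sums:
            self.sums[label] = [0.0] * len(feature)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], feature)]
        self.counts[label] += 1

    def centroid(self, label):
        """Return the mean of all features stored for this class."""
        n = self.counts[label]
        return [s / n for s in self.sums[label]]


def temporal_attention(frame_features, descriptor):
    """Softmax over frame/descriptor dot products, then attention-weighted pooling."""
    scores = [sum(f * d for f, d in zip(frame, descriptor)) for frame in frame_features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(len(descriptor))
    ]
    return weights, pooled


memory = CentroidMemory()
memory.update("hello", [1.0, 0.0])
memory.update("hello", [3.0, 0.0])
centroid = memory.centroid("hello")

# Three frames; only the middle one resembles the stored centroid.
frames = [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
weights, pooled = temporal_attention(frames, centroid)
```

Here the middle frame receives most of the attention weight, so the pooled clip feature is dominated by the frame that matches the class descriptor.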
StarNet: towards Weakly Supervised Few-Shot Object Detection
Few-shot detection and classification have advanced significantly in recent
years. Yet, detection approaches require strong annotation (bounding boxes)
both for pre-training and for adaptation to novel classes, and classification
approaches rarely provide localization of objects in the scene. In this paper,
we introduce StarNet - a few-shot model featuring an end-to-end differentiable
non-parametric star-model detection and classification head. Through this head,
the backbone is meta-trained using only image-level labels to produce good
features for jointly localizing and classifying previously unseen categories of
few-shot test tasks using a star-model that geometrically matches between the
query and support images (to find corresponding object instances). Being a
few-shot detector, StarNet does not require any bounding box annotations,
either during pre-training or for novel-class adaptation. It can thus be
applied to the previously unexplored and challenging task of Weakly Supervised
Few-Shot Object Detection (WS-FSOD), where it attains significant improvements
over the baselines. In addition, StarNet shows significant gains on few-shot
classification benchmarks that are less cropped around the objects (where
object localization is key).
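The geometric matching idea behind a star model can be sketched as a weighted Hough-style vote: each matched pair of local features between the query and a support image votes for where the object's reference point would fall in the query, and consistent matches pile up on one location. The sketch below is a simplified, non-differentiable illustration under that assumption. The function `star_vote`, the integer grid coordinates, and the match weights are hypothetical and do not reproduce StarNet's actual head.

```python
from collections import defaultdict


def star_vote(matches, support_center):
    """Accumulate weighted votes for the object reference point in the query.

    matches: list of (query_xy, support_xy, weight) for matched local features.
    support_center: reference point of the object in the support image.
    Each match votes for query_xy shifted by (support_center - support_xy).
    Returns the location with the highest accumulated weight and that weight.
    """
    votes = defaultdict(float)
    cx, cy = support_center
    for (qx, qy), (sx, sy), weight in matches:
        votes[(qx + cx - sx, qy + cy - sy)] += weight
    best = max(votes, key=votes.get)
    return best, votes[best]


matches = [
    ((5, 5), (1, 1), 0.9),
    ((6, 5), (2, 1), 0.8),
    ((5, 6), (1, 2), 0.7),
    ((0, 3), (2, 2), 0.4),  # geometrically inconsistent match (outlier)
]
center, score = star_vote(matches, support_center=(2, 2))
```

The three geometrically consistent matches agree on a single location, while the outlier votes elsewhere and is outweighed, which is why voting of this kind can localize an object without box annotations.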