Utilising Visual Attention Cues for Vehicle Detection and Tracking
Advanced Driver-Assistance Systems (ADAS) have attracted attention from
many researchers. Vision-based sensors are the closest means of emulating
human driver visual behavior. In this paper, we explore possible ways
to use visual attention (saliency) for object detection and tracking. We
investigate: 1) How a visual attention map such as a \emph{subjectness}
attention or saliency map and an \emph{objectness} attention map can facilitate
region proposal generation in a 2-stage object detector; 2) How a visual
attention map can be used for tracking multiple objects. We propose a neural
network that simultaneously detects objects and generates objectness and
subjectness maps to save computational power. We further exploit the visual
attention map during tracking using a sequential Monte Carlo probability
hypothesis density (PHD) filter. The experiments are conducted on KITTI and
DETRAC datasets. The use of visual attention and hierarchical features
yields a considerable 8\% improvement in object detection, which in turn
increases tracking performance by 4\% on the KITTI dataset.
Comment: Accepted in ICPR202
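The first question above, using an attention map to guide region proposals, can be illustrated with a minimal sketch: rank candidate windows by the mean saliency they enclose and keep the top-scoring ones as proposals. The function name and box format below are illustrative assumptions, not the paper's actual two-stage detector.

```python
import numpy as np

def score_proposals(saliency, boxes):
    """Score candidate boxes by the mean saliency inside each box.

    saliency: HxW array in [0, 1]; boxes: list of (x1, y1, x2, y2).
    A hypothetical helper showing how an attention map can rank
    region proposals before the second detection stage.
    """
    scores = []
    for x1, y1, x2, y2 in boxes:
        region = saliency[y1:y2, x1:x2]
        scores.append(float(region.mean()) if region.size else 0.0)
    return scores

sal = np.zeros((10, 10))
sal[2:6, 2:6] = 1.0  # one bright (salient) blob
boxes = [(2, 2, 6, 6), (0, 0, 4, 4), (6, 6, 10, 10)]
print(score_proposals(sal, boxes))  # -> [1.0, 0.25, 0.0]
```

A real system would feed the top-ranked boxes to the second-stage classifier, so a good attention map directly cuts the number of proposals that need scoring.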
Automatic annotation for weakly supervised learning of detectors
PhD thesis. Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised, that is, they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media and the rise of high-speed internet,
raw images and videos are available at little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result, there has been
great interest in training detectors with weak supervision, where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of the multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
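The alternation described above, selecting one window per weakly labelled image and retraining a detector on those selections, can be sketched as a toy MIL loop. This is an illustrative simplification, not the thesis implementation: the nearest-centroid scorer stands in for a full detector, and the norm-based initial selection stands in for the inter/intra-class cue fusion.

```python
import numpy as np

def refit_scorer(pos_feats, neg_feats):
    # Nearest-centroid stand-in for retraining a real detector:
    # score = projection onto (positive mean - negative mean).
    return pos_feats.mean(axis=0) - neg_feats.mean(axis=0)

def mil_iterate(bags, neg_feats, n_iters=5):
    # Each bag is an array of candidate-window feature vectors from one
    # weakly labelled positive image. Crude initial annotation: pick the
    # strongest-feature window (the thesis uses cue fusion here instead).
    selected = [int(np.argmax(np.linalg.norm(bag, axis=1))) for bag in bags]
    for _ in range(n_iters):
        pos = np.stack([bag[i] for bag, i in zip(bags, selected)])
        w = refit_scorer(pos, neg_feats)
        # Re-localise: keep the highest-scoring window in each bag.
        selected = [int(np.argmax(bag @ w)) for bag in bags]
    return selected

rng = np.random.default_rng(0)
obj = np.array([3.0, 3.0])                        # "object" feature direction
bags = [np.vstack([rng.normal(0, 1, 2),           # window 0: clutter
                   obj + rng.normal(0, 0.1, 2)])  # window 1: the object
        for _ in range(5)]
neg = rng.normal(0, 1, (20, 2))                   # windows from negative images
print(mil_iterate(bags, neg))                     # settles on window 1 per bag
```

The drift-detection scheme mentioned above would sit inside this loop, checking that the re-localised windows still overlap the earlier selections before accepting a new iteration.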
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine-based approach is a
much simpler formulation, using only an inter-class measure, and requires no complex combinatorial
optimisation, yet can still match or outperform existing approaches, including the previously
presented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
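The inter-class idea behind negative mining can be sketched in a few lines: score each candidate window in a positive image by its distance to windows harvested from negative images, and take the window that looks least like anything negative. The function name and scoring rule below are assumptions for illustration, not the thesis formulation.

```python
import numpy as np

def negmine_select(pos_windows, neg_windows):
    """Pick the window least similar to negative-image windows.

    Each row is a window feature vector. Score = distance to the
    nearest negative window; the window farthest from everything
    seen in negative images is the most likely object location.
    """
    d = np.linalg.norm(pos_windows[:, None, :] - neg_windows[None, :, :],
                       axis=-1)
    scores = d.min(axis=1)          # nearest-negative distance per window
    return int(np.argmax(scores))   # most "un-negative" window

neg = np.array([[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1]])   # background-like
pos = np.array([[0.05, 0.0], [2.5, 2.4], [0.1, -0.1]])  # window 1 = object
print(negmine_select(pos, neg))  # -> 1
```

Because the score needs only distances to negatives, there is no combinatorial search over window selections, which is what keeps the formulation simple.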
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
Instance Embedding Transfer to Unsupervised Video Object Segmentation
We propose a method for unsupervised video object segmentation by
transferring the knowledge encapsulated in image-based instance embedding
networks. The instance embedding network produces an embedding vector for each
pixel that enables identifying all pixels belonging to the same object. Though
trained on static images, the instance embeddings are stable over consecutive
video frames, which allows us to link objects together over time. Thus, we
adapt the instance networks trained on static images to video object
segmentation and incorporate the embeddings with objectness and optical flow
features, without model retraining or online fine-tuning. The proposed method
outperforms state-of-the-art unsupervised segmentation methods on the DAVIS
and FBMS datasets.
Comment: To appear in CVPR 201
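The linking step described above reduces to a simple operation: because the per-pixel embeddings stay stable across frames, thresholding each pixel's distance to a seed object embedding recovers the same object in a new frame. The threshold, shapes, and function name below are illustrative assumptions.

```python
import numpy as np

def link_by_embedding(seed_emb, frame_embs):
    """Mask pixels whose embeddings lie near a seed object embedding.

    frame_embs: HxWxD array of per-pixel embedding vectors;
    seed_emb: D-vector taken from the object in an earlier frame.
    Returns a boolean HxW object mask for the new frame.
    """
    d = np.linalg.norm(frame_embs - seed_emb, axis=-1)
    return d < 0.5  # assumed distance threshold

seed = np.array([1.0, 0.0])                      # object embedding
frame = np.array([[[1.0, 0.1], [0.0, 1.0]],
                  [[0.9, 0.0], [0.1, 0.9]]])     # 2x2 frame, 2-D embeddings
print(link_by_embedding(seed, frame))            # left column is the object
```

In the full method this mask would be fused with objectness and optical-flow features rather than used alone, but the temporal link itself needs no retraining.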
BING: Binarized normed gradients for objectness estimation at 300fps
Training a generic objectness measure to produce object proposals has recently become of significant interest. We observe that generic objects with well-defined closed boundaries can be detected by looking at the norm of gradients, after resizing their corresponding image windows to a small fixed size. Based on this observation and for computational reasons, we propose to resize the window to 8 × 8 and use the norm of the gradients as a simple 64D feature to describe it, for explicitly training a generic objectness measure. We further show how the binarized version of this feature, namely binarized normed gradients (BING), can be used for efficient objectness estimation, which requires only a few atomic operations (e.g., add, bitwise shift, etc.). To improve the localization quality of the proposals while maintaining efficiency, we propose a novel fast segmentation method and demonstrate its effectiveness for improving BING's localization performance when used in multi-thresholding straddling expansion (MTSE) postprocessing. On the challenging PASCAL VOC2007 dataset, using 1000 proposals per image and an intersection-over-union threshold of 0.5, our proposal method achieves a 95.6% object detection rate and 78.6% mean average best overlap in less than 0.005 seconds per image.
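The 64D feature above can be sketched directly: resize a window to 8 × 8 and take the per-pixel gradient magnitude, clipped to 255, as the normed-gradient (NG) descriptor. The crude nearest-neighbour resize and finite differences below are simplifications of the paper's implementation, shown only to make the feature concrete.

```python
import numpy as np

def ng_feature(window):
    """Normed-gradient feature: 8x8 resize, then per-pixel |gx|+|gy|.

    Returns a 64-D vector; closed object boundaries produce strong
    responses along the resized window's edges.
    """
    h, w = window.shape
    ys = np.arange(8) * h // 8          # crude nearest-neighbour
    xs = np.arange(8) * w // 8          # 8x8 downsampling grid
    small = window[np.ix_(ys, xs)].astype(float)
    gx = np.zeros_like(small)
    gy = np.zeros_like(small)
    gx[:, 1:] = np.abs(np.diff(small, axis=1))  # horizontal gradient
    gy[1:, :] = np.abs(np.diff(small, axis=0))  # vertical gradient
    return np.minimum(gx + gy, 255).ravel()     # clipped 64-D NG feature

win = np.zeros((32, 32))
win[8:24, 8:24] = 200        # bright square: a well-defined closed boundary
f = ng_feature(win)
print(f.shape, f.max())      # 64-D vector, strong boundary responses
```

The binarized BING variant then approximates the learned weights and this feature with a few bitwise operations, which is what makes 300 fps estimation possible.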