Automatic annotation for weakly supervised learning of detectors
PhD
Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised, that is, they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media, and the rise of high-speed internet,
raw images and video are available for little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result, there has been
great interest in training detectors with weak supervision, where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of the multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine-based approach is a
much simpler formulation that uses only an inter-class measure and requires no complex combinatorial
optimisation, yet it can still meet or outperform existing approaches, including the previously presented
inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
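As a rough illustration of what an inter-class-only selection rule could look like, the sketch below picks, from the candidate windows of one positively labelled image, the window whose nearest neighbour among windows from negative images is farthest away. This scoring rule, the Euclidean distance, the function name `negmine_select`, and the toy feature vectors are all assumptions made for illustration; the abstract does not specify the actual NegMine formulation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negmine_select(candidates, negatives):
    """Pick the candidate window least similar to any negative window.

    Assumed rule (not taken from the abstract): score each candidate by the
    distance to its nearest neighbour among the negative windows, then keep
    the candidate with the largest score. No combinatorial search is needed.
    """
    best_index, best_score = -1, -math.inf
    for i, feat in enumerate(candidates):
        score = min(euclidean(feat, neg) for neg in negatives)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Toy 2-D features: windows 0 and 2 resemble the negatives, window 1 does not.
candidates = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
negatives = [[0.0, 0.0], [0.2, 0.2], [0.1, 0.3]]
index, score = negmine_select(candidates, negatives)
```

Under these assumptions the cost is a single pass over candidate and negative windows per image, which is consistent with the claim that no complex optimisation is required.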
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
Automatic detection of salient objects and spatial relations in videos for a video database system
Multimedia databases have gained popularity due to rapidly growing quantities of multimedia data and the need to perform efficient
indexing, retrieval and analysis of this data. One downside of multimedia databases is the necessity to process the data for feature extraction
and labeling prior to storage and querying. The huge amount of data makes it impossible to complete this task manually. We propose a
tool for the automatic detection and tracking of salient objects, and derivation of spatio-temporal relations between them in video. Our
system aims to significantly reduce the manual work of selecting and labeling objects by detecting and tracking the salient objects, so
that the label for each object needs to be entered only once within each shot instead of being specified in every frame
in which the object appears. This is also required as a first step in a fully-automatic video database management system in which the labeling should also
be done automatically. The proposed framework covers a scalable architecture for video processing and stages of shot boundary detection,
salient object detection and tracking, and knowledge-base construction for effective spatio-temporal object querying.
(c) 2008 Elsevier B.V. All rights reserved
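To make the idea of deriving spatial relations between tracked objects more concrete, here is a minimal sketch that compares two axis-aligned bounding boxes in a single frame. The `Box` type, the relation names, and the rule that touching boxes count as separated are hypothetical choices for illustration; the abstract does not describe how the system actually represents or computes relations.

```python
from typing import NamedTuple, Set


class Box(NamedTuple):
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def spatial_relations(a: Box, b: Box) -> Set[str]:
    """Return the assumed relations of box `a` with respect to box `b`."""
    relations = set()
    # Horizontal separation.
    if a.x_max <= b.x_min:
        relations.add("left_of")
    elif b.x_max <= a.x_min:
        relations.add("right_of")
    # Vertical separation (smaller y is higher in the image).
    if a.y_max <= b.y_min:
        relations.add("above")
    elif b.y_max <= a.y_min:
        relations.add("below")
    # Not separated on either axis means the boxes intersect.
    if not relations:
        relations.add("overlaps")
    return relations


car = Box(0, 20, 10, 30)
sign = Box(20, 0, 30, 10)
car_vs_sign = spatial_relations(car, sign)
```

Evaluating such a function for each object pair in each frame of a shot, and recording when the result changes, would be one possible way to obtain spatio-temporal facts for a knowledge base.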