460 research outputs found
Weakly Supervised Object Localization with Multi-fold Multiple Instance Learning
Object category localization is a challenging problem in computer vision.
Standard supervised training requires bounding box annotations of object
instances. This time-consuming annotation process is sidestepped in weakly
supervised learning. In this case, the supervised information is restricted to
binary labels that indicate the absence/presence of object instances in the
image, without their locations. We follow a multiple-instance learning approach
that iteratively trains the detector and infers the object locations in the
positive training images. Our main contribution is a multi-fold multiple
instance learning procedure, which prevents training from prematurely locking
onto erroneous object locations. This procedure is particularly important when
using high-dimensional representations, such as Fisher vectors and
convolutional neural network features. We also propose a window refinement
method, which improves the localization accuracy by incorporating an objectness
prior. We present a detailed experimental evaluation using the PASCAL VOC 2007
dataset, which verifies the effectiveness of our approach.Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI
Automatic annotation for weakly supervised learning of detectors
PhDObject detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised, that is they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media, and the rise of high-speed internet,
raw images and video are available for little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result there has been
a great interest in training detectors with weak supervision where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors do not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine based approach is a
much simpler formulation using only inter-class measure and requires no complex combinatorial
optimisation but can still meet or outperform existing approaches including the previously pre3
sented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy
Revisiting knowledge transfer for training object class detectors
We propose to revisit knowledge transfer for training object detectors on
target classes from weakly supervised training images, helped by a set of
source classes with bounding-box annotations. We present a unified knowledge
transfer framework based on training a single neural network multi-class object
detector over all source classes, organized in a semantic hierarchy. This
generates proposals with scores at multiple levels in the hierarchy, which we
use to explore knowledge transfer over a broad range of generality, ranging
from class-specific (bicycle to motorbike) to class-generic (objectness to any
class). Experiments on the 200 object classes in the ILSVRC 2013 detection
dataset show that our technique: (1) leads to much better performance on the
target classes (70.3% CorLoc, 36.9% mAP) than a weakly supervised baseline
which uses manually engineered objectness [11] (50.5% CorLoc, 25.4% mAP). (2)
delivers target object detectors reaching 80% of the mAP of their fully
supervised counterparts. (3) outperforms the best reported transfer learning
results on this dataset (+41% CorLoc and +3% mAP over [18, 46], +16.2% mAP over
[32]). Moreover, we also carry out several across-dataset knowledge transfer
experiments [27, 24, 35] and find that (4) our technique outperforms the weakly
supervised baseline in all dataset pairs by 1.5x-1.9x, establishing its general
applicability.Comment: CVPR 1
Object Level Deep Feature Pooling for Compact Image Representation
Convolutional Neural Network (CNN) features have been successfully employed
in recent works as an image descriptor for various vision tasks. But the
inability of the deep CNN features to exhibit invariance to geometric
transformations and object compositions poses a great challenge for image
search. In this work, we demonstrate the effectiveness of the objectness prior
over the deep CNN features of image regions for obtaining an invariant image
representation. The proposed approach represents the image as a vector of
pooled CNN features describing the underlying objects. This representation
provides robustness to spatial layout of the objects in the scene and achieves
invariance to general geometric transformations, such as translation, rotation
and scaling. The proposed approach also leads to a compact representation of
the scene, making each image occupy a smaller memory footprint. Experiments
show that the proposed representation achieves state of the art retrieval
results on a set of challenging benchmark image datasets, while maintaining a
compact representation.Comment: Deep Vision 201
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
We conduct an in-depth exploration of different strategies for doing event
detection in videos using convolutional neural networks (CNNs) trained for
image classification. We study different ways of performing spatial and
temporal pooling, feature normalization, choice of CNN layers as well as choice
of classifiers. Making judicious choices along these dimensions led to a very
significant increase in performance over more naive approaches that have been
used till now. We evaluate our approach on the challenging TRECVID MED'14
dataset with two popular CNN architectures pretrained on ImageNet. On this
MED'14 dataset, our methods, based entirely on image-trained CNN features, can
outperform several state-of-the-art non-CNN models. Our proposed late fusion of
CNN- and motion-based features can further increase the mean average precision
(mAP) on MED'14 from 34.95% to 38.74%. The fusion approach achieves the
state-of-the-art classification performance on the challenging UCF-101 dataset
Mixed Supervised Object Detection with Robust Objectness Transfer
In this paper, we consider the problem of leveraging existing fully labeled
categories to improve the weakly supervised detection (WSD) of new object
categories, which we refer to as mixed supervised detection (MSD). Different
from previous MSD methods that directly transfer the pre-trained object
detectors from existing categories to new categories, we propose a more
reasonable and robust objectness transfer approach for MSD. In our framework,
we first learn domain-invariant objectness knowledge from the existing fully
labeled categories. The knowledge is modeled based on invariant features that
are robust to the distribution discrepancy between the existing categories and
new categories; therefore the resulting knowledge would generalize well to new
categories and could assist detection models to reject distractors (e.g.,
object parts) in weakly labeled images of new categories. Under the guidance of
learned objectness knowledge, we utilize multiple instance learning (MIL) to
model the concepts of both objects and distractors and to further improve the
ability of rejecting distractors in weakly labeled images. Our robust
objectness transfer approach outperforms the existing MSD methods, and achieves
state-of-the-art results on the challenging ILSVRC2013 detection dataset and
the PASCAL VOC datasets.Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(2019). Together with Supplementary Materials. Note: The author list in
Google Scholar is INCORRECT. The right author list is 1) Yan Li, 2) Junge
Zhang, 3) Kaiqi Huang and 4) Jianguo Zhang. The official published version
can be found in https://ieeexplore.ieee.org/abstract/document/830462
Monocular SLAM Supported Object Recognition
In this work, we develop a monocular SLAM-aware object recognition system
that is able to achieve considerably stronger recognition performance, as
compared to classical object recognition systems that function on a
frame-by-frame basis. By incorporating several key ideas including multi-view
object proposals and efficient feature encoding methods, our proposed system is
able to detect and robustly recognize objects in its environment using a single
RGB camera in near-constant time. Through experiments, we illustrate the
utility of using such a system to effectively detect and recognize objects,
incorporating multiple object viewpoint detections into a unified prediction
hypothesis. The performance of the proposed recognition system is evaluated on
the UW RGB-D Dataset, showing strong recognition performance and scalable
run-time performance compared to current state-of-the-art recognition systems.Comment: Accepted to appear at Robotics: Science and Systems 2015, Rome, Ital
ClassCut for Unsupervised Class Segmentation
Abstract. We propose a novel method for unsupervised class segmentation on a set of images. It alternates between segmenting object instances and learning a class model. The method is based on a segmentation energy defined over all images at the same time, which can be optimized efficiently by techniques used before in interactive segmentation. Over iterations, our method progressively learns a class model by integrating observations over all images. In addition to appearance, this model captures the location and shape of the class with respect to an automatically determined coordinate frame common across images. This frame allows us to build stronger shape and location models, similar to those used in object class detection. Our method is inspired by interactive segmentation methods [1], but it is fully automatic and learns models characteristic for the object class rather than specific to one particular object/image. We experimentally demonstrate on the Caltech4, Caltech101, and Weizmann horses datasets that our method (a) transfers class knowledge across images and this improves results compared to segmenting every image independently; (b) outperforms Grabcut [1] for the task of unsupervised segmentation; (c) offers competitive performance compared to the state-of-the-art in unsupervised segmentation and in particular it outperforms the topic model [2].
- …