Learning to Generate and Refine Object Proposals
Visual object recognition is a fundamental and challenging
problem in computer vision. To build a practical recognition
system, one is first confronted with high computational complexity
due to the enormous search space in an image, which is caused by
large variations in object appearance, pose and mutual occlusion,
as well as other environmental factors. To reduce the search
complexity, a moderate set of image regions that are likely to
contain an object, regardless of its category, is usually generated
first in modern object recognition subsystems. These candidate
regions are called object proposals, object hypotheses or
object candidates, and can be used for downstream
classification or global reasoning in many vision tasks
such as object detection, segmentation and tracking.
This thesis addresses the problem of object proposal generation,
including bounding box and segment proposal generation, in
real-world scenarios. In particular, we investigate
representation learning for object proposal generation with 3D
cues and contextual information, aiming to produce higher-quality
object candidates with higher object recall, better
boundary coverage and fewer proposals. We focus on three main
issues: 1) how can we incorporate additional geometric and
high-level semantic context information into proposal
generation for stereo images? 2) how do we generate object
segment proposals for stereo images with learned representations
and a learned grouping process? and 3) how can we learn a
context-driven representation to refine segment proposals
efficiently?
In this thesis, we propose a series of solutions to address each
of the raised problems. We first propose a semantic context and
depth-aware object proposal generation method. We design a set of
new cues to encode the objectness, and then train an efficient
random forest classifier to re-rank the initial proposals and
linear regressors to fine-tune their locations. Next, we extend
the task to segment proposal generation in the same setting
and develop a learning-based segment proposal generation method
for stereo images. Our method makes use of learned deep features
and designed geometric features to represent a region and learns
a similarity network to guide the superpixel grouping process. We
also learn a ranking network to predict the objectness score for
each segment proposal. To address the third problem, we take a
transformation-based approach to improve the quality of a given
segment candidate pool based on context information. We propose
an efficient deep network that learns affine transformations to
warp an initial object mask towards a nearby object region, based
on a novel feature pooling strategy. Finally, we extend our
affine warping approach to address the object-mask alignment
problem and particularly the problem of refining a set of segment
proposals. We design an end-to-end deep spatial transformer
network that learns free-form deformations (FFDs) to non-rigidly
warp the shape mask towards the ground truth, based on a
multi-level dual mask feature pooling strategy. We evaluate all
our approaches on several publicly available object recognition
datasets and show superior performance.
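As a rough illustration of the mask-warping step described in the abstract above, the sketch below applies a fixed 2x3 affine transformation to a binary mask using inverse mapping with nearest-neighbour sampling. The function name `warp_mask` and the hand-written matrix are assumptions made for illustration; in the thesis the transformation parameters are predicted by a deep network, which is not reproduced here.

```python
def warp_mask(mask, matrix):
    """Warp a binary mask (list of rows) with a 2x3 affine matrix.

    `matrix` maps a source pixel (x, y) to a destination pixel (x', y'):
        x' = a*x + b*y + tx
        y' = c*x + d*y + ty
    Hypothetical helper: the parameters are given by hand here, not learned.
    """
    (a, b, tx), (c, d, ty) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine matrix is not invertible")
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y_out in range(height):
        for x_out in range(width):
            # Invert the affine map to find which source pixel lands here.
            dx, dy = x_out - tx, y_out - ty
            x_src = round((d * dx - b * dy) / det)
            y_src = round((-c * dx + a * dy) / det)
            # Pixels that map from outside the image stay background (0).
            if 0 <= x_src < width and 0 <= y_src < height:
                out[y_out][x_out] = mask[y_src][x_src]
    return out


# Usage: shift a one-pixel mask by one pixel right and one pixel down.
mask = [[1, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
shifted = warp_mask(mask, ((1, 0, 1), (0, 1, 1)))
```

Inverse mapping is used so that every output pixel receives exactly one value, which avoids the holes that forward mapping can leave when the transformation scales the mask up.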
Object Referring in Visual Scene with Spoken Language
Object referring has important applications, especially for human-machine
interaction. While having received great attention, the task is mainly attacked
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully breaking down the ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise.
Comment: 10 pages, Submitted to WACV 201
Online Object Tracking with Proposal Selection
Tracking-by-detection approaches are some of the most successful object
trackers in recent years. Their success is largely determined by the detector
model they learn initially and then update over time. However, under
challenging conditions where an object can undergo transformations, e.g.,
severe rotation, these methods are found to be lacking. In this paper, we
address this problem by formulating it as a proposal selection task and making
two contributions. The first one is introducing novel proposals estimated from
the geometric transformations undergone by the object, and building a rich
candidate set for predicting the object location. The second one is devising a
novel selection strategy using multiple cues, i.e., detection score and
edgeness score computed from state-of-the-art object edges and motion
boundaries. We extensively evaluate our approach on the visual object tracking
2014 challenge and online tracking benchmark datasets, and show the best
performance.
Comment: ICCV 201
Unconstrained salient object detection via proposal subset optimization
We aim at detecting salient objects in unconstrained images. In unconstrained images, the number of salient objects (if any) varies from image to image, and is not given. We present a salient object detection system that directly outputs a compact set of detection windows, if any, for an input image. Our system leverages a Convolutional-Neural-Network model to generate location proposals of salient objects. Location proposals tend to be highly overlapping and noisy. Based on the Maximum a Posteriori principle, we propose a novel subset optimization framework to generate a compact set of detection windows out of noisy proposals. In experiments, we show that our subset optimization formulation greatly enhances the performance of our system, and our system attains 16-34% relative improvement in Average Precision compared with the state-of-the-art on three challenging salient object datasets.
http://openaccess.thecvf.com/content_cvpr_2016/html/Zhang_Unconstrained_Salient_Object_CVPR_2016_paper.html
Published version
DeepBox: Learning Objectness with Convolutional Networks
Existing object proposal approaches use primarily bottom-up cues to rank
proposals, while we believe that objectness is in fact a high level construct.
We argue for a data-driven, semantic approach for ranking object proposals. Our
framework, which we call DeepBox, uses convolutional neural networks (CNNs) to
rerank proposals from a bottom-up method. We use a novel four-layer CNN
architecture that is as good as much larger networks on the task of evaluating
objectness while being much faster. We show that DeepBox significantly improves
over the bottom-up ranking, achieving the same recall with 500 proposals as
achieved by bottom-up methods with 2000. This improvement generalizes to
categories the CNN has never seen before and leads to a 4.5-point gain in
detection mAP. Our implementation achieves this performance while running at
260 ms per image.
Comment: ICCV 2015 Camera-ready version
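To make the recall comparison in the DeepBox abstract concrete, the sketch below shows one common way to measure recall for the top-N proposals after re-ranking. The helper names (`iou`, `rerank`, `recall_at_n`), the 0.5 IoU threshold and the toy boxes and scores are all assumptions for illustration; the scores stand in for the output of a learned objectness model and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rerank(proposals, scores):
    """Sort proposals by descending objectness score."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    return [proposals[i] for i in order]


def recall_at_n(ranked_proposals, ground_truth, n, iou_threshold=0.5):
    """Fraction of ground-truth boxes covered by one of the top-n proposals."""
    if not ground_truth:
        return 0.0
    top = ranked_proposals[:n]
    hits = sum(
        1 for gt in ground_truth
        if any(iou(p, gt) >= iou_threshold for p in top)
    )
    return hits / len(ground_truth)


# Toy data: the initial (bottom-up) order puts a useless box first.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(50, 50, 60, 60), (0, 0, 10, 10), (21, 21, 31, 31)]
scores = [0.1, 0.9, 0.5]  # hypothetical objectness scores
ranked = rerank(proposals, scores)
before = recall_at_n(proposals, ground_truth, n=1)
after = recall_at_n(ranked, ground_truth, n=1)
```

A better ranking moves good boxes towards the front of the list, so the same recall is reached with a smaller proposal budget N.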
Relation Networks for Object Detection
Although it has long been believed that modeling relations between
objects would help object recognition, there has been no evidence that the
idea works in the deep learning era. All state-of-the-art object detection
systems still rely on recognizing object instances individually, without
exploiting their relations during learning.
This work proposes an object relation module. It processes a set of objects
simultaneously through interaction between their appearance feature and
geometry, thus allowing modeling of their relations. It is lightweight and
in-place. It does not require additional supervision and is easy to embed in
existing networks. It is shown to be effective at improving object recognition and
duplicate removal steps in the modern object detection pipeline. It verifies
the efficacy of modeling object relations in CNN-based detection. It gives rise
to the first fully end-to-end object detector.
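The abstract describes a module in which each object's feature is updated through interaction between appearance and geometry across the whole object set. A minimal attention-style sketch of that idea follows. It assumes a scaled dot product for the appearance affinity, a hand-made distance-based geometry weight, and no learned projections; the function names and the exact form of `geometry_weight` are illustrative assumptions, not the paper's formulation.

```python
import math


def geometry_weight(box_a, box_b):
    """Hypothetical geometry term: decays with centre distance relative to box size."""
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    scale = math.sqrt((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])) or 1.0
    return math.exp(-math.hypot(cxa - cxb, cya - cyb) / scale)


def relation_weights(appearance, boxes):
    """Row i holds normalised weights of every object j with respect to object i."""
    dim = len(appearance[0])
    weights = []
    for feat_i, box_i in zip(appearance, boxes):
        # Appearance affinity: scaled dot product between feature vectors.
        affinities = [
            sum(a * b for a, b in zip(feat_i, feat_j)) / math.sqrt(dim)
            for feat_j in appearance
        ]
        top = max(affinities)  # subtracted for numerical stability
        raw = [
            geometry_weight(box_i, box_j) * math.exp(aff - top)
            for box_j, aff in zip(boxes, affinities)
        ]
        total = sum(raw)
        weights.append([r / total for r in raw])
    return weights


def relation_features(appearance, boxes):
    """Aggregate the features of all objects for each object, using the weights."""
    weights = relation_weights(appearance, boxes)
    dim = len(appearance[0])
    return [
        [sum(w * feat[k] for w, feat in zip(row, appearance)) for k in range(dim)]
        for row in weights
    ]


# Toy data: objects 0 and 1 overlap and look alike, object 2 is far away.
appearance = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (100, 100, 110, 110)]
weights = relation_weights(appearance, boxes)
features = relation_features(appearance, boxes)
```

The output has the same dimension as the input features, which is what lets a module of this kind be dropped into an existing network without changing the layers around it.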
From Facial Parts Responses to Face Detection: A Deep Learning Approach
In this paper, we propose a novel deep convolutional network (DCN) that
achieves outstanding performance on FDDB, PASCAL Face, and AFW. Specifically,
our method achieves a high recall rate of 90.99% on the challenging FDDB
benchmark, outperforming the state-of-the-art method by a large margin of
2.91%. Importantly, we consider finding faces from a new perspective through
scoring facial parts responses by their spatial structure and arrangement. The
scoring mechanism is carefully formulated considering challenging cases where
faces are only partially visible. This consideration allows our network to
detect faces under severe occlusion and unconstrained pose variation, which are
the main difficulty and bottleneck of most existing face detection approaches.
We show that despite the use of DCN, our network can achieve practical runtime
speed.
Comment: To appear in ICCV 201