Appearance modeling under geometric context for object recognition in videos
Object recognition is a very important high-level task in
surveillance applications. This dissertation focuses on building
appearance models for object recognition and exploring the
relationship between shape and appearance for two key types of
objects: humans and vehicles. The dissertation proposes a generic
framework that models the appearance while incorporating certain
geometric prior information, or the so-called geometric
context. Then under this framework, special methods are developed
for recognizing humans and vehicles based on their appearance and
shape attributes in surveillance videos.
The first part of the dissertation presents a unified framework
based on a general definition of geometric transform (GeT) which is
applied to modeling object appearances under geometric context. The
GeT models the appearance by applying designed functionals over
certain geometric sets. GeT unifies the Radon transform, the trace
transform, image warping, etc. Moreover, five novel types of GeTs are
introduced and applied to fingerprinting the appearance inside a
contour. They include GeT based on level sets, GeT based on shape
matching, GeT based on feature curves, GeT invariant to occlusion,
and a multi-resolution GeT (MRGeT) that combines both shape and
appearance information.
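The idea of "applying designed functionals over certain geometric sets" can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `geometric_transform` and `axis_aligned_lines` are hypothetical, and the geometric sets are assumed to be only the horizontal and vertical lines of a small image. With a sum functional this gives a Radon-like projection restricted to two angles; swapping in another functional (here, the maximum) gives a trace-transform-like variant.

```python
import numpy as np

def geometric_transform(image, geometric_sets, functional):
    """Apply `functional` to the pixel values inside each geometric set.

    `geometric_sets` is a list of boolean masks (an assumption made for
    this sketch); each mask selects the pixels belonging to one set.
    """
    return np.array([functional(image[mask]) for mask in geometric_sets])

def axis_aligned_lines(shape):
    """Boolean masks for every horizontal line, then every vertical line."""
    rows, cols = shape
    masks = []
    for r in range(rows):
        mask = np.zeros(shape, dtype=bool)
        mask[r, :] = True
        masks.append(mask)
    for c in range(cols):
        mask = np.zeros(shape, dtype=bool)
        mask[:, c] = True
        masks.append(mask)
    return masks

# Toy 3x4 image with pixel values 0..11.
image = np.arange(12, dtype=float).reshape(3, 4)
lines = axis_aligned_lines(image.shape)

# Sum over each line: a Radon-like projection at two angles.
radon_like = geometric_transform(image, lines, np.sum)
# Maximum over each line: a different functional on the same sets.
trace_like = geometric_transform(image, lines, np.max)
```

Other geometric sets, such as level sets inside a contour, would fit the same interface by supplying different masks.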
The second part focuses on how to use the GeT to build appearance
models for objects like walking humans, which have articulated
motion of body parts. This part also illustrates the application of
GeT for object recognition, image segmentation, video retrieval, and
image synthesis. The proposed approach produces promising results
when applied to automatic body part segmentation and fingerprinting
the appearance of a human and body parts despite the presence of
non-rigid deformations and articulated motion.
It is very important to understand the 3D structure of vehicles in
order to recognize them. To reconstruct the 3D model of a vehicle,
the third part presents a factorization method for structure from
planar motion. Experimental results show that the algorithm
is accurate and fairly robust to noise and inaccurate calibration.
Differences and the dual relationship between planar motion and
planar object are also clarified in this part. Based on our method,
a fully automated vehicle reconstruction system has been designed
Relation Networks for Object Detection
Although it has long been believed that modeling relations between
objects would help object recognition, there has been no evidence that the
idea works in the deep learning era. All state-of-the-art object detection
systems still rely on recognizing object instances individually, without
exploiting their relations during learning.
This work proposes an object relation module. It processes a set of objects
simultaneously through interaction between their appearance feature and
geometry, thus allowing modeling of their relations. It is lightweight and
in-place. It does not require additional supervision and is easy to embed in
existing networks. It is shown to be effective in improving object recognition and
duplicate removal steps in the modern object detection pipeline. It verifies
the efficacy of modeling object relations in CNN based detection. It gives rise
to the first fully end-to-end object detector
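A rough sketch of how a set of objects might interact through appearance and geometry is given below. It is a simplified, single-head, attention-style illustration and not the authors' implementation: the function name `relation_module`, the four-dimensional log-ratio geometry feature, and all weight matrices (`w_q`, `w_k`, `w_v`, `w_g`) are assumptions chosen for the example. The output has the same shape as the input, which is one way to read "in-place".

```python
import numpy as np

def relation_module(appearance, boxes, w_q, w_k, w_v, w_g):
    """Single-head relation sketch (illustrative assumptions throughout).

    appearance: (N, d) appearance features, one row per object.
    boxes:      (N, 4) boxes as (center_x, center_y, width, height).
    Returns the updated features (N, d) and the relation weights (N, N).
    """
    # Appearance term: scaled dot product between projected features.
    q = appearance @ w_q
    k = appearance @ w_k
    w_app = (q @ k.T) / np.sqrt(q.shape[1])

    # Geometry term: log-scaled relative position and size of each pair.
    x, y, w, h = boxes.T
    dx = np.log(np.maximum(np.abs(x[None, :] - x[:, None]) / w[:, None], 1e-3))
    dy = np.log(np.maximum(np.abs(y[None, :] - y[:, None]) / h[:, None], 1e-3))
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    geo = np.stack([dx, dy, dw, dh], axis=-1)      # (N, N, 4)
    w_geo = np.maximum(geo @ w_g, 0.0)             # ReLU gate, (N, N)

    # Combine both terms and normalize over the other objects (softmax).
    logits = np.log(np.maximum(w_geo, 1e-6)) + w_app
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Residual update keeps the feature dimension unchanged.
    updated = appearance + weights @ (appearance @ w_v)
    return updated, weights

# Toy usage with random features and weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_objects, dim = 5, 8
appearance = rng.normal(size=(n_objects, dim))
boxes = np.column_stack([
    rng.uniform(0, 100, size=n_objects),   # center_x
    rng.uniform(0, 100, size=n_objects),   # center_y
    rng.uniform(10, 50, size=n_objects),   # width
    rng.uniform(10, 50, size=n_objects),   # height
])
w_q = rng.normal(size=(dim, dim))
w_k = rng.normal(size=(dim, dim))
w_v = rng.normal(size=(dim, dim)) * 0.1
w_g = rng.normal(size=4)

updated, weights = relation_module(appearance, boxes, w_q, w_k, w_v, w_g)
```

Because the update is a residual on the input features and needs no extra labels, a block like this could in principle be dropped between existing layers of a detector.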
Lifting GIS Maps into Strong Geometric Context for Scene Understanding
Contextual information can have a substantial impact on the performance of
visual tasks such as semantic segmentation, object detection, and geometric
estimation. Data stored in Geographic Information Systems (GIS) offers a rich
source of contextual information that has been largely untapped by computer
vision. We propose to leverage such information for scene understanding by
combining GIS resources with large sets of unorganized photographs using
Structure from Motion (SfM) techniques. We present a pipeline to quickly
generate strong 3D geometric priors from 2D GIS data using SfM models aligned
with minimal user input. Given an image resectioned against this model, we
generate robust predictions of depth, surface normals, and semantic labels. We
show that the predicted geometry is substantially more
accurate than that of other single-image depth estimation methods. We then demonstrate the
utility of these contextual constraints for re-scoring pedestrian detections,
and use these GIS contextual features alongside object detection score maps to
improve a CRF-based semantic segmentation framework, boosting accuracy over
baseline models