Shape Representations Using Nested Descriptors
The problem of shape representation is a core problem in computer vision. It can be argued that shape representation is the most central representational problem for computer vision, since unlike texture or color, shape alone can be used for perceptual tasks such as image matching, object detection and object categorization.
This dissertation introduces a new shape representation called the nested descriptor. A nested descriptor represents shape both globally and locally by pooling salient scaled and oriented complex gradients in a large nested support set. We show that this nesting property introduces a nested correlation structure that enables a new local distance function called the nesting distance, which provides a provably robust similarity function for image matching. Furthermore, the nesting property suggests an elegant flower-like normalization strategy called a log-spiral difference. We show that this normalization enables a compact binary representation and is equivalent to a form of bottom-up saliency. This suggests that the nested descriptor's representational power is due to representing salient edges, which makes a fundamental connection between the saliency and local feature descriptor literatures. In this dissertation, we introduce three examples of shape representation using nested descriptors: nested shape descriptors for imagery, nested motion descriptors for video, and nested pooling for activities. We show evaluation results for these representations that demonstrate state-of-the-art performance on image matching, wide-baseline stereo, and activity recognition tasks.
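To make the pooling-and-binarization idea concrete, here is a minimal Python sketch of a toy construction, not the paper's exact descriptor: it pools oriented gradient magnitudes over nested circular supports and binarizes scale-to-scale differences as a crude stand-in for the log-spiral difference.

```python
import numpy as np

def nested_descriptor_sketch(image, center, radii=(4, 8, 16, 32), n_orient=8):
    """Toy stand-in for a nested descriptor: pool oriented gradient
    magnitudes over nested circular supports, then binarize differences
    between adjacent scales (a crude proxy for the log-spiral difference)."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ori = np.arctan2(gy, gx)                                  # [-pi, pi]
    bins = ((ori + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient

    yy, xx = np.indices(image.shape)
    dist = np.hypot(yy - center[0], xx - center[1])

    pooled = np.zeros((len(radii), n_orient))
    for i, r in enumerate(radii):      # nested: each support contains all smaller ones
        for b in range(n_orient):
            pooled[i, b] = mag[(dist <= r) & (bins == b)].sum()

    diff = pooled[1:] - pooled[:-1]    # orientation energy gained between adjacent scales
    bits = diff > diff.mean(axis=1, keepdims=True)
    return bits.ravel()

# Binary descriptors compare cheaply with a Hamming distance:
# d = np.count_nonzero(desc_a != desc_b)
```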
Pick and Place Without Geometric Object Models
We propose a novel formulation of robotic pick and place as a deep reinforcement learning (RL) problem. Whereas most deep RL approaches to robotic manipulation frame the problem in terms of low-level states and actions, we propose a more abstract formulation. In this formulation, actions are target reach poses for the hand and states are a history of such reaches. We show this approach can solve a challenging class of pick-place and regrasping problems where the exact geometry of the objects to be handled is unknown. The only information our method requires is: 1) the sensor perception available to the robot at test time; 2) prior knowledge of the general class of objects for which the system was trained. We evaluate our method using objects belonging to two different categories, mugs and bottles, both in simulation and on real hardware. Results show a major improvement relative to a shape primitives baseline.
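As a rough illustration of this abstraction, here is a minimal Python sketch, assuming hypothetical `robot` and `sensor` interfaces that are not from the paper: the action is a target reach pose and the state carries the current observation plus the history of executed reaches, with no low-level joint state.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ReachPose:
    """A target 6-DoF reach pose for the hand."""
    position: np.ndarray      # (3,) xyz
    orientation: np.ndarray   # (3,) roll, pitch, yaw

@dataclass
class PickPlaceState:
    """Abstract state: current sensor perception plus the reach history,
    rather than low-level joint positions and velocities."""
    observation: np.ndarray                      # e.g. a depth image
    reach_history: List[ReachPose] = field(default_factory=list)

def step(state: PickPlaceState, action: ReachPose, robot, sensor) -> PickPlaceState:
    """One abstract RL step: execute the reach, then re-observe.
    `robot.move_hand_to` and `sensor.capture` are assumed interfaces."""
    robot.move_hand_to(action.position, action.orientation)
    return PickPlaceState(sensor.capture(), state.reach_history + [action])
```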
Detecting the presence of large buildings in natural images
This paper addresses the issue of classification of low-level features into high-level semantic concepts for the purpose of semantic annotation of consumer photographs. We adopt a multi-scale approach that relies on edge detection to extract an edge orientation-based feature description of the image, and apply an SVM learning technique to infer the presence of a dominant building object in a general-purpose collection of digital photographs. The approach exploits prior knowledge of the image context through the assumption that all input images are 'outdoor', i.e. that indoor/outdoor classification (the context determination stage) has already been performed. The proposed approach is validated on a diverse dataset of 1720 images and its performance compared with that of the MPEG-7 edge histogram descriptor.
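A minimal sketch of the feature-plus-classifier pipeline, assuming a simple multi-scale edge orientation histogram as a stand-in for the paper's exact descriptor (the threshold, scales, and bin counts here are illustrative, not from the paper):

```python
import numpy as np
from scipy import ndimage
from sklearn.svm import SVC

def edge_orientation_histogram(image, n_bins=5, edge_thresh=20.0, scales=(0, 2, 4)):
    """Multi-scale edge orientation histogram: smooth the image at a few
    scales, keep strong-gradient (edge) pixels, and histogram their
    orientations folded into [0, pi)."""
    feats = []
    for s in scales:
        img = ndimage.gaussian_filter(image.astype(float), s) if s else image.astype(float)
        gy, gx = np.gradient(img)
        mag = np.hypot(gx, gy)
        ori = np.mod(np.arctan2(gy, gx), np.pi)
        hist, _ = np.histogram(ori[mag > edge_thresh], bins=n_bins, range=(0, np.pi))
        feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

# Hypothetical usage with grayscale outdoor images and building labels:
# X = np.stack([edge_orientation_histogram(img) for img in images])
# clf = SVC(kernel="rbf").fit(X, labels)
```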
An improved image segmentation algorithm for salient object detection
Semantic object detection is one of the most important and challenging problems in image analysis. Segmentation is an optimal approach to detecting salient objects, but it often fails to generate meaningful regions due to over-segmentation. This paper presents an improved semantic segmentation approach which is based on the JSEG algorithm and utilizes multiple region-merging criteria. The experimental results are encouraging and demonstrate that the proposed algorithm is effective for salient object detection.
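The abstract does not spell out the merging criteria, but a plausible sketch of a multi-criteria merge test follows; the thresholds and region fields are assumptions for illustration only:

```python
import numpy as np

def should_merge(region_a, region_b, color_thresh=15.0, min_size=200):
    """One plausible multi-criteria test for merging two adjacent regions
    left by over-segmentation: merge if their mean colors are close, or
    if either region is too small to stand alone."""
    color_dist = np.linalg.norm(
        np.asarray(region_a["mean_color"]) - np.asarray(region_b["mean_color"]))
    too_small = min(region_a["size"], region_b["size"]) < min_size
    return color_dist < color_thresh or too_small
```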
A deep representation for depth images from synthetic data
Convolutional Neural Networks (CNNs) trained on large-scale RGB databases have become the secret sauce in the majority of recent approaches for object categorization from RGB-D data. Thanks to colorization techniques, these methods exploit the filters learned from 2D images to extract meaningful representations in 2.5D. Still, the perceptual signature of these two kinds of images is very different, with the first usually strongly characterized by textures and the second mostly by silhouettes of objects. Ideally, one would like to have two CNNs, one for RGB and one for depth, each trained on a suitable data collection and able to capture the perceptual properties of each channel for the task at hand. This has not been possible so far, due to the lack of a suitable depth database. This paper addresses this issue, proposing to opt for synthetically generated images rather than collecting a large-scale 2.5D database by hand. While clearly a proxy for real data, synthetic images allow us to trade quality for quantity, making it possible to generate a virtually infinite amount of data. We show that training the very same architecture typically used on visual data on such a collection yields very different filters, resulting in depth features that are (a) better able to characterize the different facets of depth images, and (b) complementary to those derived from CNNs pre-trained on 2D datasets. Experiments on two publicly available databases show the power of our approach.
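To illustrate the colorization trick the abstract relies on, here is a minimal PyTorch sketch, assuming a simple normalize-and-replicate depth encoding and an ImageNet-pretrained ResNet-18 as the 2D backbone; the paper's own architecture and colorization scheme may differ.

```python
import numpy as np
import torch
from torchvision import models, transforms

def colorize_depth(depth):
    """Map a single-channel depth image into 3 channels so an
    RGB-pretrained CNN can consume it (simplest scheme: normalize
    to [0, 1] and replicate across channels)."""
    d = (depth - depth.min()) / max(np.ptp(depth), 1e-6)
    return np.stack([d, d, d], axis=-1).astype(np.float32)

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled features
backbone.eval()

depth = np.random.rand(480, 640).astype(np.float32)   # placeholder depth map
with torch.no_grad():
    features = backbone(preprocess(colorize_depth(depth)).unsqueeze(0))
```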
A model of ganglion axon pathways accounts for percepts elicited by retinal implants.
Degenerative retinal diseases such as retinitis pigmentosa and macular degeneration cause irreversible vision loss in more than 10 million people worldwide. Retinal prostheses, now implanted in over 250 patients worldwide, electrically stimulate surviving cells in order to evoke neuronal responses that are interpreted by the brain as visual percepts ('phosphenes'). However, instead of seeing focal spots of light, current implant users perceive highly distorted phosphenes that vary in shape both across subjects and across electrodes. We characterized these distortions by asking users of the Argus retinal prosthesis system (Second Sight Medical Products Inc.) to draw electrically elicited percepts on a touchscreen. Using ophthalmic fundus imaging and computational modeling, we show that elicited percepts can be accurately predicted by the topographic organization of optic nerve fiber bundles in each subject's retina, successfully replicating visual percepts ranging from 'blobs' to oriented 'streaks' and 'wedges' depending on the retinal location of the stimulating electrode. This provides the first evidence that activation of passing axon fibers accounts for the rich repertoire of phosphene shapes commonly reported in psychophysical experiments, which can severely distort the quality of the generated visual experience. Overall, our findings argue for more detailed modeling of the underlying biology across neural engineering applications.
Cumulative object categorization in clutter
In this paper we present an approach based on scene- or part-graphs for geometrically categorizing touching and occluded objects. We use additive RGBD feature descriptors and hashing of graph configuration parameters to describe the spatial arrangement of constituent parts. The presented experiments quantify that this method outperforms our earlier part-voting and sliding-window classification. We evaluated our approach on cluttered scenes, using a 3D dataset containing over 15,000 Kinect scans of over 100 objects grouped into general geometric categories. Additionally, color, geometric, and combined features were compared for categorization tasks.
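As a rough illustration of hashing a part-graph's configuration, here is a toy Python sketch, assuming we reduce each part to its centroid and hash the quantized multiset of pairwise distances; the paper's actual configuration parameters are richer.

```python
import numpy as np

def configuration_hash(part_centroids, bin_size=0.05, n_bins=16):
    """Toy spatial-arrangement hash: quantize all pairwise centroid
    distances and hash the sorted multiset, so part sets with similar
    spatial configurations fall into the same bucket."""
    c = np.asarray(part_centroids, dtype=float)
    pairwise = np.linalg.norm(c[:, None] - c[None], axis=2)
    iu = np.triu_indices(len(c), k=1)
    quantized = np.minimum((pairwise[iu] / bin_size).astype(int), n_bins - 1)
    return hash(tuple(sorted(quantized.tolist())))

# Two part sets with the same arrangement hash to the same bucket:
# configuration_hash([[0,0,0],[0.1,0,0]]) == configuration_hash([[1,1,1],[1.1,1,1]])
```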
Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization
Many robotics applications require precise pose estimates despite operating in large and changing environments. This can be addressed by visual localization, using a pre-computed 3D model of the surroundings. Pose estimation then amounts to finding correspondences between 2D keypoints in a query image and 3D points in the model using local descriptors. However, computational power is often limited on robotic platforms, making this task challenging in large-scale environments. Binary feature descriptors significantly speed up this 2D-3D matching and have become popular in the robotics community, but they also strongly impair robustness to perceptual aliasing and to changes in viewpoint, illumination, and scene structure. In this work, we propose to leverage recent advances in deep learning to perform efficient hierarchical localization. We first localize at the map level using learned image-wide global descriptors, and subsequently estimate a precise pose from 2D-3D matches computed in the candidate places only. This restricts the local search and thus allows us to efficiently exploit powerful non-binary descriptors usually dismissed on resource-constrained devices. Our approach achieves state-of-the-art localization performance while running in real time on a popular mobile platform, enabling new prospects for robotics research.
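A minimal NumPy sketch of the two-stage idea, assuming precomputed global descriptors for the database images and per-place 3D points with local descriptors; the names and the mutual-nearest-neighbor matcher are illustrative, and a real system would finish with PnP + RANSAC:

```python
import numpy as np

def hierarchical_localize(query_global, query_local, db_global, db_places, k=5):
    """Stage 1: retrieve k candidate places by cosine similarity of
    image-wide global descriptors. Stage 2: run expensive local 2D-3D
    descriptor matching only inside those candidates."""
    sims = db_global @ query_global / (
        np.linalg.norm(db_global, axis=1) * np.linalg.norm(query_global) + 1e-9)
    candidates = np.argsort(-sims)[:k]

    best = None
    for idx in candidates:
        pts3d, desc3d = db_places[idx]          # 3D points and their local descriptors
        d = np.linalg.norm(query_local[:, None] - desc3d[None], axis=2)
        nn12, nn21 = d.argmin(axis=1), d.argmin(axis=0)
        matches = [(i, j) for i, j in enumerate(nn12) if nn21[j] == i]
        if best is None or len(matches) > best[1]:
            best = (idx, len(matches), matches)
    return best   # feed the matches to PnP + RANSAC for the final pose
```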