Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers
Scene parsing, or semantic segmentation, consists of labeling each pixel in
an image with the category of the object it belongs to. It is a challenging
task that involves the simultaneous detection, segmentation and recognition of
all the objects in the image.
The scene parsing method proposed here starts by computing a tree of segments
from a graph of pixel dissimilarities. Simultaneously, a set of dense feature
vectors is computed which encodes regions of multiple sizes centered on each
pixel. The feature extractor is a multiscale convolutional network trained from
raw pixels. The feature vectors associated with the segments covered by each
node in the tree are aggregated and fed to a classifier which produces an
estimate of the distribution of object categories contained in the segment. A
subset of tree nodes that cover the image are then selected so as to maximize
the average "purity" of the class distributions, hence maximizing the overall
likelihood that each segment will contain a single object. The convolutional
network feature extractor is trained end-to-end from raw pixels, alleviating
the need for engineered features. After training, the system is parameter free.
The system yields record accuracies on the Stanford Background Dataset (8
classes), the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170
classes) while being an order of magnitude faster than competing approaches,
producing a 320 × 240 image labeling in less than 1 second.
Comment: 9 pages, 4 figures - Published in the 29th International Conference on Machine Learning (ICML 2012), Jun 2012, Edinburgh, United Kingdom
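The cover-selection step described in the abstract — choosing disjoint tree nodes that cover the image while maximizing purity — can be sketched as a recursion over the segment tree. This is a minimal illustration, not the paper's implementation: the `Node` class, the pixel-weighted scoring, and the purity values are all assumptions for the sketch (the paper defines purity from the classifier's estimated class distributions).

```python
# Hypothetical sketch of selecting an optimal cover from a segment tree.
# For each subtree we compare keeping the node itself against the union
# of the best covers of its children, scored by pixel-weighted purity.

class Node:
    def __init__(self, size, purity, children=()):
        self.size = size          # number of pixels in the segment
        self.purity = purity      # estimated class purity in [0, 1]
        self.children = list(children)

def best_cover(node):
    """Return (score, cover): the pixel-weighted purity of the best
    disjoint cover of this subtree, and the list of nodes forming it."""
    own_score = node.size * node.purity
    if not node.children:
        return own_score, [node]
    child_results = [best_cover(c) for c in node.children]
    child_score = sum(score for score, _ in child_results)
    if own_score >= child_score:
        return own_score, [node]
    cover = [n for _, nodes in child_results for n in nodes]
    return child_score, cover

# Toy tree: an impure root whose two children are each purer segments,
# so the cover keeps the children rather than the root.
root = Node(10, 0.5, [Node(6, 0.9), Node(4, 0.8)])
score, cover = best_cover(root)
```

Because every pixel belongs to exactly one node in the returned cover, maximizing this weighted sum corresponds to maximizing the average per-pixel purity over the image.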
Indoor Semantic Segmentation using depth information
This work addresses multi-class segmentation of indoor scenes with RGB-D
inputs. While this area of research has gained much attention recently, most
works still rely on hand-crafted features. In contrast, we apply a multiscale
convolutional network to learn features directly from the images and the depth
information. We obtain state-of-the-art accuracy of 64.5% on the NYU-v2
depth dataset. We illustrate the labeling of indoor scenes in video
sequences that could be processed in real-time using appropriate hardware
such as an FPGA.
Comment: 8 pages, 3 figures
Learned versus Hand-Designed Feature Representations for 3d Agglomeration
For image recognition and labeling tasks, recent results suggest that machine
learning methods that rely on manually specified feature representations may be
outperformed by methods that automatically derive feature representations based
on the data. Yet for problems that involve analysis of 3d objects, such as mesh
segmentation, shape retrieval, or neuron fragment agglomeration, there remains
a strong reliance on hand-designed feature descriptors. In this paper, we
evaluate a large set of hand-designed 3d feature descriptors alongside features
learned from the raw data using both end-to-end and unsupervised learning
techniques, in the context of agglomeration of 3d neuron fragments. By
combining unsupervised learning techniques with a novel dynamic pooling scheme,
we show that purely learning-based methods are, for the first time, competitive with
hand-designed 3d shape descriptors. We investigate data augmentation strategies
for dramatically increasing the size of the training set, and show how
combining both learned and hand-designed features leads to the highest
accuracy.
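The final claim — that combining learned and hand-designed features gives the highest accuracy — can be illustrated with the simplest possible fusion: descriptor concatenation. This is an assumption for the sketch; the abstract does not specify how the two feature sets are combined.

```python
import numpy as np

# Hypothetical sketch: fuse learned and hand-designed 3d feature
# descriptors by concatenating them into one vector for a downstream
# classifier. The fusion mechanism here is an assumption, not the
# paper's method.

def combine_features(learned, hand_designed):
    """Concatenate two descriptor vectors into one classifier input."""
    return np.concatenate([np.asarray(learned, dtype=float),
                           np.asarray(hand_designed, dtype=float)])

# Toy example: a 2-d learned descriptor joined with a 3-d shape descriptor.
combined = combine_features([0.2, 0.7], [1.0, 0.0, 0.5])
```

The combined vector can then be fed to any standard classifier, letting it weigh the two representations against each other.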