Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers
Scene parsing, or semantic segmentation, consists of labeling each pixel in
an image with the category of the object it belongs to. It is a challenging
task that involves the simultaneous detection, segmentation and recognition of
all the objects in the image.
The scene parsing method proposed here starts by computing a tree of segments
from a graph of pixel dissimilarities. Simultaneously, a set of dense feature
vectors is computed which encodes regions of multiple sizes centered on each
pixel. The feature extractor is a multiscale convolutional network trained from
raw pixels. The feature vectors associated with the segments covered by each
node in the tree are aggregated and fed to a classifier which produces an
estimate of the distribution of object categories contained in the segment. A
subset of tree nodes that covers the image is then selected so as to maximize
the average "purity" of the class distributions, hence maximizing the overall
likelihood that each segment will contain a single object. The convolutional
network feature extractor is trained end-to-end from raw pixels, alleviating
the need for engineered features. After training, the system is parameter free.
The system yields record accuracies on the Stanford Background Dataset (8
classes), the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170
classes) while being an order of magnitude faster than competing approaches,
producing a 320 × 240 image labeling in less than 1 second.
Comment: 9 pages, 4 figures - Published in 29th International Conference on
Machine Learning (ICML 2012), Jun 2012, Edinburgh, United Kingdom
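The selection step described in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' implementation: "purity" is measured here as low entropy of a node's class distribution, and each leaf segment is labeled by the purest node on its path to the root. The names `entropy`, `label_leaves`, `parent` and `dist` are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy of a class distribution; lower means purer."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def label_leaves(parent, dist):
    """Label each leaf with the top class of its purest enclosing node.

    parent: dict mapping node -> parent node (None for the root).
    dist:   dict mapping node -> list of class probabilities.
    """
    inner = {p for p in parent.values() if p is not None}
    leaves = [n for n in parent if n not in inner]
    labels = {}
    for leaf in leaves:
        best, node = leaf, leaf
        # Walk from the leaf up to the root, keeping the lowest-entropy node.
        while node is not None:
            if entropy(dist[node]) < entropy(dist[best]):
                best = node
            node = parent[node]
        probs = dist[best]
        labels[leaf] = max(range(len(probs)), key=probs.__getitem__)
    return labels


# Toy tree: root "r" has children "a" and "b"; "a" has leaves "a1" and "a2".
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a"}
dist = {
    "r": [0.5, 0.5],
    "a": [0.9, 0.1],
    "b": [0.1, 0.9],
    "a1": [0.6, 0.4],
    "a2": [0.4, 0.6],
}
labels = label_leaves(parent, dist)
```

In this toy tree both ambiguous leaves under "a" inherit the label of their purer parent, which is the effect the purity criterion is meant to have.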
Indoor Semantic Segmentation using depth information
This work addresses multi-class segmentation of indoor scenes with RGB-D
inputs. While this area of research has gained much attention recently, most
works still rely on hand-crafted features. In contrast, we apply a multiscale
convolutional network to learn features directly from the images and the depth
information. We obtain state-of-the-art results on the NYU-v2 depth dataset
with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in video
sequences that could be processed in real-time using appropriate hardware such
as an FPGA.
Comment: 8 pages, 3 figures
Clustering Learning for Robotic Vision
We present the clustering learning technique applied to multi-layer
feedforward deep neural networks. We show that this unsupervised learning
technique can compute network filters in only a few minutes and with a much
reduced set of parameters. The goal of this paper is to promote the technique
for general-purpose robotic vision systems. We report its use in static image
datasets and object tracking datasets. We show that networks trained with
clustering learning can outperform large networks trained for many hours on
complex datasets.
Comment: Code for this paper is available here:
https://github.com/culurciello/CL_paper1_cod
Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos
This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on handcrafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. Using a frame-by-frame labeling, we obtain nearly state-of-the-art performance on the NYU-v2 depth dataset with an accuracy of 64.5%. We then show that the labeling can be further improved by exploiting the temporal consistency in the video sequence of the scene. To that end, we present a method producing temporally consistent superpixels from a streaming video. Among the different methods producing superpixel segmentations of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time by using a minimum spanning tree. In a framework exploiting minimum spanning trees all along, we propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, filling the need for causal and real-time applications. We illustrate the labeling of indoor scenes in video sequences that could be processed in real-time using appropriate hardware such as an FPGA.
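The greedy, minimum-spanning-tree style region merging mentioned in this abstract can be sketched for a single image graph as follows. This is a minimal illustration of the graph-based segmentation of Felzenszwalb and Huttenlocher, not the temporally consistent video method proposed in the paper. The function name `segment_graph`, the edge format and the scale parameter `k` are assumptions made for the example.

```python
def segment_graph(num_nodes, edges, k=1.0):
    """Greedy graph-based segmentation with a union-find structure.

    edges: list of (weight, u, v) tuples describing pixel dissimilarities.
    k:     scale parameter; larger values favor larger regions.
    Returns one component representative per node.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # largest merged edge weight per component

    def find(x):
        # Find the component root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Visit edges in increasing weight order, as in Kruskal's algorithm.
    for w, u, v in sorted(edges):
        a, b = find(u), find(v)
        if a == b:
            continue
        # Merge only if the edge is weak relative to both components.
        if w <= min(internal[a] + k / size[a], internal[b] + k / size[b]):
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a
            size[a] += size[b]
            internal[a] = w
    return [find(i) for i in range(num_nodes)]


# Four nodes in a chain: two similar pairs joined by one strong edge.
labels = segment_graph(4, [(0.1, 0, 1), (0.1, 2, 3), (5.0, 1, 2)], k=1.0)
```

Sorting the edges dominates the cost, which is where the quasi-linear running time comes from.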
Bridging the Gap Between Neural Networks and Neuromorphic Hardware with A Neural Network Compiler
Unlike developing neural networks (NNs) for general-purpose
processors, development for NN chips usually faces some
hardware-specific restrictions, such as limited precision of network signals
and parameters, constrained computation scale, and limited types of non-linear
functions.
This paper proposes a general methodology to address the challenges. We
decouple the NN applications from the target hardware by introducing a compiler
that can transform an existing trained, unrestricted NN into an equivalent
network that meets the given hardware's constraints. We propose multiple
techniques to make the transformation adaptable to different kinds of NN chips,
and reliable under strict hardware constraints.
We have built such a software tool that supports both spiking neural networks
(SNNs) and traditional artificial neural networks (ANNs). We have demonstrated
its effectiveness with a fabricated neuromorphic chip and a
processing-in-memory (PIM) design. Tests show that the inference error caused
by this solution is insignificant and the transformation time is much shorter
than the retraining time. Also, we have conducted parameter-sensitivity
evaluations to explore the tradeoffs between network error and resource
utilization for different transformation strategies, which could provide
insights for co-design optimization of neuromorphic hardware and software.
Comment: Accepted by ASPLOS 201
A 4D Light-Field Dataset and CNN Architectures for Material Recognition
We introduce a new light-field dataset of materials, and take advantage of
the recent success of deep learning to perform material recognition on the 4D
light-field. Our dataset contains 12 material categories, each with 100 images
taken with a Lytro Illum, from which we extract about 30,000 patches in total.
To the best of our knowledge, this is the first mid-size dataset for
light-field images. Our main goal is to investigate whether the additional
information in a light-field (such as multiple sub-aperture views and
view-dependent reflectance effects) can aid material recognition. Since
recognition networks have not been trained on 4D images before, we propose and
compare several novel CNN architectures to train on light-field images. In our
experiments, the best performing CNN architecture achieves a 7% boost compared
with 2D image classification (70% to 77%). These results constitute important
baselines that can spur further research in the use of CNNs for light-field
applications. Upon publication, our dataset also enables other novel
applications of light-fields, including object detection, image segmentation
and view interpolation.
Comment: European Conference on Computer Vision (ECCV) 201
Collaborative Layer-wise Discriminative Learning in Deep Neural Networks
Intermediate features at different layers of a deep neural network are known
to be discriminative for visual patterns of different complexities. However,
most existing works ignore such cross-layer heterogeneities when classifying
samples of different complexities. For example, if a training sample has
already been correctly classified at a specific layer with high confidence, we
argue that it is unnecessary to force the remaining layers to classify this sample
correctly and a better strategy is to encourage those layers to focus on other
samples.
In this paper, we propose a layer-wise discriminative learning method to
enhance the discriminative capability of a deep network by allowing its layers
to work collaboratively for classification. Towards this target, we introduce
multiple classifiers on top of multiple layers. Each classifier not only tries
to correctly classify the features from its input layer, but also coordinates
with other classifiers to jointly maximize the final classification
performance. Guided by the other companion classifiers, each classifier learns
to concentrate on certain training examples and boosts the overall performance.
Allowing for end-to-end training, our method can be conveniently embedded into
state-of-the-art deep networks. Experiments with multiple popular deep
networks, including Network in Network, GoogLeNet and VGGNet, on object
classification benchmarks of various scales, including CIFAR100, MNIST and ImageNet, and
scene classification benchmarks, including MIT67, SUN397 and Places205,
demonstrate the effectiveness of our method. In addition, we also analyze the
relationship between the proposed method and classical conditional random
field models.
Comment: To appear in ECCV 2016. May be subject to minor changes before
camera-ready version
Implementing Neural Networks Efficiently
Neural networks and machine learning algorithms in general require a flexible environment where new algorithm prototypes and experiments can be set up as quickly as possible with the best possible computational performance. To that end, we provide a new framework called Torch7 that is especially suited to achieving both of these competing goals. Torch7 is a versatile numeric computing framework and machine learning library that extends Lua, a very lightweight and powerful programming language. Its goal is to provide a flexible environment to design, train and deploy learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can also easily be interfaced to third-party software thanks to Lua's light C interface.
Learning Free-Form Deformations for 3D Object Reconstruction
Representing 3D shape in deep learning frameworks in an accurate, efficient
and compact manner still remains an open challenge. Most existing work
addresses this issue by employing voxel-based representations. While these
approaches benefit greatly from advances in computer vision by generalizing 2D
convolutions to the 3D setting, they also have several considerable drawbacks.
The computational complexity of voxel-encodings grows cubically with the
resolution thus limiting such representations to low-resolution 3D
reconstruction. In an attempt to solve this problem, point cloud
representations have been proposed. Although point clouds are more efficient
than voxel representations as they only cover surfaces rather than volumes,
they do not encode detailed geometric information about relationships between
points. In this paper we propose a method to learn free-form deformations (FFD)
for the task of 3D reconstruction from a single image. By learning to deform
points sampled from a high-quality mesh, our trained model can be used to
produce arbitrarily dense point clouds or meshes with fine-grained geometry. We
evaluate our proposed framework on both synthetic and real-world data and
achieve state-of-the-art results on point-cloud and volumetric metrics.
Additionally, we qualitatively demonstrate its applicability to label
transferring for 3D semantic segmentation.
Comment: 16 pages, 7 figures, 3 tables
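The abstract does not spell out the deformation model, so the following is only a rough sketch of a classical free-form deformation, assuming a trivariate Bernstein basis over a control lattice on the unit cube. In a learned setting the control point positions would be predicted by a network. Here they are set by hand, and the names `bernstein`, `ffd` and `regular_lattice` are hypothetical.

```python
from itertools import product
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def regular_lattice(n=2):
    """Undisplaced (n+1)^3 control lattice on the unit cube."""
    return {
        (i, j, k): (i / n, j / n, k / n)
        for i, j, k in product(range(n + 1), repeat=3)
    }


def ffd(point, control, n=2):
    """Deform a point in [0, 1]^3 using the given control lattice."""
    s, t, u = point
    out = [0.0, 0.0, 0.0]
    for i, j, k in product(range(n + 1), repeat=3):
        # Weight of this control point is a product of three 1D bases.
        w = bernstein(n, i, s) * bernstein(n, j, t) * bernstein(n, k, u)
        cp = control[(i, j, k)]
        for axis in range(3):
            out[axis] += w * cp[axis]
    return tuple(out)


p = (0.3, 0.5, 0.8)
same = ffd(p, regular_lattice())  # an undisplaced lattice leaves p in place

# Shifting every control point by 0.1 along x shifts the point by 0.1 too.
shifted = {key: (x + 0.1, y, z) for key, (x, y, z) in regular_lattice().items()}
moved = ffd(p, shifted)
```

Because the same lattice deforms any sampled point, arbitrarily many points from a template mesh can be pushed through `ffd`, which is consistent with the abstract's claim of arbitrarily dense output.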