A 4D Light-Field Dataset and CNN Architectures for Material Recognition
We introduce a new light-field dataset of materials, and take advantage of
the recent success of deep learning to perform material recognition on the 4D
light-field. Our dataset contains 12 material categories, each with 100 images
taken with a Lytro Illum, from which we extract about 30,000 patches in total.
To the best of our knowledge, this is the first mid-size dataset for
light-field images. Our main goal is to investigate whether the additional
information in a light-field (such as multiple sub-aperture views and
view-dependent reflectance effects) can aid material recognition. Since
recognition networks have not been trained on 4D images before, we propose and
compare several novel CNN architectures to train on light-field images. In our
experiments, the best performing CNN architecture achieves a 7% boost compared
with 2D image classification (70% to 77%). These results constitute important
baselines that can spur further research in the use of CNNs for light-field
applications. Upon publication, our dataset will also enable other novel
applications of light-fields, including object detection, image segmentation,
and view interpolation. Comment: European Conference on Computer Vision (ECCV) 201
Learning Free-Form Deformations for 3D Object Reconstruction
Representing 3D shape in deep learning frameworks in an accurate, efficient,
and compact manner remains an open challenge. Most existing work
addresses this issue by employing voxel-based representations. While these
approaches benefit greatly from advances in computer vision by generalizing 2D
convolutions to the 3D setting, they also have several considerable drawbacks.
The computational complexity of voxel encodings grows cubically with the
resolution, thus limiting such representations to low-resolution 3D
reconstruction. In an attempt to solve this problem, point cloud
representations have been proposed. Although point clouds are more efficient
than voxel representations as they only cover surfaces rather than volumes,
they do not encode detailed geometric information about relationships between
points. In this paper we propose a method to learn free-form deformations (FFD)
for the task of 3D reconstruction from a single image. By learning to deform
points sampled from a high-quality mesh, our trained model can be used to
produce arbitrarily dense point clouds or meshes with fine-grained geometry. We
evaluate our proposed framework on both synthetic and real-world data and
achieve state-of-the-art results on point-cloud and volumetric metrics.
Additionally, we qualitatively demonstrate its applicability to label
transfer for 3D semantic segmentation. Comment: 16 pages, 7 figures, 3 table
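The free-form deformation the abstract builds on is the classical trivariate Bernstein FFD, in which a lattice of control points warps every point inside its box. A minimal NumPy sketch of the deformation itself (the paper's network would predict the lattice displacements; here the lattice is simply given):

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def ffd(points, lattice):
    """Deform points in the unit cube with a Bernstein FFD control lattice.

    points:  (N, 3) array with coordinates in [0, 1].
    lattice: (l+1, m+1, n+1, 3) array of control-point positions.
    """
    l, m, n = (s - 1 for s in lattice.shape[:3])
    out = np.zeros_like(points)
    for i in range(l + 1):
        for j in range(m + 1):
            for k in range(n + 1):
                w = (bernstein(l, i, points[:, 0])
                     * bernstein(m, j, points[:, 1])
                     * bernstein(n, k, points[:, 2]))
                out += w[:, None] * lattice[i, j, k]
    return out

# An identity lattice (control points on the regular grid) leaves points fixed.
grid = np.stack(np.meshgrid(*[np.linspace(0, 1, 3)] * 3, indexing="ij"), axis=-1)
pts = np.random.rand(100, 3)
assert np.allclose(ffd(pts, grid), pts)
```

Because the Bernstein weights sum to one, translating every control point translates all deformed points by the same amount, which is why learning small lattice offsets yields smooth, fine-grained deformations of an arbitrarily dense point set.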
Inner Space Preserving Generative Pose Machine
Image-based generative methods, such as generative adversarial networks
(GANs), have already been able to generate realistic images with considerable
control over context, especially when they are conditioned. However, most
successful frameworks share a common procedure that performs an image-to-image
translation while leaving the pose of figures in the image untouched. When the
objective is reposing a figure in an image while preserving the rest of the
image, the state of the art mainly assumes a single rigid body with a simple
background and limited pose shift, which can hardly be extended to images under
everyday settings. In this paper, we introduce an image "inner space" preserving model
that assigns an interpretable low-dimensional pose descriptor (LDPD) to an
articulated figure in the image. Figure reposing is then generated by passing
the LDPD and the original image through multi-stage augmented hourglass
networks in a conditional GAN structure, called inner space preserving
generative pose machine (ISP-GPM). We evaluated ISP-GPM on reposing human
figures, which are highly articulated and exhibit wide variation. Testing a
state-of-the-art pose estimator on our reposed dataset gave an accuracy of over
80% on the PCK0.5 metric. The results also show that our ISP-GPM is able to
preserve the background with high accuracy while reasonably recovering the area
occluded by the figure being reposed. Comment: http://www.northeastern.edu/ostadabbas/2018/07/23/inner-space-preserving-generative-pose-machine
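A common way to condition a convolutional generator on a vector code such as the LDPD, sketched here with made-up shapes and without claiming it matches ISP-GPM's exact architecture, is to tile the descriptor over the image plane and append it as extra input channels:

```python
import numpy as np

def condition_image_on_descriptor(image, ldpd):
    """Tile a low-dimensional pose descriptor spatially and concatenate it
    with the image channels, so every convolutional window sees both the
    local appearance and the target pose code.

    image: (H, W, C) array; ldpd: (D,) vector.
    Returns an (H, W, C + D) array.
    """
    h, w, _ = image.shape
    tiled = np.broadcast_to(ldpd, (h, w, ldpd.shape[0]))
    return np.concatenate([image, tiled], axis=-1)

img = np.random.rand(64, 64, 3)
pose = np.random.rand(17)             # hypothetical descriptor length
x = condition_image_on_descriptor(img, pose)
print(x.shape)  # (64, 64, 20)
```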
Scene Segmentation Driven by Deep Learning and Surface Fitting
This paper proposes a joint color and depth segmentation scheme exploiting geometric cues together with a learning stage. The approach starts from an initial over-segmentation based on spectral clustering. The input data is also fed to a Convolutional Neural Network (CNN), producing a per-pixel descriptor vector for each scene sample. An iterative merging procedure is then used to recombine the segments into the regions corresponding to the various objects and surfaces. The proposed algorithm starts by considering all adjacent segments and computing a similarity metric based on the CNN features. The pairs of segments with the highest similarity are considered for merging. Finally, the algorithm uses a NURBS surface fitting scheme on the segments to determine whether the selected pairs correspond to a single surface. The comparison with state-of-the-art methods shows that the proposed method provides an accurate and reliable scene segmentation.
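The similarity-driven merging step can be sketched as follows. This toy version scores adjacent segment pairs by the cosine similarity of per-segment mean descriptors and propagates merges with a union-find; the threshold is illustrative and the NURBS fitting check is omitted:

```python
import numpy as np

def merge_most_similar(features, adjacency, threshold=0.9):
    """One round of similarity-driven merging over adjacent segments.

    features:  (S, D) per-segment mean CNN descriptor.
    adjacency: set of (i, j) index pairs of adjacent segments.
    Returns a representative label for each segment.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    labels = list(range(len(features)))

    def find(a):                      # union-find with path compression
        while labels[a] != a:
            labels[a] = labels[labels[a]]
            a = labels[a]
        return a

    for i, j in sorted(adjacency):
        if unit[i] @ unit[j] > threshold:   # cosine similarity test
            labels[find(i)] = find(j)
    return [find(s) for s in range(len(features))]

# Segments 0 and 1 have nearly parallel descriptors; segment 2 does not.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
print(merge_most_similar(feats, {(0, 1), (1, 2)}))
```

In the full scheme each surviving candidate pair would additionally be accepted only if a single NURBS surface fits both segments well.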
Knowledge transfer for scene-specific motion prediction
Given a single frame of a video, humans can not only interpret the content of the scene, but also forecast the near future. This ability is mostly driven by their rich prior knowledge about the visual world, both in terms of (i) the dynamics of moving agents and (ii) the semantics of the scene. In this work we exploit the interplay between these two key elements to predict scene-specific motion patterns. First, we extract patch descriptors encoding the probability of moving to the adjacent patches, and the probability of being in that particular patch or changing behavior. Then, we introduce a Dynamic Bayesian Network which exploits this scene-specific knowledge for trajectory prediction. Experimental results demonstrate that our method is able to accurately predict trajectories and transfer predictions to a novel scene characterized by similar elements.
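The patch-transition idea can be illustrated with a plain Markov rollout over patch indices; this is a deliberately simplified stand-in, not the paper's Dynamic Bayesian Network:

```python
import numpy as np

def rollout(transition, start, steps, rng):
    """Sample a trajectory of patch indices from per-patch transition
    probabilities -- a toy version of scene-specific dynamics.

    transition: (P, P) row-stochastic matrix over P patches.
    """
    path = [start]
    for _ in range(steps):
        path.append(rng.choice(transition.shape[0], p=transition[path[-1]]))
    return path

# Three patches: patch 0 always moves to 1, patch 1 to 2, patch 2 stays.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
print(rollout(T, 0, 4, np.random.default_rng(0)))  # [0, 1, 2, 2, 2]
```

Transferring predictions to a novel scene then amounts to reusing transition statistics learned on patches with similar appearance and semantics.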
Towards Semantic Segmentation Using Ratio Unpooling
This paper presents the concept of Ratio Unpooling as a
means of improving the performance of an Encoder-Decoder Convolutional
Neural Network (CNN) when applied to Semantic Segmentation.
Ratio Unpooling allows for 4 times the amount of positional information
to be carried through the network resulting in more precise border definition
and more resilient handling of unusual conditions such as heavy
shadows when compared to Switch Unpooling. Applied here as a proof-of-concept
to a simple implementation of SegNet which has been retrained
on a cropped and resized version of the Cityscapes dataset, Ratio Unpooling
increases the Mean Intersection over Union (IoU) performance
by around 5-6% on both the KITTI and modified Cityscapes datasets, a
greater gain than that from Monte Carlo Dropout, achieved at a fraction of the
cost.
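One plausible reading of Ratio Unpooling, sketched here without claiming it is the paper's exact definition, stores each input cell's fraction of its 2x2 window maximum (four numbers per window, versus the single argmax index kept by Switch Unpooling) and redistributes activations in proportion to those fractions on the way back up:

```python
import numpy as np

def ratio_pool(x):
    """2x2 max-pool that also keeps each cell's value as a fraction of the
    window maximum. x: (H, W) with H, W even. Returns (pooled, ratios)."""
    h, w = x.shape
    win = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)  # 2x2 windows
    pooled = win.max(axis=(2, 3))
    ratios = win / np.clip(pooled[:, :, None, None], 1e-12, None)
    return pooled, ratios

def ratio_unpool(pooled, ratios):
    """Spread each pooled activation back over its 2x2 window in
    proportion to the stored ratios."""
    h2, w2 = pooled.shape
    win = pooled[:, :, None, None] * ratios
    return win.transpose(0, 2, 1, 3).reshape(h2 * 2, w2 * 2)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
p, r = ratio_pool(x)
print(p)                   # [[4.]]
print(ratio_unpool(p, r))  # the original 2x2 window is recovered exactly
```

Because all four ratios are kept, an unchanged activation reconstructs its window exactly, whereas switch unpooling recovers only the position of the maximum and zeros elsewhere.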
Deep Shape from a Low Number of Silhouettes
Despite strong progress in the field of 3D reconstruction from multiple views, holes in objects, transparency of objects, and textureless scenes continue to be open challenges. On the other hand, silhouette-based reconstruction techniques ease the dependency of 3D reconstruction on image pixels but need a large number of silhouettes to be available from multiple views. In this paper, a novel end-to-end pipeline is proposed to produce high-quality reconstructions from a low number of silhouettes, the core of which is a deep shape reconstruction architecture. Evaluations on ShapeNet [1] show good reconstruction quality compared with ground truth.
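The silhouette-only baseline that such a pipeline improves on is the classical visual hull: a voxel survives only if it projects inside every available silhouette. A toy version with axis-aligned orthographic views:

```python
import numpy as np

def carve(silhouettes):
    """Visual-hull carving with axis-aligned orthographic projections.

    silhouettes: dict mapping an axis (0, 1, or 2) to the 2D boolean mask
    obtained by projecting the object along that axis.
    Returns the largest occupancy grid consistent with every silhouette.
    """
    n = next(iter(silhouettes.values())).shape[0]
    grid = np.ones((n, n, n), dtype=bool)
    for axis, sil in silhouettes.items():
        # Keep a voxel only if its projection along `axis` lies in the mask.
        grid &= np.expand_dims(sil, axis)
    return grid

# A 2x2x2 cube inside a 4x4x4 grid is recovered from its three silhouettes.
cube = np.zeros((4, 4, 4), dtype=bool)
cube[1:3, 1:3, 1:3] = True
sils = {a: cube.any(axis=a) for a in range(3)}
print(carve(sils).sum())  # 8
```

With few views this intersection badly over-estimates concave shapes, which is the gap a learned shape prior is meant to close.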
Learning common and specific features for RGB-D semantic segmentation with deconvolutional networks
© Springer International Publishing AG 2016. In this paper, we tackle the problem of RGB-D semantic segmentation of indoor images. We take advantage of deconvolutional networks, which can predict pixel-wise class labels, and develop a new structure for deconvolution of multiple modalities. We propose a novel feature transformation network to bridge the convolutional and deconvolutional networks. In the feature transformation network, we correlate the two modalities by discovering common features between them, and characterize each modality by discovering modality-specific features. With the common features, we not only closely correlate the two modalities but also allow them to borrow features from each other to enhance the representation of shared information. With the specific features, we capture the visual patterns that are only visible in one modality. The proposed network achieves competitive segmentation accuracy on the NYU depth datasets V1 and V2.
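The common/specific split can be illustrated with plain linear projections standing in for the learned transformation network; every dimension and weight below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

D, Dc, Ds = 8, 4, 4                        # input, common, specific widths
W_common = rng.standard_normal((D, Dc))    # one projection shared by both streams
W_rgb = rng.standard_normal((D, Ds))       # RGB-only projection
W_depth = rng.standard_normal((D, Ds))     # depth-only projection

def transform(f_rgb, f_depth):
    """Split each modality's feature into a common part (same projection for
    both streams, letting each borrow the other's shared information) and a
    modality-specific part, then fuse for the deconvolution stage."""
    c_rgb, c_depth = f_rgb @ W_common, f_depth @ W_common
    s_rgb, s_depth = f_rgb @ W_rgb, f_depth @ W_depth
    fused_rgb = np.concatenate([c_rgb + c_depth, s_rgb])     # shared + own
    fused_depth = np.concatenate([c_rgb + c_depth, s_depth])
    return fused_rgb, fused_depth

fr, fd = transform(rng.standard_normal(D), rng.standard_normal(D))
print(fr.shape, fd.shape)  # (8,) (8,)
```

The two fused vectors agree on their common half, which is exactly the "borrowing" the abstract describes, while their specific halves stay modality-private.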