A 4D Light-Field Dataset and CNN Architectures for Material Recognition
We introduce a new light-field dataset of materials, and take advantage of
the recent success of deep learning to perform material recognition on the 4D
light-field. Our dataset contains 12 material categories, each with 100 images
taken with a Lytro Illum, from which we extract about 30,000 patches in total.
To the best of our knowledge, this is the first mid-size dataset for
light-field images. Our main goal is to investigate whether the additional
information in a light-field (such as multiple sub-aperture views and
view-dependent reflectance effects) can aid material recognition. Since
recognition networks have not been trained on 4D images before, we propose and
compare several novel CNN architectures to train on light-field images. In our
experiments, the best performing CNN architecture achieves a 7% boost compared
with 2D image classification (70% to 77%). These results constitute important
baselines that can spur further research in the use of CNNs for light-field
applications. Upon publication, our dataset also enables other novel
applications of light-fields, including object detection, image segmentation
and view interpolation.
Comment: European Conference on Computer Vision (ECCV) 2016
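As a hedged illustration of one way a CNN can consume a 4D light-field patch (not necessarily one of the paper's architectures), the sketch below simply stacks the sub-aperture views along the channel axis of an ordinary 2D CNN; the 7x7 angular resolution, layer sizes and class count are assumptions.

    import torch
    import torch.nn as nn

    class ViewStackCNN(nn.Module):
        def __init__(self, num_views=49, num_classes=12):   # assume a 7x7 grid of views, 12 materials
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3 * num_views, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(128, num_classes)

        def forward(self, lf):                      # lf: (B, num_views, 3, H, W)
            b, v, c, h, w = lf.shape
            x = lf.reshape(b, v * c, h, w)          # stack sub-aperture views as channels
            x = self.features(x).flatten(1)
            return self.classifier(x)

    patches = torch.randn(2, 49, 3, 64, 64)         # toy batch of light-field patches
    logits = ViewStackCNN()(patches)                # -> (2, 12) class scores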
Collaborative Layer-wise Discriminative Learning in Deep Neural Networks
Intermediate features at different layers of a deep neural network are known
to be discriminative for visual patterns of different complexities. However,
most existing works ignore such cross-layer heterogeneities when classifying
samples of different complexities. For example, if a training sample has
already been correctly classified at a specific layer with high confidence, we
argue that it is unnecessary to force the remaining layers to classify this
sample correctly; a better strategy is to encourage those layers to focus on
other samples.
In this paper, we propose a layer-wise discriminative learning method to
enhance the discriminative capability of a deep network by allowing its layers
to work collaboratively for classification. Towards this target, we introduce
multiple classifiers on top of multiple layers. Each classifier not only tries
to correctly classify the features from its input layer, but also coordinates
with other classifiers to jointly maximize the final classification
performance. Guided by the other companion classifiers, each classifier learns
to concentrate on certain training examples and boosts the overall performance.
Allowing for end-to-end training, our method can be conveniently embedded into
state-of-the-art deep networks. Experiments with multiple popular deep
networks, including Network in Network, GoogLeNet and VGGNet, on object
classification benchmarks of various scales, including CIFAR100, MNIST and
ImageNet, and
scene classification benchmarks, including MIT67, SUN397 and Places205,
demonstrate the effectiveness of our method. In addition, we analyze the
relationship between the proposed method and classical conditional random
fields models.
Comment: To appear in ECCV 2016; may be subject to minor changes before the
camera-ready version.
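A minimal sketch of the general idea, not the paper's exact formulation: auxiliary classifiers are attached to several intermediate layers and trained jointly, and each sample's loss at a layer is down-weighted once a companion classifier is already confident on it. The module layout, dimensions and weighting rule are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerwiseClassifiers(nn.Module):
        """Backbone blocks with an auxiliary linear classifier after each block."""
        def __init__(self, backbone_blocks, feat_dims, num_classes):
            super().__init__()
            self.blocks = nn.ModuleList(backbone_blocks)
            self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in feat_dims])

        def forward(self, x):
            logits = []
            for block, head in zip(self.blocks, self.heads):
                x = block(x)
                logits.append(head(x.mean(dim=(2, 3))))    # global average pooling
            return logits

    def collaborative_loss(logits_list, target):
        """Each head's loss focuses on samples its companions are not yet sure about."""
        loss = 0.0
        prev_conf = torch.zeros_like(target, dtype=torch.float)
        for logits in logits_list:
            ce = F.cross_entropy(logits, target, reduction='none')
            loss = loss + ((1.0 - prev_conf) * ce).mean()
            with torch.no_grad():                          # confidence of earlier heads
                conf = F.softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)
                prev_conf = torch.maximum(prev_conf, conf)
        return loss

    # toy usage of the loss with two layer-wise heads on a 10-class problem
    logits_list = [torch.randn(8, 10), torch.randn(8, 10)]
    target = torch.randint(0, 10, (8,))
    print(collaborative_loss(logits_list, target).item())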
Learning Free-Form Deformations for 3D Object Reconstruction
Representing 3D shape in deep learning frameworks in an accurate, efficient
and compact manner still remains an open challenge. Most existing work
addresses this issue by employing voxel-based representations. While these
approaches benefit greatly from advances in computer vision by generalizing 2D
convolutions to the 3D setting, they also have several considerable drawbacks.
The computational complexity of voxel-encodings grows cubically with the
resolution thus limiting such representations to low-resolution 3D
reconstruction. In an attempt to solve this problem, point cloud
representations have been proposed. Although point clouds are more efficient
than voxel representations as they only cover surfaces rather than volumes,
they do not encode detailed geometric information about relationships between
points. In this paper we propose a method to learn free-form deformations (FFD)
for the task of 3D reconstruction from a single image. By learning to deform
points sampled from a high-quality mesh, our trained model can be used to
produce arbitrarily dense point clouds or meshes with fine-grained geometry. We
evaluate our proposed framework on both synthetic and real-world data and
achieve state-of-the-art results on point-cloud and volumetric metrics.
Additionally, we qualitatively demonstrate its applicability to label
transfer for 3D semantic segmentation.
Comment: 16 pages, 7 figures, 3 tables
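For illustration, the free-form deformation step itself can be written as a linear map from a displaced control lattice to the template points through a tensor-product Bernstein basis. The sketch below assumes template points in the unit cube (so their lattice coordinates equal their positions) and uses random offsets as a stand-in for the network's predicted control-point displacements.

    import numpy as np
    from scipy.special import comb

    def bernstein_basis(stu, degree=3):
        """stu: (N, 3) lattice coordinates in [0, 1]. Returns (N, (degree + 1) ** 3)."""
        d, axes = degree, []
        for k in range(3):
            t = stu[:, k:k + 1]                                      # (N, 1)
            i = np.arange(d + 1)[None, :]                            # (1, d + 1)
            axes.append(comb(d, i) * (1 - t) ** (d - i) * t ** i)    # 1D Bernstein values
        B = np.einsum('ni,nj,nk->nijk', axes[0], axes[1], axes[2])   # tensor product
        return B.reshape(stu.shape[0], -1)

    # toy example: deform template points sampled in the unit cube with a 4x4x4 lattice
    points = np.random.rand(1024, 3)
    lattice = np.stack(np.meshgrid(*[np.linspace(0, 1, 4)] * 3,
                                   indexing='ij'), axis=-1).reshape(-1, 3)
    B = bernstein_basis(points)                       # (1024, 64) fixed basis matrix
    delta_p = 0.05 * np.random.randn(*lattice.shape)  # stand-in for predicted offsets
    deformed = B @ (lattice + delta_p)                # (1024, 3) deformed point cloud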
FU-net: Multi-class Image Segmentation Using Feedback Weighted U-net
In this paper, we present a generic deep convolutional neural network (DCNN)
for multi-class image segmentation. It is based on a well-established
supervised end-to-end DCNN model, known as U-net. U-net is first modified by
adding widely used batch normalization and residual blocks (named BRU-net) to
improve the efficiency of model training. Based on BRU-net, we further
introduce a dynamically weighted cross-entropy loss function. The weighting
scheme is calculated based on the pixel-wise prediction accuracy during the
training process. Assigning higher weights to pixels with lower segmentation
accuracies enables the network to learn more from poorly predicted image
regions. Our method is named feedback weighted U-net (FU-net). We have
evaluated our method on T1-weighted brain MRI for the segmentation of the
midbrain and substantia nigra, where the numbers of pixels in the different
classes are extremely unbalanced. Measured by the Dice coefficient, the
proposed FU-net outperforms BRU-net and U-net with statistical significance,
especially when only a small number of training examples are available. The
code is publicly available on GitHub (https://github.com/MinaJf/FU-net).
Comment: Accepted for publication at the International Conference on Image and
Graphics (ICIG 2019).
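A minimal sketch of a feedback-weighted cross-entropy in the spirit of FU-net, in which pixels with a low predicted probability for their true class receive larger weights; the exact weighting scheme used in the paper may differ.

    import torch
    import torch.nn.functional as F

    def feedback_weighted_ce(logits, target):
        """logits: (B, C, H, W), target: (B, H, W) integer class labels."""
        probs = F.softmax(logits, dim=1)
        p_true = probs.gather(1, target.unsqueeze(1)).squeeze(1)   # prob of the true class
        weights = (1.0 - p_true).detach()                          # poorly predicted -> high weight
        ce = F.cross_entropy(logits, target, reduction='none')     # per-pixel loss (B, H, W)
        return (weights * ce).sum() / weights.sum().clamp_min(1e-8)

    logits = torch.randn(2, 3, 64, 64, requires_grad=True)         # toy multi-class prediction maps
    target = torch.randint(0, 3, (2, 64, 64))
    loss = feedback_weighted_ce(logits, target)
    loss.backward()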
Inner Space Preserving Generative Pose Machine
Image-based generative methods, such as generative adversarial networks
(GANs), can already generate realistic images with considerable control over
their content, especially when they are conditioned. However, most successful
frameworks share a common procedure that performs an image-to-image
translation while leaving the pose of the figures in the image untouched. When
the objective is to repose a figure in an image while preserving the rest of
the image, the state of the art mainly assumes a single rigid body with a
simple background and limited pose shift, which can hardly be extended to
images in typical settings. In this paper, we introduce an image "inner space" preserving model
that assigns an interpretable low-dimensional pose descriptor (LDPD) to an
articulated figure in the image. Figure reposing is then generated by passing
the LDPD and the original image through multi-stage augmented hourglass
networks in a conditional GAN structure, called inner space preserving
generative pose machine (ISP-GPM). We evaluated ISP-GPM on reposing human
figures, which are highly articulated and exhibit a wide range of pose
variations. Testing a state-of-the-art pose estimator on our reposed dataset
gave an accuracy of over 80% on the PCK0.5 metric. The results also show that
ISP-GPM preserves the background with high accuracy while reasonably
recovering the area occluded by the figure being reposed.
Comment: http://www.northeastern.edu/ostadabbas/2018/07/23/inner-space-preserving-generative-pose-machine
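As a hedged illustration of the conditioning step only (ISP-GPM's multi-stage hourglass generator is far more elaborate), a low-dimensional pose descriptor can be tiled spatially and concatenated with the input image before being fed to a conditional generator; the descriptor size below is an assumption.

    import torch

    def condition_on_pose(image, pose_descriptor):
        """image: (B, 3, H, W); pose_descriptor: (B, D) -> (B, 3 + D, H, W)."""
        b, _, h, w = image.shape
        pose_maps = pose_descriptor[:, :, None, None].expand(-1, -1, h, w)  # tile over space
        return torch.cat([image, pose_maps], dim=1)

    x = condition_on_pose(torch.randn(2, 3, 128, 128), torch.randn(2, 34))  # e.g. 17 2D joints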
Scene Segmentation Driven by Deep Learning and Surface Fitting
This paper proposes a joint color and depth segmentation scheme that exploits together geometric cues and a learning stage. The approach starts from an initial over-segmentation based on spectral clustering. The input data are also fed to a Convolutional Neural Network (CNN), producing a per-pixel descriptor vector for each scene sample. An iterative merging procedure is then used to recombine the segments into the regions corresponding to the various objects and surfaces. The proposed algorithm starts by considering all the adjacent segments and computing a similarity metric according to the CNN features. The segment pairs with the highest similarity are considered for merging. Finally, the algorithm applies a NURBS surface fitting scheme to the segments to determine whether the selected pairs correspond to a single surface. The comparison with state-of-the-art methods shows that the proposed method provides accurate and reliable scene segmentation.
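A hedged sketch of the merging criterion described above: per-segment descriptors are obtained by averaging the CNN's per-pixel features, and adjacent segment pairs are ranked by cosine similarity (the NURBS surface fitting check is omitted); all names and shapes here are assumptions.

    import numpy as np

    def rank_merge_candidates(features, labels, adjacency):
        """features: (H, W, D) per-pixel CNN descriptors; labels: (H, W) segment ids;
        adjacency: iterable of (seg_a, seg_b) pairs of adjacent segments."""
        descriptors = {}
        for seg in np.unique(labels):
            d = features[labels == seg].mean(axis=0)        # average descriptor of the segment
            descriptors[seg] = d / (np.linalg.norm(d) + 1e-8)
        scored = [(float(descriptors[a] @ descriptors[b]), a, b) for a, b in adjacency]
        return sorted(scored, reverse=True)                 # most similar pairs first

    feats = np.random.rand(64, 64, 16)
    labels = np.random.randint(0, 4, (64, 64))
    print(rank_merge_candidates(feats, labels, [(0, 1), (1, 2), (2, 3)])[:2])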
Knowledge transfer for scene-specific motion prediction
Given a single frame of a video, humans can not only interpret the content of the scene, but also forecast the near future. This ability is mostly driven by their rich prior knowledge about the visual world, both in terms of (i) the dynamics of moving agents and (ii) the semantics of the scene. In this work we exploit the interplay between these two key elements to predict scene-specific motion patterns. First, we extract patch descriptors encoding the probability of moving to adjacent patches, and the probability of being in that particular patch or changing behavior. Then, we introduce a Dynamic Bayesian Network which exploits this scene-specific knowledge for trajectory prediction. Experimental results demonstrate that our method is able to accurately predict trajectories and transfer predictions to a novel scene characterized by similar elements.
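As a toy illustration of scene-specific patch transitions (not the paper's Dynamic Bayesian Network), the sketch below rolls a trajectory forward by sampling, at each step, one of the eight neighbouring patches or the current patch according to per-patch transition probabilities; the grid size and probabilities are placeholders.

    import numpy as np

    MOVES = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0),
             (0, 1), (1, -1), (1, 0), (1, 1)]

    def rollout(transition, start, steps, rng=np.random.default_rng(0)):
        """transition: (H, W, 9) probabilities over MOVES for each patch."""
        h, w, _ = transition.shape
        path, (r, c) = [start], start
        for _ in range(steps):
            k = rng.choice(9, p=transition[r, c])           # sample the next move
            r = int(np.clip(r + MOVES[k][0], 0, h - 1))
            c = int(np.clip(c + MOVES[k][1], 0, w - 1))
            path.append((r, c))
        return path

    probs = np.random.dirichlet(np.ones(9), size=(20, 20))  # stand-in scene-specific model
    print(rollout(probs, start=(10, 10), steps=15))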
What's the Point: Semantic Segmentation with Point Supervision
The semantic image segmentation task presents a trade-off between test time
accuracy and training-time annotation cost. Detailed per-pixel annotations
enable training accurate models but are very time-consuming to obtain, while
image-level class labels are an order of magnitude cheaper but result in less
accurate models. We take a natural step from image-level annotation towards
stronger supervision: we ask annotators to point to an object if one exists. We
incorporate this point supervision along with a novel objectness potential in
the training loss function of a CNN model. Experimental results on the PASCAL
VOC 2012 benchmark reveal that the combined effect of point-level supervision
and objectness potential yields an improvement of 12.9% mIOU over image-level
supervision. Further, we demonstrate that models trained with point-level
supervision are more accurate than models trained with image-level,
squiggle-level or full supervision given a fixed annotation budget.
Comment: ECCV (2016) submission
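A rough sketch of a point-supervised training loss in this spirit: supervised cross-entropy only at the annotated point locations, plus a term that pushes pixels with a low objectness prior towards the background class; the paper's exact objectness formulation differs, and the weighting below is an assumption.

    import torch
    import torch.nn.functional as F

    def point_supervised_loss(logits, point_labels, objectness, bg_class=0, alpha=1.0):
        """logits: (B, C, H, W); point_labels: (B, H, W) with -1 where unannotated;
        objectness: (B, H, W) prior probability that a pixel belongs to an object."""
        annotated = point_labels >= 0
        point_ce = F.cross_entropy(logits, point_labels.clamp_min(0), reduction='none')
        point_term = (point_ce * annotated).sum() / annotated.sum().clamp_min(1)
        log_probs = F.log_softmax(logits, dim=1)
        bg_term = -((1.0 - objectness) * log_probs[:, bg_class]).mean()  # low objectness -> background
        return point_term + alpha * bg_term

    logits = torch.randn(1, 21, 32, 32, requires_grad=True)      # 21 PASCAL VOC classes
    labels = torch.full((1, 32, 32), -1, dtype=torch.long)
    labels[0, 5, 7] = 12                                         # one annotated point
    loss = point_supervised_loss(logits, labels, torch.rand(1, 32, 32))
    loss.backward()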
Region-Based Semantic Segmentation with End-to-End Training
We propose a novel method for semantic segmentation, the task of labeling
each pixel in an image with a semantic class. Our method combines the
advantages of the two main competing paradigms. Methods based on region
classification offer proper spatial support for appearance measurements, but
typically operate in two separate stages, neither of which targets pixel labeling
performance at the end of the pipeline. More recent fully convolutional methods
are capable of end-to-end training for the final pixel labeling, but resort to
fixed patches as spatial support. We show how to modify modern region-based
approaches to enable end-to-end training for semantic segmentation. This is
achieved via a differentiable region-to-pixel layer and a differentiable
free-form Region-of-Interest pooling layer. Our method improves the state of
the art in class-average accuracy, reaching 64.0% on SIFT Flow and 49.9% on
PASCAL Context, and is particularly accurate at object boundaries.
Comment: ECCV 2016 camera-ready
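A hedged sketch of a region-to-pixel mapping of the kind described above: each pixel takes, per class, the maximum score over the candidate regions that contain it, which keeps the pipeline differentiable because gradients flow back to the selected region scores; the shapes and the max-based choice are assumptions, not the paper's exact layer.

    import torch

    def region_to_pixel(region_scores, region_masks):
        """region_scores: (R, C); region_masks: (R, H, W) boolean membership masks.
        Returns per-pixel class scores of shape (C, H, W)."""
        r, c = region_scores.shape
        expanded = region_scores[:, :, None, None].expand(r, c, *region_masks.shape[1:])
        neg_inf = torch.full_like(expanded, float('-inf'))
        masked = torch.where(region_masks[:, None], expanded, neg_inf)
        return masked.max(dim=0).values               # best region covering each pixel

    scores = torch.randn(5, 3, requires_grad=True)    # 5 candidate regions, 3 classes
    masks = torch.zeros(5, 16, 16, dtype=torch.bool)
    masks[0] = True                                   # region 0 covers the whole image
    masks[1:, :, :8] = True                           # the others cover the left half
    pixel_scores = region_to_pixel(scores, masks)     # (3, 16, 16)
    pixel_scores.sum().backward()                     # gradients reach the region scores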
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
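A minimal sketch of a perceptual (feature reconstruction) loss, assuming a recent torchvision: activations of a fixed ImageNet-pretrained VGG-16 are compared at one intermediate layer instead of raw pixels; the layer choice and use of a plain MSE here are illustrative rather than the paper's exact configuration.

    import torch
    import torch.nn.functional as F
    from torchvision.models import vgg16

    class PerceptualLoss(torch.nn.Module):
        def __init__(self, layer_index=16):                    # up to relu3_3 of VGG-16
            super().__init__()
            # downloads ImageNet-pretrained weights on first use
            self.features = vgg16(weights='IMAGENET1K_V1').features[:layer_index].eval()
            for p in self.features.parameters():
                p.requires_grad_(False)                        # the loss network stays fixed

        def forward(self, output, target):
            return F.mse_loss(self.features(output), self.features(target))

    loss_fn = PerceptualLoss()
    y_hat, y = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
    print(loss_fn(y_hat, y).item())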