Depth Reconstruction of Translucent Objects from a Single Time-of-Flight Camera using Deep Residual Networks
We propose a novel approach to recovering the depth of translucent objects from a
single time-of-flight (ToF) depth camera using deep residual networks. When
recording translucent objects with a ToF depth camera, the measured depth
values are severely contaminated by complex light interactions with the
surrounding environment. While existing methods have suggested new capture systems
or developed depth distortion models, their solutions remain impractical
because of strict assumptions or heavy computational complexity. In this paper,
we adopt the deep residual networks for modeling the ToF depth distortion
caused by translucency. To fully utilize both the local and semantic
information of objects, multi-scale patches are used to predict the depth
value. Based on quantitative and qualitative evaluation on our benchmark
database, we show the effectiveness and robustness of the proposed algorithm.
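To make the multi-scale patch idea concrete, below is a minimal PyTorch sketch
of a residual regressor that fuses patches of several sizes around one pixel and
predicts its corrected depth. The patch sizes, channel widths and class names
are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResBlock(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.conv1 = nn.Conv2d(c, c, 3, padding=1)
            self.conv2 = nn.Conv2d(c, c, 3, padding=1)

        def forward(self, x):
            # identity skip connection: the core of a residual network
            return F.relu(x + self.conv2(F.relu(self.conv1(x))))

    class MultiScaleDepthNet(nn.Module):
        """Fuses patches cropped at several scales around one pixel of the
        raw ToF depth map and regresses a corrected depth for that pixel."""
        def __init__(self, num_scales=3, width=32):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Sequential(nn.Conv2d(1, width, 3, padding=1), nn.ReLU(),
                              ResBlock(width))
                for _ in range(num_scales))
            self.head = nn.Linear(width * num_scales, 1)

        def forward(self, patches):  # patches: list of (B, 1, s_i, s_i) crops
            feats = [b(p).mean(dim=(2, 3)) for b, p in zip(self.branches, patches)]
            return self.head(torch.cat(feats, dim=1))

    net = MultiScaleDepthNet()
    patches = [torch.randn(4, 1, s, s) for s in (8, 16, 32)]  # small to large context
    print(net(patches).shape)  # torch.Size([4, 1]): one corrected depth per sample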
Capsule Networks with Max-Min Normalization
Capsule Networks (CapsNet) use the Softmax function to convert the logits of
the routing coefficients into a set of normalized values that signify the
assignment probabilities between capsules in adjacent layers. We show that the
use of Softmax prevents capsule layers from forming optimal couplings between
lower and higher-level capsules. Softmax constrains the dynamic range of the
routing coefficients and leads to probabilities that remain mostly uniform
after several routing iterations. Instead, we propose the use of Max-Min
normalization. Max-Min performs a scale-invariant normalization of the logits
that allows each lower-level capsule to take on an independent value,
constrained only by the bounds of normalization. Max-Min provides consistent
improvement in test accuracy across five datasets and allows more routing
iterations without a decrease in network performance. A single CapsNet trained
using Max-Min achieves an improved test error of 0.20% on the MNIST dataset.
With a simple 3-model majority vote, we achieve a test error of 0.17% on MNIST.
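The normalization itself is easy to state: each row of routing logits is
rescaled to [0, 1] by its own minimum and maximum rather than pushed through
Softmax. A minimal sketch (the epsilon guard is an implementation assumption):

    import torch
    import torch.nn.functional as F

    def max_min_normalize(logits, dim=-1, eps=1e-7):
        """Rescale routing logits to [0, 1] per row. Unlike Softmax, this is
        scale-invariant and does not force the values onto a probability
        simplex, so each coupling can take an independent value."""
        lo = logits.min(dim=dim, keepdim=True).values
        hi = logits.max(dim=dim, keepdim=True).values
        return (logits - lo) / (hi - lo + eps)

    b = torch.tensor([[0.1, 0.2, 0.3, 0.4]])  # typical small routing logits
    print(F.softmax(b, dim=-1))  # ~[0.21, 0.24, 0.26, 0.29]: nearly uniform
    print(max_min_normalize(b))  # [0.00, 0.33, 0.67, 1.00]: full dynamic range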
Deep Spatial Pyramid: The Devil is Once Again in the Details
In this paper we show that by carefully making good choices for various
detailed but important factors in a visual recognition framework using deep
learning features, one can achieve a simple, efficient, yet highly accurate
image classification system. We first list five important factors, based on both
existing research and ideas proposed in this paper. These important detailed
factors include: 1) ℓ2 matrix normalization is more effective than no
normalization or ℓ2 vector normalization, 2) the proposed natural deep
spatial pyramid is very effective, and 3) a very small K (the number of Gaussian
components) in Fisher Vectors surprisingly achieves higher accuracy than the
normally used large values.
Along with other choices (convolutional activations and multiple scales), the
proposed DSP framework is not only intuitive and efficient, but also achieves
excellent classification accuracy on many benchmark datasets. For example,
DSP's accuracy on SUN397 is 59.78%, significantly higher than the previous
state-of-the-art (53.86%).
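To illustrate factor 1), the NumPy sketch below contrasts the two
normalizations. Taking the matrix ℓ2 norm to be the largest singular value is an
assumption here; the paper should be consulted for its exact definition.

    import numpy as np

    def l2_matrix_normalize(X):
        """Divide the whole d x N activation matrix by its matrix l2 norm
        (largest singular value), scaling all spatial locations jointly."""
        return X / (np.linalg.norm(X, ord=2) + 1e-12)

    def l2_vector_normalize(X):
        """Divide each column (one location's activation vector) by its own
        l2 norm, scaling every location independently."""
        return X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)

    X = np.random.rand(512, 196)  # e.g. d=512 channels over 14x14 locations
    print(np.linalg.norm(l2_matrix_normalize(X), ord=2))       # 1.0: global scale removed
    print(np.linalg.norm(l2_vector_normalize(X), axis=0)[:3])  # all 1.0: per-location scale removed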
SORT: Second-Order Response Transform for Visual Recognition
In this paper, we reveal the importance and benefits of introducing
second-order operations into deep neural networks. We propose a novel approach
named Second-Order Response Transform (SORT), which appends element-wise
product transform to the linear sum of a two-branch network module. A direct
advantage of SORT is to facilitate cross-branch response propagation, so that
each branch can update its weights based on the current status of the other
branch. Moreover, SORT augments the family of transform operations and
increases the nonlinearity of the network, making it possible to learn flexible
functions to fit the complicated distribution of feature space. SORT can be
applied to a wide range of network architectures, including a branched variant
of a chain-styled network and a residual network, with very lightweight
modifications. We observe consistent accuracy gains on both small (CIFAR10,
CIFAR100 and SVHN) and large (ILSVRC2012) datasets. In addition, SORT is very
efficient, as the extra computational overhead is less than 5%.
Comment: To appear in ICCV 2017 (10 pages, 4 figures).
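Read directly from the abstract, SORT replaces a two-branch sum y1 + y2 with
y1 + y2 + y1 * y2 (element-wise product). A minimal PyTorch sketch with
placeholder branch bodies; in practice the branches follow the host
architecture:

    import torch
    import torch.nn as nn

    class SORTBlock(nn.Module):
        """Two-branch module whose outputs are combined by a linear sum plus
        their element-wise product, so each branch's update depends on the
        other branch's current response."""
        def __init__(self, channels):
            super().__init__()
            def branch():  # placeholder branch body (an assumption)
                return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.BatchNorm2d(channels), nn.ReLU())
            self.branch1, self.branch2 = branch(), branch()

        def forward(self, x):
            y1, y2 = self.branch1(x), self.branch2(x)
            return y1 + y2 + y1 * y2  # second-order response transform

    block = SORTBlock(16)
    print(block(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])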
Deep CNNs Meet Global Covariance Pooling: Better Representation and Generalization
Compared with global average pooling in existing deep convolutional neural
networks (CNNs), global covariance pooling can capture richer statistics of
deep features, and thus has the potential to improve the representation and generalization
abilities of deep CNNs. However, integration of global covariance pooling into
deep CNNs brings two challenges: (1) robust covariance estimation given deep
features of high dimension and small sample size; (2) appropriate usage of
geometry of covariances. To address these challenges, we propose a global
Matrix Power Normalized COVariance (MPN-COV) Pooling. Our MPN-COV conforms to a
robust covariance estimator, well suited to the scenario of high dimension and
small sample size. It can also be regarded as a Power-Euclidean metric between
covariances, effectively exploiting their geometry. Furthermore, a global
Gaussian embedding network is proposed to incorporate first-order statistics
into MPN-COV. For fast training of MPN-COV networks, we implement an iterative
matrix square root normalization, avoiding the GPU-unfriendly eigen-decomposition
inherent in MPN-COV. Additionally, progressive 1x1 convolutions and group
convolution are introduced to compress covariance representations. The proposed
methods are highly modular, readily plugged into existing deep CNNs. Extensive
experiments are conducted on large-scale object classification, scene
categorization, fine-grained visual recognition and texture classification,
showing that our methods outperform their counterparts and obtain state-of-the-art
performance.
Comment: Accepted to IEEE TPAMI. Code is at http://peihuali.org/MPN-COV
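Below is a compact PyTorch sketch of covariance pooling with the matrix square
root (power 1/2) computed by coupled Newton-Schulz iterations, in the spirit of
the iterative normalization mentioned above; the trace pre-/post-compensation
and the iteration count are standard choices assumed here, not details taken
from the paper.

    import torch

    def mpn_cov(X, iters=5, eps=1e-10):
        """Matrix power (1/2) normalized covariance pooling without
        eigen-decomposition. X: (B, d, N) features, d channels at N locations."""
        B, d, N = X.shape
        Xc = X - X.mean(dim=2, keepdim=True)  # center per channel
        cov = Xc @ Xc.transpose(1, 2) / N     # (B, d, d) sample covariance
        # pre-normalize by the trace so Newton-Schulz converges, compensate after
        tr = cov.diagonal(dim1=1, dim2=2).sum(-1).clamp_min(eps)
        A = cov / tr.view(B, 1, 1)
        I = torch.eye(d, device=X.device)
        Y, Z = A, I.expand(B, d, d)
        for _ in range(iters):                # Y -> A^(1/2), Z -> A^(-1/2)
            T = 0.5 * (3.0 * I - Z @ Y)
            Y, Z = Y @ T, T @ Z
        return Y * tr.sqrt().view(B, 1, 1)    # square root of the covariance

    feats = torch.randn(2, 64, 49)            # 64 channels over a 7x7 map
    print(mpn_cov(feats).shape)               # torch.Size([2, 64, 64])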
Cross-convolutional-layer Pooling for Image Recognition
Recent studies have shown that a Deep Convolutional Neural Network (DCNN)
pretrained on a large image dataset can be used as a universal image
descriptor, and that doing so leads to impressive performance for a variety of
image classification tasks. Most of these studies adopt activations from a
single DCNN layer, usually the fully-connected layer, as the image
representation. In this paper, we propose a novel way to extract image
representations from two consecutive convolutional layers: one layer is
utilized for local feature extraction and the other serves as guidance to pool
the extracted features. By taking different viewpoints of convolutional layers,
we further develop two schemes to realize this idea. The first one directly
uses convolutional layers from a DCNN. The second one applies the pretrained
CNN to densely sampled image regions and treats the fully-connected activations
of each image region as convolutional feature activations. We then train
another convolutional layer on top of that as the pooling-guidance
convolutional layer. By applying our method to three popular visual
classification tasks, we find that the first scheme tends to perform better on
applications that require strong discrimination of subtle object patterns within
small regions, while the second excels in cases that require discrimination of
category-level patterns. Overall, the proposed method achieves superior
performance over existing ways of extracting image representations from a DCNN.
Comment: Fixed typos. Journal extension of arXiv:1411.7466. Accepted to IEEE
Transactions on Pattern Analysis and Machine Intelligence.
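The pooling step of the first scheme reduces to one matrix product: each channel
of the upper layer weights a sum over the lower layer's location-wise features.
A NumPy sketch, assuming the two layers' maps have already been resized or
cropped to the same N spatial locations:

    import numpy as np

    def cross_layer_pooling(F_low, F_up):
        """F_low: (N, d) local features, one d-dim vector per location.
        F_up:  (N, D) guidance activations at the same N locations.
        Row k of the result is sum_i F_up[i, k] * F_low[i], i.e. the local
        features pooled under guidance channel k; rows are then concatenated."""
        return (F_up.T @ F_low).reshape(-1)  # (D * d,) image representation

    F_low = np.random.rand(196, 256)  # e.g. 14x14 locations, 256-dim features
    F_up = np.random.rand(196, 64)    # 64 guidance channels, same locations
    print(cross_layer_pooling(F_low, F_up).shape)  # (16384,)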
MoNet: Moments Embedding Network
Bilinear pooling has been recently proposed as a feature encoding layer,
which can be used after the convolutional layers of a deep network, to improve
performance in multiple vision tasks. Different from conventional global
average pooling or a fully connected layer, bilinear pooling gathers 2nd order
information in a translation invariant fashion. However, a serious drawback of
this family of pooling layers is their dimensionality explosion. Approximate
pooling methods with compact properties have been explored towards resolving
this weakness. Additionally, recent results have shown that significant
performance gains can be achieved by adding 1st order information and applying
matrix normalization to regularize unstable higher order information. However,
combining compact pooling with matrix normalization and other order information
has not been explored until now. In this paper, we unify bilinear pooling and
the global Gaussian embedding layers through the empirical moment matrix. In
addition, we propose a novel sub-matrix square-root layer, which can be used to
normalize the output of the convolution layer directly and mitigate the
dimensionality problem with off-the-shelf compact pooling methods. Our
experiments on three widely used fine-grained classification datasets
illustrate that our proposed architecture, MoNet, can achieve similar or better
performance than the state-of-the-art G2DeNet. Furthermore, when combined with a
compact pooling technique, MoNet obtains comparable performance using encoded
features with 96% fewer dimensions.
Comment: Accepted to CVPR 2018.
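The unifying object is the empirical moment matrix: append a constant 1 to every
local feature and average the outer products, so the mean (1st moment) and the
raw 2nd moment live in a single matrix. A minimal NumPy sketch; the sub-matrix
square-root layer and compact pooling are omitted:

    import numpy as np

    def moment_matrix(X):
        """X: (N, d) local conv features -> (d+1, d+1) empirical moment matrix.
        M[0, 0] = 1, M[0, 1:] = mean of X (1st moment), and M[1:, 1:] is the
        raw 2nd moment X^T X / N used by bilinear pooling."""
        Xh = np.hstack([np.ones((X.shape[0], 1)), X])  # homogeneous coordinates
        return Xh.T @ Xh / X.shape[0]

    X = np.random.rand(196, 128)
    M = moment_matrix(X)
    print(M.shape, np.allclose(M[0, 1:], X.mean(axis=0)))  # (129, 129) True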
Learning Mid-Level Features and Modeling Neuron Selectivity for Image Classification
We now know that mid-level features can greatly enhance the performance of
image learning, but how to automatically learn the image features efficiently
and in an unsupervised manner is still an open question. In this paper, we
present a very efficient mid-level feature learning approach (MidFea), which
only involves simple operations such as k-means clustering, convolution,
pooling, vector quantization and random projection. We explain why this simple
method generates the desired features, and argue that there is no need to spend
much time in learning low-level feature extractors. Furthermore, to boost the
performance, we propose to model the neuron selectivity (NS) principle by
building an additional layer over the mid-level features before feeding the
features into the classifier. We show that the NS-layer learns
category-specific neurons with both bottom-up inference and top-down analysis,
and thus supports fast inference for a query image. We run extensive
experiments on several public databases to demonstrate that our approach can
achieve state-of-the-art performance for face recognition, gender
classification, age estimation and object categorization. In particular, we
demonstrate that our approach is more than an order of magnitude faster than
some recently proposed sparse-coding-based methods.
Comment: 19 pages, 14 figures.
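A compressed NumPy/SciPy/scikit-learn sketch of the pipeline's flavor (k-means
filters, convolution, pooling, random projection); the vector quantization step
and the NS-layer are omitted, and all sizes are toy assumptions:

    import numpy as np
    from scipy.signal import convolve2d
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    images = rng.random((20, 32, 32))  # toy grayscale images

    # 1) k-means on random patches learns low-level filters without supervision
    patches = np.stack([img[i:i + 6, j:j + 6].ravel()
                        for img in images
                        for i, j in rng.integers(0, 26, size=(30, 2))])
    filters = KMeans(n_clusters=16, n_init=4, random_state=0).fit(patches).cluster_centers_

    # 2) convolve with the learned filters, then 3x3 max-pool each response map
    def encode(img):
        maps = [convolve2d(img, f.reshape(6, 6), mode='valid') for f in filters]
        return np.array([m.reshape(9, 3, 9, 3).max(axis=(1, 3)) for m in maps])

    # 3) a random projection compresses the mid-level code
    feats = np.stack([encode(img).ravel() for img in images])  # (20, 16*9*9)
    P = rng.standard_normal((feats.shape[1], 64)) / np.sqrt(64)
    print((feats @ P).shape)  # (20, 64) mid-level features for a classifier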
Exploiting Image-trained CNN Architectures for Unconstrained Video Classification
We conduct an in-depth exploration of different strategies for doing event
detection in videos using convolutional neural networks (CNNs) trained for
image classification. We study different ways of performing spatial and
temporal pooling, feature normalization, choice of CNN layers as well as choice
of classifiers. Making judicious choices along these dimensions led to a very
significant increase in performance over more naive approaches that have been
used until now. We evaluate our approach on the challenging TRECVID MED'14
dataset with two popular CNN architectures pretrained on ImageNet. On this
MED'14 dataset, our methods, based entirely on image-trained CNN features, can
outperform several state-of-the-art non-CNN models. Our proposed late fusion of
CNN- and motion-based features can further increase the mean average precision
(mAP) on MED'14 from 34.95% to 38.74%. The fusion approach achieves the
state-of-the-art classification performance on the challenging UCF-101 dataset.
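A minimal sketch of score-level late fusion of the two streams; the per-model
min-max rescaling and the equal weights are assumptions, not necessarily the
paper's exact fusion rule:

    import numpy as np

    def late_fusion(score_lists, weights=None):
        """Weighted average of per-model classifier scores, each rescaled to
        [0, 1] first, e.g. an appearance (CNN) stream and a motion stream.
        score_lists: list of (num_videos, num_classes) score arrays."""
        weights = weights or [1.0 / len(score_lists)] * len(score_lists)
        fused = np.zeros_like(score_lists[0], dtype=float)
        for w, s in zip(weights, score_lists):
            s = (s - s.min()) / (s.max() - s.min() + 1e-12)
            fused += w * s
        return fused

    cnn_scores = np.random.rand(5, 101)     # appearance-stream scores (toy)
    motion_scores = np.random.rand(5, 101)  # motion-feature scores (toy)
    print(late_fusion([cnn_scores, motion_scores]).argmax(axis=1))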
SIGNet: Semantic Instance Aided Unsupervised 3D Geometry Perception
Unsupervised learning for geometric perception (depth, optical flow, etc.) is
of great interest to autonomous systems. Recent works on unsupervised learning
have made considerable progress on perceiving geometry; however, they usually
ignore the coherence of objects and perform poorly in dark and noisy
environments. In contrast, supervised learning algorithms, which are robust,
require large labeled geometric datasets. This paper introduces SIGNet,
a novel framework that provides robust geometry perception without requiring
geometrically informative labels. Specifically, SIGNet integrates semantic
information to make depth and flow predictions consistent with objects and
robust to low lighting conditions. SIGNet is shown to improve upon the
state-of-the-art unsupervised learning for depth prediction by 30% (in squared
relative error). In particular, SIGNet improves the dynamic object class
performance by 39% in depth prediction and 29% in flow prediction. Our code
will be made available at https://github.com/mengyuest/SIGNet.
Comment: To appear at CVPR 2019.
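One plausible way to make depth "consistent with objects" is to penalize depth
gradients only where the semantic label does not change, so depth stays coherent
inside each object. The sketch below is an illustrative guess at such a term,
not SIGNet's actual objective:

    import torch

    def semantic_aware_smoothness(depth, seg):
        """depth: (B, 1, H, W) predicted depth; seg: (B, H, W) integer labels.
        Penalizes horizontal/vertical depth gradients only between neighboring
        pixels that share a semantic label."""
        dzdx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
        dzdy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
        same_x = (seg[:, :, 1:] == seg[:, :, :-1]).unsqueeze(1).float()
        same_y = (seg[:, 1:, :] == seg[:, :-1, :]).unsqueeze(1).float()
        return (dzdx * same_x).mean() + (dzdy * same_y).mean()

    depth = torch.rand(2, 1, 64, 64, requires_grad=True)
    seg = torch.randint(0, 5, (2, 64, 64))
    print(semantic_aware_smoothness(depth, seg))  # differentiable scalar loss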