High Frequency Residual Learning for Multi-Scale Image Classification
We present a novel high frequency residual learning framework, which leads to
a highly efficient multi-scale network (MSNet) architecture for mobile and
embedded vision problems. The architecture utilizes two networks: a low
resolution network to efficiently approximate low frequency components and a
high resolution network to learn high frequency residuals by reusing the
upsampled low resolution features. With a classifier calibration module, MSNet
can dynamically allocate computation resources during inference to achieve a
better speed and accuracy trade-off. We evaluate our methods on the challenging
ImageNet-1k dataset and observe consistent improvements over different base
networks. On ResNet-18 and MobileNet with alpha=1.0, MSNet gains 1.5% accuracy
over both architectures without increasing computations. On the more efficient
MobileNet with alpha=0.25, our method gains 3.8% accuracy with the same amount
of computations.
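The low-/high-frequency split that MSNet exploits can be illustrated with a toy decomposition (a minimal sketch of the general idea, not the paper's architecture; the average-pool/nearest-upsample operators and the factor of 2 are assumptions):

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool a (H, W) image by `factor` to approximate its low-frequency content."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(x, factor=2):
    """Nearest-neighbour upsampling back to the original resolution."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# A toy 8x8 "image".
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))

low = upsample(downsample(img))   # low-frequency approximation (low-resolution branch)
residual = img - low              # high-frequency residual (high-resolution branch)

# The two components sum exactly back to the input.
assert np.allclose(low + residual, img)
```

In MSNet terms, the cheap low-resolution network only needs to model `low`, while the high-resolution network learns the much sparser `residual` on top of the upsampled low-resolution features.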
Scale Optimization for Full-Image-CNN Vehicle Detection
Many state-of-the-art general object detection methods make use of shared
full-image convolutional features (as in Faster R-CNN). This achieves a
reasonable test-phase computation time while enjoying the discriminative power
provided by large Convolutional Neural Network (CNN) models. Such designs excel
on benchmarks which contain natural images but very unnatural distributions,
i.e., an unnaturally high frequency of the target classes and a bias towards a
"friendly" or "dominant" object scale. In this
paper we present further study of the use and adaptation of the Faster R-CNN
object detection method for datasets presenting natural scale distribution and
unbiased real-world object frequency. In particular, we show that better
alignment of the detector scale sensitivity to the extant distribution improves
vehicle detection performance. We do this both by modifying the selection of
Region Proposals and by using more scale-appropriate full-image convolutional
features within the CNN model. By selecting better scales in the
region proposal input and by combining feature maps through careful design of
the convolutional neural network, we improve performance on smaller objects. We
significantly increase detection AP for the KITTI dataset car class from 76.3%
on our baseline Faster R-CNN detector to 83.6% in our improved detector.
Comment: Accepted by 2017 IEEE Intelligent Vehicles Symposium (IV). Link:
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=799581
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution
In natural images, information is conveyed at different frequencies where
higher frequencies are usually encoded with fine details and lower frequencies
are usually encoded with global structures. Similarly, the output feature maps
of a convolution layer can also be seen as a mixture of information at
different frequencies. In this work, we propose to factorize the mixed feature
maps by their frequencies, and design a novel Octave Convolution (OctConv)
operation to store and process feature maps that vary spatially "slower" at a
lower spatial resolution, reducing both memory and computation cost. Unlike
existing multi-scale methods, OctConv is formulated as a single, generic,
plug-and-play convolutional unit that can be used as a direct replacement of
(vanilla) convolutions without any adjustments in the network architecture. It
is also orthogonal and complementary to methods that suggest better topologies
or reduce channel-wise redundancy like group or depth-wise convolutions. We
experimentally show that by simply replacing convolutions with OctConv, we can
consistently boost accuracy for both image and video recognition tasks, while
reducing memory and computational cost. An OctConv-equipped ResNet-152 can
achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
Comment: Accepted to ICCV 201
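The four information paths of an octave convolution can be sketched in plain NumPy for the 1x1 case (a simplified illustration: the channel-split ratio alpha and the average-pool/nearest-upsample choices follow the description above, but the restriction to 1x1 kernels is an assumption for brevity):

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def avg_pool2(x):
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    return x.repeat(2, axis=1).repeat(2, axis=2)

def octconv1x1(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """Four information paths of an octave convolution (1x1 case):
    high->high, low->high (upsampled), high->low (pooled), low->low."""
    y_h = conv1x1(x_h, w_hh) + upsample2(conv1x1(x_l, w_lh))
    y_l = conv1x1(x_l, w_ll) + conv1x1(avg_pool2(x_h), w_hl)
    return y_h, y_l

rng = np.random.default_rng(0)
alpha = 0.5                  # fraction of channels stored at half resolution
c_in, c_out, H = 8, 8, 16
c_in_l, c_out_l = int(alpha * c_in), int(alpha * c_out)
c_in_h, c_out_h = c_in - c_in_l, c_out - c_out_l

x_h = rng.standard_normal((c_in_h, H, H))            # high-frequency maps, full resolution
x_l = rng.standard_normal((c_in_l, H // 2, H // 2))  # low-frequency maps, half resolution

y_h, y_l = octconv1x1(
    x_h, x_l,
    rng.standard_normal((c_out_h, c_in_h)),  # w_hh
    rng.standard_normal((c_out_l, c_in_h)),  # w_hl
    rng.standard_normal((c_out_h, c_in_l)),  # w_lh
    rng.standard_normal((c_out_l, c_in_l)),  # w_ll
)
assert y_h.shape == (c_out_h, H, H)
assert y_l.shape == (c_out_l, H // 2, H // 2)
```

Because the low-frequency group lives at half resolution, its spatial cost is a quarter of the high-frequency group's, which is where the memory and FLOP savings come from.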
Detailed Dense Inference with Convolutional Neural Networks via Discrete Wavelet Transform
Dense pixelwise prediction such as semantic segmentation is an ongoing
challenge for deep convolutional neural networks (CNNs). Many state-of-the-art
approaches either tackle the loss of high-resolution information due to pooling
in the encoder stage, or use dilated convolutions or high-resolution lanes to
maintain detailed feature maps and predictions. Motivated by the structural
analogy between multi-resolution wavelet analysis and the pooling/unpooling
layers of CNNs, we introduce discrete wavelet transform (DWT) into the CNN
encoder-decoder architecture and propose WCNN. The high-frequency wavelet
coefficients are computed at the encoder and later used at the decoder, where
they are unpooled jointly with the coarse-resolution feature maps through the
inverse DWT.
The DWT/iDWT is further used to develop two wavelet pyramids to capture the
global context, where the multi-resolution DWT is applied to successively
reduce the spatial resolution and increase the receptive field. In experiments
on the Cityscapes dataset, the proposed WCNNs are computationally efficient and
improve accuracy for high-resolution dense pixelwise prediction.
Comment: This work was first submitted to NIPS 2017, May 201
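The DWT/pooling analogy rests on the fact that one level of the wavelet transform halves resolution while retaining exactly the high-frequency detail needed to invert the step. A minimal sketch with the Haar wavelet (Haar is an assumption here; the paper's wavelet choice may differ):

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar DWT: returns (LL, LH, HL, HH) subbands at half resolution."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]; c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse Haar DWT: 'unpools' the coarse LL map jointly with the high-frequency subbands."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = a; out[0::2, 1::2] = b
    out[1::2, 0::2] = c; out[1::2, 1::2] = d
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
subbands = haar_dwt2(img)
assert np.allclose(haar_idwt2(*subbands), img)   # perfect reconstruction
```

Treating LL as the pooled feature map and caching LH/HL/HH for the decoder is what lets the iDWT-based unpooling restore detail that ordinary max/average pooling would discard.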
Single Image Super-resolution via a Lightweight Residual Convolutional Neural Network
Recent years have witnessed great success of convolutional neural network
(CNN) for various problems in both low- and high-level vision. Especially
noteworthy is the residual network which was originally proposed to handle
high-level vision problems and enjoys several merits. This paper aims to extend
the merits of residual network, such as skip connection induced fast training,
for a typical low-level vision problem, i.e., single image super-resolution. In
general, the two main challenges of existing deep CNNs for super-resolution lie
in the gradient exploding/vanishing problem and large numbers of parameters or
computational cost as CNN goes deeper. Correspondingly, the skip connections or
identity mapping shortcuts are utilized to avoid gradient exploding/vanishing
problem. In addition, the skip connections naturally center the activations,
which leads to better performance. To tackle the second problem,
a lightweight CNN architecture which has carefully designed width, depth and
skip connections is proposed. In particular, a strategy of gradually varying
the shape of the network is proposed for the residual network. Different residual
architectures for image super-resolution have also been compared. Experimental
results have demonstrated that the proposed CNN model can not only achieve
state-of-the-art PSNR and SSIM results for single image super-resolution but
also produce visually pleasant results. This paper extends the MMM 2017 oral
conference paper with considerable new analyses and more experiments,
especially from the perspective of centering activations and the ensemble
behavior of residual networks.
Comment: Extension of MMM 2017 paper
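The identity-shortcut argument can be made concrete with a tiny residual block (a generic sketch, not the paper's exact architecture; fully-connected layers stand in for convolutions, and the small-weight initialization is an illustrative assumption):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x): the identity shortcut lets signal (and gradients) bypass F,
    mitigating the vanishing/exploding-gradient problem as networks deepen."""
    return x + w2 @ relu(w1 @ x)

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)
w1 = rng.standard_normal((d, d)) * 0.01
w2 = rng.standard_normal((d, d)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights the block starts close to the identity mapping,
# so training begins from a stable point and learns only the residual.
assert np.allclose(y, x, atol=1e-2)
```

Stacking such blocks keeps the network trainable at depth, which is the property the abstract transfers from high-level vision to super-resolution.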
Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition
In this paper, we propose a novel Convolutional Neural Network (CNN)
architecture for learning multi-scale feature representations with good
tradeoffs between speed and accuracy. This is achieved by using a multi-branch
network, which has different computational complexity at different branches.
Through frequent merging of features from branches at distinct scales, our
model obtains multi-scale features while using less computation. The proposed
approach demonstrates improvement of model efficiency and performance on both
object recognition and speech recognition tasks, using popular architectures
including ResNet and ResNeXt. For object recognition, our approach reduces
computation by 33% while improving accuracy by 0.9%.
Furthermore, our model surpasses state-of-the-art CNN acceleration approaches
by a large margin in accuracy and FLOPs reduction. On the task of speech
recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better
word error rates, showing good generalization across domains. The code is
available at https://github.com/IBM/BigLittleNet
Comment: git repo: https://github.com/IBM/BigLittleNe
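The big-little idea of merging a deeper branch run on a cheap, reduced input with a shallow branch run on the full input can be sketched as follows (fully-connected stand-ins for the branches and merge by projection plus addition are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

def layer(x, w):
    return np.maximum(w @ x, 0.0)   # a single fully-connected ReLU "layer" stand-in

def big_little_merge(x, big_ws, little_w, proj):
    """Big branch: more layers on a reduced (cheap) input.
    Little branch: one light layer on the full input.
    Merge: project the big-branch features and add."""
    big = x[::2]                      # reduced-resolution input for the big branch
    for w in big_ws:
        big = layer(big, w)
    little = layer(x, little_w)       # full-resolution, low-complexity branch
    return little + proj @ big        # merge features from the two scales

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d)
big_ws = [rng.standard_normal((d // 2, d // 2)) for _ in range(3)]
little_w = rng.standard_normal((d, d))
proj = rng.standard_normal((d, d // 2))

out = big_little_merge(x, big_ws, little_w, proj)
assert out.shape == (d,)
```

The big branch does most of the per-layer work on a signal of half the size, which is why frequent merging yields multi-scale features at reduced total computation.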
3DSRnet: Video Super-resolution using 3D Convolutional Neural Networks
In video super-resolution, the spatio-temporal coherence within and among
the frames must be exploited appropriately for accurate prediction of the high
resolution frames. Although 2D convolutional neural networks (CNNs) are
powerful in modelling images, 3D-CNNs are more suitable for spatio-temporal
feature extraction as they can preserve temporal information. To this end, we
propose an effective 3D-CNN for video super-resolution, called the 3DSRnet that
does not require motion alignment as preprocessing. Our 3DSRnet maintains the
temporal depth of spatio-temporal feature maps to maximally capture the
temporally nonlinear characteristics between low and high resolution frames,
and adopts residual learning in conjunction with the sub-pixel outputs. It
outperforms the state-of-the-art method by an average of 0.45 and 0.36 dB in
PSNR for scales 3 and 4, respectively, on the Vidset4 benchmark. Our 3DSRnet is
also the first to address the performance drop due to scene change, which is
important in practice but has not been previously considered.
Comment: Extension of our paper accepted at ICIP 201
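The benefit of 3D convolution here is that filtering spans the temporal axis jointly with the spatial axes, so temporal depth is carried through the layer rather than collapsed up front. A naive valid 3-D convolution makes the shape behavior explicit (an illustrative sketch, not 3DSRnet itself; single channel, no padding):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive valid 3-D convolution over a (T, H, W) clip: each output value
    mixes a local spatio-temporal neighbourhood instead of a purely spatial one."""
    kt, kh, kw = kernel.shape
    T, H, W = clip.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
clip = rng.standard_normal((5, 8, 8))      # 5 consecutive low-resolution frames
kernel = rng.standard_normal((3, 3, 3))    # spatio-temporal filter
out = conv3d_valid(clip, kernel)
assert out.shape == (3, 6, 6)              # temporal depth shrinks only by kt - 1
```

A 2D network would instead stack the 5 frames as channels and lose the temporal axis after the first layer; preserving it is what lets the network model temporally nonlinear relations between frames.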
Deep Learning-based Image Super-Resolution Considering Quantitative and Perceptual Quality
Recently, it has been shown that in super-resolution, there exists a tradeoff
relationship between the quantitative and perceptual quality of super-resolved
images, which correspond to the similarity to the ground-truth images and the
naturalness, respectively. In this paper, we propose a novel super-resolution
method that can improve the perceptual quality of the upscaled images while
preserving the conventional quantitative performance. The proposed method
employs a deep network for multi-pass upscaling together with a discriminator
network and two quantitative score predictor networks. Experimental results
demonstrate that the proposed method achieves a good balance of the
quantitative and perceptual quality, showing more satisfactory results than
existing methods.
Comment: Won the 2nd place for Region 2 in the PIRM Challenge on Perceptual
Super Resolution at ECCV 2018. GitHub at
https://github.com/idearibosome/tf-perceptual-eus
Skeleton-Based Action Recognition with Synchronous Local and Non-local Spatio-temporal Learning and Frequency Attention
Benefiting from its succinctness and robustness, skeleton-based action
recognition has recently attracted much attention. Most existing methods
utilize local networks (e.g., recurrent, convolutional, and graph convolutional
networks) to extract spatio-temporal dynamics hierarchically. As a consequence,
the local and non-local dependencies, which contain more details and semantics
respectively, are asynchronously captured in layers at different levels.
Moreover, existing methods are limited to the spatio-temporal domain and ignore
information in the frequency domain. To better extract synchronous detailed and
semantic information from multi-domains, we propose a residual frequency
attention (rFA) block to focus on discriminative patterns in the frequency
domain, and a synchronous local and non-local (SLnL) block to simultaneously
capture the details and semantics in the spatio-temporal domain. Besides, a
soft-margin focal loss (SMFL) is proposed to optimize the whole learning
process, which automatically conducts data selection and encourages intrinsic
margins in classifiers. Our approach significantly outperforms other
state-of-the-art methods on several large-scale datasets.
Comment: 6 pages, 4 figures; accepted to ICME 201
A Deep Journey into Super-resolution: A Survey
Deep convolutional networks based super-resolution is a fast-growing field
with numerous practical applications. In this exposition, we extensively
compare 30+ state-of-the-art super-resolution Convolutional Neural Networks
(CNNs) over three classical and three recently introduced challenging datasets
to benchmark single image super-resolution. We introduce a taxonomy for
deep-learning based super-resolution networks that groups existing methods into
nine categories including linear, residual, multi-branch, recursive,
progressive, attention-based and adversarial designs. We also provide
comparisons between the models in terms of network complexity, memory
footprint, model input and output, learning details, the type of network losses
and important architectural differences (e.g., depth, skip-connections,
filters). The extensive evaluation performed shows consistent and rapid growth
in accuracy over the past few years, along with a corresponding boost in model
complexity and the availability of large-scale datasets. It is also
observed that the pioneering methods identified as the benchmark have been
significantly outperformed by the current contenders. Despite the progress in
recent years, we identify several shortcomings of existing techniques and
provide future research directions towards the solution of these open problems.
Comment: Accepted in ACM Computing Survey