Residual Dense Network for Image Super-Resolution
Very deep convolutional neural networks (CNNs) have recently achieved great
success in image super-resolution (SR) and also provide hierarchical features.
However, most deep CNN based SR models do not make full use of the
hierarchical features from the original low-resolution (LR) images, thereby
achieving relatively low performance. In this paper, we propose a novel
residual dense network (RDN) to address this problem in image SR. We fully
exploit the hierarchical features from all the convolutional layers.
Specifically, we propose the residual dense block (RDB) to extract abundant local
features via densely connected convolutional layers. RDB further allows direct
connections from the state of the preceding RDB to all the layers of the current
RDB, leading to a contiguous memory (CM) mechanism. Local feature fusion in RDB
is then used to adaptively learn more effective features from preceding and
current local features, and stabilizes the training of the wider network. After
fully obtaining dense local features, we use global feature fusion to jointly
and adaptively learn global hierarchical features in a holistic way. Extensive
experiments on benchmark datasets with different degradation models show that
our RDN achieves favorable performance against state-of-the-art methods.
Comment: To appear in CVPR 2018 as spotlight
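The block structure described above can be sketched numerically. Below is a minimal NumPy sketch, using 1-D feature vectors in place of feature maps and matrix multiplications in place of 3x3 and 1x1 convolutions; all names and sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_dense_block(x, conv_weights, fuse_weight):
    """Toy residual dense block on 1-D 'feature maps' (channels only).

    x            : (C,) features entering the block
    conv_weights : list of matrices standing in for 3x3 convs; layer k
                   sees the concatenation of x and all earlier outputs
    fuse_weight  : matrix standing in for the 1x1 local-fusion conv
    """
    features = [x]
    for W in conv_weights:
        inp = np.concatenate(features)          # dense connections
        features.append(relu(W @ inp))
    fused = fuse_weight @ np.concatenate(features)  # local feature fusion
    return x + fused                                # local residual learning

rng = np.random.default_rng(0)
C, G, layers = 8, 8, 3
weights = [rng.normal(size=(G, C + k * G)) * 0.1 for k in range(layers)]
fuse = rng.normal(size=(C, C + layers * G)) * 0.1
y = residual_dense_block(rng.normal(size=C), weights, fuse)
```

The key points the sketch shows are that layer k's input width grows with every preceding layer (dense connectivity), and that fusion brings the concatenation back to the block's input width so the residual addition is well-defined.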
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on the between-class
difference of the arbitrary objects, while the tracking task mainly deals with
the within-class difference of the same objects. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is coupled well to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces the multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods.
Comment: 20 pages, 7 figures
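The similarity-verification step at the core of this tracker can be illustrated in a few lines. The sketch below uses cosine similarity as a stand-in for the learned Siamese similarity; the embeddings and shapes are hypothetical.

```python
import numpy as np

def best_candidate(template_feat, candidate_feats):
    """Pick the target candidate most similar to the target template.

    template_feat   : (D,) embedding of the target template
    candidate_feats : (N, D) embeddings of N target candidates
    Returns the index of the candidate with highest cosine similarity.
    """
    t = template_feat / np.linalg.norm(template_feat)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    scores = c @ t                  # cosine similarity to the template
    return int(np.argmax(scores))   # locate the most similar candidate

template = np.array([1.0, 0.0, 0.0])
cands = np.array([[0.0, 1.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.5, 0.5, 0.5]])
idx = best_candidate(template, cands)
```

In the real tracker the embeddings would come from the pre-trained hierarchical spatial-aware Siamese CNN rather than being raw vectors.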
Monocular Depth Estimation with Hierarchical Fusion of Dilated CNNs and Soft-Weighted-Sum Inference
Monocular depth estimation is a challenging task in complex compositions
depicting multiple objects of diverse scales. Despite the recent great progress
thanks to deep convolutional neural networks (CNNs), state-of-the-art
monocular depth estimation methods still fall short of handling such real-world
challenging scenarios. In this paper, we propose a deep end-to-end learning
framework to tackle these challenges, which learns the direct mapping from a
color image to the corresponding depth map. First, we represent monocular depth
estimation as a multi-category dense labeling task, in contrast to the
regression-based formulation. In this way, we can build upon the recent
progress in dense labeling such as semantic segmentation. Second, we fuse
different side-outputs from our front-end dilated convolutional neural network
in a hierarchical way to exploit the multi-scale depth cues for depth
estimation, which is critical to achieve scale-aware depth estimation. Third,
we propose to utilize soft-weighted-sum inference instead of the hard-max
inference, transforming the discretized depth scores into continuous depth values.
Thus, we reduce the influence of quantization error and improve the robustness
of our method. Extensive experiments on the NYU Depth V2 and KITTI datasets
show the superiority of our method compared with current state-of-the-art
methods. Furthermore, experiments on the NYU V2 dataset reveal that our model
is able to learn the probability distribution of depth.
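The soft-weighted-sum inference described above has a simple closed form: the continuous depth is the expectation of the discretized depth bins under the softmax of the per-bin scores. A minimal sketch, with hypothetical bin centres:

```python
import numpy as np

def soft_weighted_sum(scores, bin_depths):
    """Turn classification scores over discretized depth bins into a
    continuous depth value, instead of taking the hard arg-max bin."""
    e = np.exp(scores - scores.max())
    p = e / e.sum()                 # softmax over depth bins
    return float(p @ bin_depths)    # expectation over bin centres

bins = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical bin centres (metres)
scores = np.array([0.0, 3.0, 3.0, 0.0])  # ambiguous between 2 m and 3 m
depth = soft_weighted_sum(scores, bins)
hard = float(bins[np.argmax(scores)])
```

With the ambiguous scores above, hard-max inference must commit to one bin, while the soft-weighted sum lands between the two plausible bins, which is exactly how quantization error is reduced.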
Tucker Decomposition Network: Expressive Power and Comparison
Deep neural networks have achieved a great success in solving many machine
learning and computer vision problems. The main contribution of this paper is
to develop a deep network based on Tucker tensor decomposition, and analyze its
expressive power. It is shown that the expressiveness of Tucker network is more
powerful than that of shallow network. In general, it is required to use an
exponential number of nodes in a shallow network in order to represent a Tucker
network. Experimental results are also given to compare the performance of the
proposed Tucker network with hierarchical tensor networks and shallow networks,
and to demonstrate the usefulness of the Tucker network in image classification
problems.
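The Tucker decomposition underlying the network expresses a tensor as a small core tensor multiplied along each mode by a factor matrix. A minimal 3-way reconstruction sketch (the shapes are arbitrary examples):

```python
import numpy as np

def tucker_reconstruct(core, U1, U2, U3):
    """Reconstruct a 3-way tensor from its Tucker factors:
    X[i,j,k] = sum_{a,b,c} G[a,b,c] * U1[i,a] * U2[j,b] * U3[k,c]."""
    return np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

rng = np.random.default_rng(1)
G = rng.normal(size=(2, 3, 2))            # small core tensor
U1 = rng.normal(size=(4, 2))              # mode-1 factor matrix
U2 = rng.normal(size=(5, 3))              # mode-2 factor matrix
U3 = rng.normal(size=(6, 2))              # mode-3 factor matrix
X = tucker_reconstruct(G, U1, U2, U3)     # full (4, 5, 6) tensor
```

The expressiveness argument in the abstract is about this factorized form: a (4, 5, 6) tensor is represented here with far fewer parameters than its 120 entries, and a shallow network can need exponentially many nodes to match it.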
Unconstrained Fashion Landmark Detection via Hierarchical Recurrent Transformer Networks
Fashion landmarks are functional key points defined on clothes, such as
corners of neckline, hemline, and cuff. They have been recently introduced as
an effective visual representation for fashion image understanding. However,
detecting fashion landmarks is challenging due to background clutter, human
poses, and scales. To remove the above variations, previous works usually
assumed that bounding boxes of clothes are provided in training and test as
additional annotations, which are expensive to obtain and inapplicable in
practice. This work addresses unconstrained fashion landmark detection, where
clothing bounding boxes are provided in neither training nor test. To this
end, we present a novel Deep LAndmark Network (DLAN), where bounding boxes and
landmarks are jointly estimated and trained iteratively in an end-to-end
manner. DLAN contains two dedicated modules, including a Selective Dilated
Convolution for handling scale discrepancies, and a Hierarchical Recurrent
Spatial Transformer for handling background clutters. To evaluate DLAN, we
present a large-scale fashion landmark dataset, namely Unconstrained Landmark
Database (ULD), consisting of 30K images. Statistics show that ULD is more
challenging than existing datasets in terms of image scales, background
clutters, and human poses. Extensive experiments demonstrate the effectiveness
of DLAN over the state-of-the-art methods. DLAN also exhibits excellent
generalization across different clothing categories and modalities, making it
extremely suitable for real-world fashion analysis.
Comment: To appear in ACM Multimedia (ACM MM) 2017 as a full research paper.
More details at the project page:
http://personal.ie.cuhk.edu.hk/~lz013/projects/UnconstrainedLandmarks.htm
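Dilated convolution, the basic operation behind the Selective Dilated Convolution module's handling of scale discrepancies, spaces the kernel taps apart to enlarge the receptive field without adding parameters. A 1-D sketch (not the paper's module, just the underlying operation):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution with valid padding: taps are spaced
    `dilation` apart, so the receptive field grows with the dilation
    rate while the number of weights stays fixed."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field size
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
box = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, box, dilation=1)    # receptive field 3
y2 = dilated_conv1d(x, box, dilation=2)    # receptive field 5, same 3 weights
```

A "selective" variant would run several dilation rates in parallel and weight them per location; here only the single-rate primitive is shown.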
Deep Structured Models For Group Activity Recognition
This paper presents a deep neural-network-based hierarchical graphical model
for individual and group activity recognition in surveillance scenes. Deep
networks are used to recognize the actions of individual people in a scene.
Next, a neural-network-based hierarchical graphical model refines the predicted
labels for each class by considering dependencies between the classes. This
refinement step mimics a message-passing step similar to inference in a
probabilistic graphical model. We show that this approach can be effective in
group activity recognition, with the deep graphical model improving recognition
rates over baseline methods.
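The refinement step described above can be caricatured as a few rounds of score updates, where each class's score is adjusted by its compatibility with the other classes' current beliefs. The compatibility matrix below is a hypothetical stand-in for what the neural-network-based graphical model would learn:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def refine_scores(unary, W, steps=2):
    """Message-passing-style refinement of per-class scores.

    unary : (K,) initial class scores from the deep network
    W     : (K, K) class-compatibility matrix (illustrative, would be learned)
    """
    s = unary.copy()
    for _ in range(steps):
        # each class receives messages weighted by the current beliefs
        s = unary + W @ softmax(s)
    return s

unary = np.array([2.0, 1.0, 0.0])
W = np.array([[0.0, 0.5, -0.5],
              [0.5, 0.0,  0.0],
              [-0.5, 0.0, 0.0]])
refined = refine_scores(unary, W)
```

This mimics inference in a probabilistic graphical model: compatible classes reinforce each other across iterations, incompatible ones suppress each other.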
Hierarchical Cellular Automata for Visual Saliency
Saliency detection, finding the most important parts of an image, has become
increasingly popular in computer vision. In this paper, we introduce
Hierarchical Cellular Automata (HCA) -- a temporally evolving model to
intelligently detect salient objects. HCA consists of two main components:
Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an
unsupervised propagation mechanism, Single-layer Cellular Automata can exploit
the intrinsic relevance of similar regions through interactions with neighbors.
Low-level image features as well as high-level semantic information extracted
from deep neural networks are incorporated into the SCA to measure the
correlation between different image patches. With these hierarchical deep
features, an impact factor matrix and a coherence matrix are constructed to
balance the influences on each cell's next state. The saliency values of all
cells are iteratively updated according to a well-defined update rule.
Furthermore, we propose CCA to integrate multiple saliency maps generated by
SCA at different scales in a Bayesian framework. Therefore, single-layer
propagation and multi-layer integration are jointly modeled in our unified HCA.
Surprisingly, we find that the SCA can improve all existing methods that we
applied it to, resulting in a similar precision level regardless of the
original results. The CCA can act as an efficient pixel-wise aggregation
algorithm that can integrate state-of-the-art methods, resulting in even better
results. Extensive experiments on four challenging datasets demonstrate that
the proposed algorithm outperforms state-of-the-art conventional methods and is
competitive with deep learning based approaches.
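The single-layer propagation step has a simple shape: each cell's next saliency is a coherence-weighted mix of its own state and the impact-weighted saliency of its neighbors. A toy sketch with a hypothetical 3-cell impact matrix (the real matrices are built from hierarchical deep features):

```python
import numpy as np

def sca_step(s, F, c):
    """One synchronous Single-layer Cellular Automata update: a cell keeps
    a fraction c of its own saliency (coherence) and takes the rest from
    the impact-weighted saliency of its neighbours."""
    return c * s + (1.0 - c) * (F @ s)

# 3 cells; each row of F sums to 1 (impact factors over neighbours)
F = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
c = np.array([0.6, 0.6, 0.6])       # coherence: trust in a cell's own state
s = np.array([1.0, 0.0, 0.0])       # initial saliency values
for _ in range(5):                   # iterate the update rule
    s = sca_step(s, F, c)
```

Because each update is a convex combination of saliency values, the values stay in [0, 1] while saliency propagates from the seeded cell to similar neighbors, which is why the mechanism can polish an arbitrary initial saliency map.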
Channel Splitting Network for Single MR Image Super-Resolution
High resolution magnetic resonance (MR) imaging is desirable in many clinical
applications due to its contribution to more accurate subsequent analyses and
early clinical diagnoses. Single image super resolution (SISR) is an effective
and cost efficient alternative technique to improve the spatial resolution of
MR images. In the past few years, SISR methods based on deep learning
techniques, especially convolutional neural networks (CNNs), have achieved
state-of-the-art performance on natural images. However, the information is
gradually weakened and training becomes increasingly difficult as the network
deepens. The problem is more serious for medical images because the lack of
high-quality and effective training samples makes deep models prone to
underfitting or overfitting. Moreover, many current models treat the
hierarchical features on different channels equally, which does not help the
models handle the hierarchical features discriminatively and in a targeted way.
To this end, we present a novel channel splitting network (CSN) to ease the
representational burden of deep models. The proposed CSN model divides the
hierarchical features into two branches, i.e., residual branch and dense
branch, with different information transmissions. The residual branch is able
to promote feature reuse, while the dense branch is beneficial to the
exploration of new features. Besides, we also adopt the merge-and-run mapping
to facilitate information integration between different branches. Extensive
experiments on various MR images, including proton density (PD), T1 and T2
images, show that the proposed CSN model achieves superior performance over
other state-of-the-art SISR methods.
Comment: 13 pages, 11 figures and 4 tables
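The channel split plus merge-and-run idea can be sketched on plain vectors. The branch functions below are trivial placeholders for the residual and dense branches; the point is the split and the cross-branch information exchange:

```python
import numpy as np

def merge_and_run(x1, x2, f1, f2):
    """Merge-and-run mapping: each branch's output is added to the average
    of both branch inputs, so information flows across branches."""
    mid = 0.5 * (x1 + x2)
    return f1(x1) + mid, f2(x2) + mid

def channel_split_stage(x, f_res, f_dense):
    """Toy channel-splitting stage: split the channels into a residual
    branch and a dense branch, process each, then exchange information
    via the merge-and-run mapping."""
    half = x.shape[0] // 2
    x_res, x_dense = x[:half], x[half:]
    y_res, y_dense = merge_and_run(x_res, x_dense, f_res, f_dense)
    return np.concatenate([y_res, y_dense])

x = np.arange(8, dtype=float)
# placeholder branch transforms (a real CSN uses conv sub-networks)
y = channel_split_stage(x, lambda v: 0.1 * v, lambda v: -0.1 * v)
```

Splitting halves the per-branch width, easing the representational burden, while the merge-and-run average keeps the two branches from diverging into disjoint feature spaces.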
Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction
Computational saliency models for still images have gained significant
popularity in recent years. Saliency prediction from videos, on the other hand,
has received relatively little interest from the community. Motivated by this,
in this work, we study the use of deep learning for dynamic saliency prediction
and propose the so-called spatio-temporal saliency networks. The key to our
models is the architecture of two-stream networks where we investigate
different fusion mechanisms to integrate spatial and temporal information. We
evaluate our models on the DIEM and UCF-Sports datasets and present highly
competitive results against the existing state-of-the-art models. We also carry
out some experiments on a number of still images from the MIT300 dataset by
exploiting the optical flow maps predicted from these images. Our results show
that considering inherent motion information in this way can be helpful for
static saliency estimation.
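The fusion mechanisms investigated for two-stream networks typically reduce to a handful of simple operations on the two streams' feature maps. A sketch of three common choices, with toy 1-D features and an illustrative fusion weight matrix:

```python
import numpy as np

def fuse(spatial, temporal, mode="sum", W=None):
    """Combine spatial-stream and temporal-stream features into one map."""
    if mode == "sum":                 # element-wise sum fusion
        return spatial + temporal
    if mode == "max":                 # element-wise max fusion
        return np.maximum(spatial, temporal)
    if mode == "conv":                # concatenation + 1x1 conv (a matrix here)
        return W @ np.concatenate([spatial, temporal])
    raise ValueError(f"unknown fusion mode: {mode}")

s = np.array([0.2, 0.8, 0.1])        # spatial-stream features
t = np.array([0.5, 0.1, 0.3])        # temporal-stream features
W = np.full((3, 6), 0.5)             # illustrative learned fusion weights
out_sum = fuse(s, t, "sum")
out_max = fuse(s, t, "max")
out_conv = fuse(s, t, "conv", W)
```

Sum and max fusion are parameter-free, while the convolutional variant lets the network learn how to mix the streams; which works best is exactly the empirical question such two-stream studies investigate.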
Spatiotemporal Recurrent Convolutional Networks for Recognizing Spontaneous Micro-expressions
Recently, the recognition task of spontaneous facial micro-expressions has
attracted much attention with its various real-world applications. Plenty of
handcrafted or learned features have been employed for a variety of classifiers
and achieved promising performances for recognizing micro-expressions. However,
the micro-expression recognition is still challenging due to the subtle
spatiotemporal changes of micro-expressions. To exploit the merits of deep
learning, we propose a novel deep recurrent convolutional networks based
micro-expression recognition approach, capturing the spatial-temporal
deformations of micro-expression sequence. Specifically, the proposed deep
model is constituted of several recurrent convolutional layers for extracting
visual features and a classification layer for recognition. It is optimized in
an end-to-end manner and obviates manual feature design. To handle sequential
data, we exploit two ways of extending the connectivity of convolutional
networks across the temporal domain, in which the spatiotemporal deformations
are modeled in terms of facial appearance and geometry separately. Besides, to
overcome the shortcomings of limited and imbalanced training samples, temporal
data augmentation strategies as well as a balanced loss are jointly used for
our deep network. By performing the experiments on three spontaneous
micro-expression datasets, we verify the effectiveness of our proposed
micro-expression recognition approach compared to the state-of-the-art methods.
Comment: Submitted to IEEE TM
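One common form of the balanced loss mentioned above is a class-weighted cross-entropy, where each sample's loss is inversely weighted by its class frequency so that rare micro-expression classes are not drowned out. A minimal sketch with hypothetical class counts (the paper's exact weighting scheme may differ):

```python
import numpy as np

def balanced_cross_entropy(probs, label, class_counts):
    """Class-balanced cross-entropy for one sample.

    probs        : (K,) predicted class probabilities
    label        : true class index
    class_counts : (K,) number of training samples per class
    """
    # weight each class inversely to its frequency, normalized so the
    # average weight over classes is 1
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    return float(-weights[label] * np.log(probs[label]))

counts = np.array([100, 10, 10])           # imbalanced training set (hypothetical)
loss_rare = balanced_cross_entropy(np.array([0.2, 0.7, 0.1]), 1, counts)
loss_common = balanced_cross_entropy(np.array([0.7, 0.2, 0.1]), 0, counts)
```

With the same predicted probability (0.7) for the true class, a sample from a rare class incurs a much larger loss than one from the common class, pushing the network to attend to minority classes.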