Deep Saliency with Encoded Low level Distance Map and High Level Features
Recent advances in saliency detection have utilized deep learning to obtain
high level features to detect salient regions in a scene. These advances have
demonstrated superior results over previous works that utilize hand-crafted low
level features for saliency detection. In this paper, we demonstrate that
hand-crafted features can provide complementary information to enhance
performance of saliency detection that utilizes only high level features. Our
method utilizes both high level and low level features for saliency detection
under a unified deep learning framework. The high level features are extracted
using the VGG-net, and the low level features are compared with other parts of
an image to form a low level distance map. The low level distance map is then
encoded using a convolutional neural network (CNN) with multiple 1x1
convolutional and ReLU layers. We concatenate the encoded low level distance
map and the high level features, and connect them to a fully connected neural
network classifier to evaluate the saliency of a query region. Our experiments
show that our method can further improve the performance of state-of-the-art
deep learning-based saliency detection methods.
Comment: Accepted by IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016. Project page:
https://github.com/gylee1103/SaliencyEL
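
For a concrete picture of the fusion step this abstract describes, the PyTorch sketch below encodes a low-level distance map with stacked 1x1 convolution and ReLU layers, concatenates it with high-level features, and scores a query region with a small fully connected head. Channel widths, the pooling step, and the class name are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SaliencyFusionSketch(nn.Module):
    """Illustrative fusion of an encoded low-level distance map with
    high-level (VGG-style) features. Channel sizes are assumptions,
    not the paper's exact values."""
    def __init__(self, low_ch=64, high_ch=512, enc_ch=128):
        super().__init__()
        # Encode the low-level distance map with stacked 1x1 conv + ReLU layers.
        self.low_encoder = nn.Sequential(
            nn.Conv2d(low_ch, enc_ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(enc_ch, enc_ch, kernel_size=1), nn.ReLU(inplace=True),
        )
        # Fully connected classifier scoring the saliency of a query region.
        self.classifier = nn.Sequential(
            nn.Linear(enc_ch + high_ch, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, low_distance_map, high_features):
        enc = self.low_encoder(low_distance_map)        # (B, enc_ch, H, W)
        fused = torch.cat([enc, high_features], dim=1)  # channel-wise concatenation
        pooled = fused.mean(dim=(2, 3))                 # pool before the FC head
        return self.classifier(pooled)                  # saliency score per region
```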
SPLINE-Net: Sparse Photometric Stereo through Lighting Interpolation and Normal Estimation Networks
This paper solves sparse photometric stereo through Lighting Interpolation
and Normal Estimation using a generative network (SPLINE-Net).
SPLINE-Net contains a lighting interpolation network to generate dense lighting
observations given a sparse set of lights as inputs followed by a normal
estimation network to estimate surface normals. Both networks are jointly
constrained by the proposed symmetric and asymmetric loss functions to enforce
the isotropy constraint and to perform outlier rejection of global illumination
effects. SPLINE-Net is verified to outperform existing methods for photometric
stereo of general BRDFs by using only ten images of different lights instead of
using nearly one hundred images.
Comment: Accepted to ICCV 201
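
A rough wiring of the two-stage pipeline this abstract describes might look like the sketch below: a lighting interpolation network densifies the sparse observations, and a second network regresses per-pixel normals from the dense maps. Layer choices and channel counts are placeholders rather than the published SPLINE-Net architecture, and the symmetric/asymmetric losses are omitted.

```python
import torch
import torch.nn as nn

class SplineNetSketch(nn.Module):
    """Two-stage sketch: sparse observations -> dense observations -> normals.
    Architectures and sizes here are placeholders, not the published layers."""
    def __init__(self, sparse_lights=10, dense_lights=100):
        super().__init__()
        self.interp_net = nn.Sequential(   # sparse -> dense observation maps
            nn.Conv2d(sparse_lights, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, dense_lights, 3, padding=1),
        )
        self.normal_net = nn.Sequential(   # dense observations -> normal map
            nn.Conv2d(dense_lights, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, sparse_obs):
        dense_obs = self.interp_net(sparse_obs)
        normals = self.normal_net(dense_obs)
        # Normalize to unit-length surface normals.
        return nn.functional.normalize(normals, dim=1), dense_obs
```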
Augmented Semantic Signatures of Airborne LiDAR Point Clouds for Comparison
LiDAR point clouds provide rich geometric information, which is particularly
useful for the analysis of complex scenes of urban regions. Finding structural
and semantic differences between two different three-dimensional point clouds,
say, of the same region but acquired at different time instances, is an
important problem. A comparison of point clouds involves computationally
expensive registration and segmentation. We are interested in capturing the
relative differences in the geometric uncertainty and semantic content of the
point cloud without the registration process. Hence, we propose an
orientation-invariant geometric signature of the point cloud, which integrates
its probabilistic geometric and semantic classifications. We study different
properties of the geometric signature, which is an image-based encoding of
geometric uncertainty and semantic content. We explore different metrics to
determine differences between these signatures, which in turn compare point
clouds without performing point-to-point registration. Our results show that
the differences in the signatures corroborate with the geometric and semantic
differences of the point clouds.
Comment: 18 pages, 6 figures, 1 table
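
As an illustration of the kind of registration-free comparison discussed above, the toy NumPy sketch below builds an orientation-invariant 2D histogram from rotation-invariant radial distances and per-point semantic entropy, then compares two such signatures with a simple norm. It mirrors the idea of an image-like encoding of geometric uncertainty and semantic content, but it is not the paper's signature.

```python
import numpy as np

def signature_sketch(points, semantic_probs, bins=32):
    """Toy orientation-invariant signature: a 2D histogram over distance
    from the centroid (rotation-invariant) and per-point semantic entropy.
    points: (N, 3) array; semantic_probs: (N, K) class probabilities."""
    centroid = points.mean(axis=0)
    radial = np.linalg.norm(points - centroid, axis=1)
    entropy = -(semantic_probs * np.log(semantic_probs + 1e-9)).sum(axis=1)
    hist, _, _ = np.histogram2d(radial, entropy, bins=bins, density=True)
    return hist

def signature_distance(sig_a, sig_b):
    """Compare two signatures without any point-to-point registration."""
    return np.linalg.norm(sig_a - sig_b)
```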
Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving
In this paper, we propose a monocular 3D object detection framework in the
domain of autonomous driving. Unlike previous image-based methods which focus
on RGB features extracted from 2D images, our method solves this problem in the
reconstructed 3D space in order to exploit 3D contexts explicitly. To this end,
we first leverage a stand-alone module to transform the input data from the 2D
image plane to 3D point-cloud space for a better input representation, then we
perform 3D detection using a PointNet backbone to obtain objects' 3D
locations, dimensions and orientations. To enhance the discriminative
capability of point clouds, we propose a multi-modal feature fusion module to
embed the complementary RGB cue into the generated point clouds representation.
We argue that it is more effective to infer the 3D bounding boxes from the
generated 3D scene space (i.e., X,Y, Z space) compared to the image plane
(i.e., R,G,B image plane). Evaluation on the challenging KITTI dataset shows
that our approach boosts the performance of the state-of-the-art monocular
approach by a large margin.
Comment: To appear in ICCV'1
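
The 2D-to-3D transform and RGB embedding described above can be pictured with the small NumPy sketch below, which back-projects a predicted depth map into a point cloud using known camera intrinsics and appends the per-pixel color; such a colored point cloud would then be fed to a PointNet-style backbone. The function name and interface are hypothetical, not the paper's code.

```python
import numpy as np

def depth_to_colored_points(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into 3D points and append the RGB cue.
    Camera intrinsics (fx, fy, cx, cy) are assumed known."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3) / 255.0
    return np.concatenate([xyz, colors], axis=1)   # (N, 6): XYZ + RGB
```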
Unsupervised Object Discovery and Co-Localization by Deep Descriptor Transforming
Reusable model design becomes desirable with the rapid expansion of computer
vision and machine learning applications. In this paper, we focus on the
reusability of pre-trained deep convolutional models. Specifically, different
from treating pre-trained models as feature extractors, we reveal more
treasures beneath convolutional layers, i.e., the convolutional activations
could act as a detector for the common object in the image co-localization
problem. We propose a simple yet effective method, termed Deep Descriptor
Transforming (DDT), for evaluating the correlations of descriptors and then
obtaining the category-consistent regions, which can accurately locate the
common object in a set of unlabeled images, i.e., unsupervised object
discovery. Empirical studies validate the effectiveness of the proposed DDT
method. On benchmark image co-localization datasets, DDT consistently
outperforms existing state-of-the-art methods by a large margin. Moreover, DDT
also demonstrates good generalization ability for unseen categories and
robustness when dealing with noisy data. Beyond these, DDT can also be employed
for harvesting web images into valid external data sources for improving
the performance of both image recognition and object detection.
Comment: This paper extends our preliminary work published in IJCAI 2017
[arXiv:1705.02758].
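
For a concrete sense of how convolutional activations can act as a common-object detector, the NumPy sketch below follows the spirit of Deep Descriptor Transforming: it pools deep descriptors from a set of unlabeled images, computes their first principal direction, and projects each spatial descriptor onto it, treating positive responses as category-consistent regions. The layer choice, resizing, and post-processing used in the paper are not reproduced here.

```python
import numpy as np

def ddt_sketch(feature_maps):
    """feature_maps: list of (C, H, W) activations from a pre-trained CNN.
    Returns one indicator map per image; positive values suggest the
    common object. Simplified illustration of the DDT idea."""
    descriptors = np.concatenate(
        [f.reshape(f.shape[0], -1).T for f in feature_maps], axis=0)  # (N, C)
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, -1]                      # first principal direction
    indicators = []
    for f in feature_maps:
        c, h, w = f.shape
        proj = (f.reshape(c, -1).T - mean) @ principal
        indicators.append(proj.reshape(h, w))       # positive = common object
    return indicators
```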
Attention-guided Network for Ghost-free High Dynamic Range Imaging
Ghosting artifacts caused by moving objects or misalignments are a key
challenge in high dynamic range (HDR) imaging for dynamic scenes. Previous
methods first register the input low dynamic range (LDR) images using optical
flow before merging them, a process that is error-prone and causes ghosting in
the results. A more recent work bypasses optical flow via a deep network with
skip connections, but it still suffers from ghosting artifacts under
severe motion. To avoid ghosting at the source, we propose a novel
attention-guided end-to-end deep neural network (AHDRNet) to produce
high-quality ghost-free HDR images. Unlike previous methods directly stacking
the LDR images or features for merging, we use attention modules to guide the
merging according to the reference image. The attention modules automatically
suppress undesired components caused by misalignments and saturation and
enhance desirable fine details in the non-reference images. In addition to the
attention model, we use dilated residual dense blocks (DRDBs) to make full use of
the hierarchical features and increase the receptive field for hallucinating
the missing details. The proposed AHDRNet is a non-flow-based method, which can
also avoid the artifacts generated by optical-flow estimation error.
Experiments on different datasets show that the proposed AHDRNet can achieve
state-of-the-art quantitative and qualitative results.
Comment: Accepted to appear at CVPR 201
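
The attention-guided merging idea can be sketched as below: an attention map, predicted from a non-reference feature concatenated with the reference feature, gates the non-reference feature before merging. The merging network itself (the DRDB stack) is omitted, and the channel count is an assumption rather than the paper's setting.

```python
import torch
import torch.nn as nn

class AttentionMergeSketch(nn.Module):
    """Minimal attention-guided gating of a non-reference feature by the
    reference feature, to suppress misaligned or saturated content before
    merging. Illustrative only."""
    def __init__(self, ch=64):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, ref_feat, nonref_feat):
        a = self.attn(torch.cat([ref_feat, nonref_feat], dim=1))
        return nonref_feat * a   # attended non-reference feature, ready to merge
```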
Learning the Depths of Moving People by Watching Frozen People
We present a method for predicting dense depth in scenarios where both a
monocular camera and people in the scene are freely moving. Existing methods
for recovering depth for dynamic, non-rigid objects from monocular video impose
strong assumptions on the objects' motion and may only recover sparse depth. In
this paper, we take a data-driven approach and learn human depth priors from a
new source of data: thousands of Internet videos of people imitating
mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera
tours the scene. Because people are stationary, training data can be generated
using multi-view stereo reconstruction. At inference time, our method uses
motion parallax cues from the static areas of the scenes to guide the depth
prediction. We demonstrate our method on real-world sequences of complex human
actions captured by a moving hand-held camera, show improvement over
state-of-the-art monocular depth prediction methods, and show various 3D
effects produced using our predicted depth.
Comment: CVPR 2019 (Oral)
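
To make the inference-time inputs concrete, the NumPy sketch below assembles the quantities the abstract mentions: the RGB frame, a motion-parallax depth estimate trusted only in static regions, a confidence channel, and the human mask marking regions whose depth the network must predict. The channel layout is an assumption for illustration, not the paper's exact specification.

```python
import numpy as np

def assemble_network_input(rgb, parallax_depth, human_mask):
    """rgb: (H, W, 3) uint8; parallax_depth: (H, W) depth from motion parallax;
    human_mask: (H, W) binary mask of people. Returns an (H, W, 6) input
    tensor with depth zeroed out on people and a static-region confidence."""
    masked_depth = np.where(human_mask > 0, 0.0, parallax_depth)  # zero-out people
    confidence = (human_mask == 0).astype(np.float32)             # valid where static
    return np.concatenate(
        [rgb.astype(np.float32) / 255.0,
         masked_depth[..., None],
         confidence[..., None],
         human_mask[..., None].astype(np.float32)], axis=-1)
```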
Structure-Preserving Image Super-resolution via Contextualized Multi-task Learning
Single image super-resolution (SR), which refers to reconstructing a
higher-resolution (HR) image from an observed low-resolution (LR) image, has
received substantial attention due to its tremendous application potentials.
Despite the breakthroughs of recently proposed SR methods using convolutional
neural networks (CNNs), their results often fail to preserve
structural (high-frequency) details. In this paper, regarding global boundary
context and residual context as complementary information for enhancing
structural details in image restoration, we develop a contextualized multi-task
learning framework to address the SR problem. Specifically, our method first
extracts convolutional features from the input LR image and applies one
deconvolutional module to interpolate the LR feature maps in a content-adaptive
way. Then, the resulting feature maps are fed into two branched sub-networks.
During the neural network training, one sub-network outputs salient image
boundaries and the HR image, and the other sub-network outputs the local
residual map, i.e., the residual difference between the generated HR image and
ground-truth image. On several standard benchmarks (i.e., Set5, Set14 and
BSD200), our extensive evaluations demonstrate the effectiveness of our SR
method on achieving both higher restoration quality and computational
efficiency compared with several state-of-the-art SR approaches. The source
code and some SR results can be found at:
http://hcp.sysu.edu.cn/structure-preserving-image-super-resolution/
Comment: To appear in Transactions on Multimedia 201
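
A minimal PyTorch sketch of the branched design described above: shared LR features are upsampled by a deconvolution, one branch predicts the HR image and salient boundaries, the other predicts a local residual map, and the two image outputs are summed. Layer sizes, the number of blocks, and the loss terms are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSRSketch(nn.Module):
    """Two-branch multi-task SR sketch: HR/boundary branch plus residual
    branch on top of deconvolution-upsampled features. Illustrative only."""
    def __init__(self, ch=64, scale=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.upsample = nn.ConvTranspose2d(ch, ch, scale * 2, stride=scale,
                                           padding=scale // 2)
        self.hr_branch = nn.Conv2d(ch, 3, 3, padding=1)        # HR image
        self.boundary_branch = nn.Conv2d(ch, 1, 3, padding=1)  # salient boundaries
        self.residual_branch = nn.Conv2d(ch, 3, 3, padding=1)  # local residual map

    def forward(self, lr):
        f = self.upsample(self.features(lr))
        hr = self.hr_branch(f)
        boundaries = self.boundary_branch(f)
        residual = self.residual_branch(f)
        return hr + residual, boundaries
```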
Real-world Underwater Enhancement: Challenges, Benchmarks, and Solutions
Underwater image enhancement is an important low-level vision task with many
applications, and numerous algorithms have been proposed in recent years.
These algorithms, developed upon various assumptions, demonstrate success in
different respects when evaluated on different data sets and metrics. In this work,
we set up an undersea image capturing system and construct a large-scale
Real-world Underwater Image Enhancement (RUIE) data set divided into three
subsets. The three subsets target three challenging aspects of enhancement,
i.e., image visibility quality, color casts, and higher-level
detection/classification, respectively. We conduct extensive and systematic
experiments on RUIE to evaluate the effectiveness and limitations of various
algorithms to enhance visibility and correct color casts on images with
hierarchical categories of degradation. Moreover, underwater image enhancement
in practice usually serves as a preprocessing step for mid-level and high-level
vision tasks. We thus exploit the object detection performance on enhanced
images as a brand new task-specific evaluation criterion. The findings from
these evaluations not only confirm what is commonly believed, but also suggest
promising solutions and new directions for visibility enhancement, color
correction, and object detection on real-world underwater images.
Comment: arXiv admin note: text overlap with arXiv:1712.04143 by other authors
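
The task-specific evaluation criterion described above amounts to a simple loop, sketched below with hypothetical callables (enhance, detector, compute_map) standing in for a chosen enhancement method, a detection model, and a mAP metric; none of these names come from the paper.

```python
def task_driven_eval(images, annotations, enhance, detector, compute_map):
    """Enhance each underwater image, run an object detector on the result,
    and score detection quality against ground-truth annotations.
    All three callables are hypothetical stand-ins."""
    detections = [detector(enhance(img)) for img in images]
    return compute_map(detections, annotations)
```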
CaseNet: Content-Adaptive Scale Interaction Networks for Scene Parsing
Objects in an image exhibit diverse scales. Adaptive receptive fields are
expected to capture a suitable range of context for accurate pixel-level
semantic prediction when handling objects of diverse sizes. Recently, atrous convolution
with different dilation rates has been used to generate multi-scale features
through several branches, and these features are fused for
prediction. However, there is a lack of explicit interaction among the branches
to adaptively make full use of the contexts. In this paper, we propose a
Content-Adaptive Scale Interaction Network (CaseNet) to exploit the multi-scale
features for scene parsing. We build the CaseNet based on the classic Atrous
Spatial Pyramid Pooling (ASPP) module, followed by the proposed contextual
scale interaction (CSI) module, and the scale adaptation (SA) module.
Specifically, first, for each spatial position, we enable context interaction
among different scales through scale-aware non-local operations across the
scales, i.e., the CSI module, which facilitates the generation of flexible mixed
receptive fields, instead of a traditional flat one. Second, the scale
adaptation module (SA) explicitly and softly selects the suitable scale for
each spatial position and each channel. Ablation studies demonstrate the
effectiveness of the proposed modules. We achieve state-of-the-art performance
on three scene parsing benchmarks: Cityscapes, ADE20K, and LIP.
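
A much simplified sketch of the scale interaction and scale adaptation ideas follows: given ASPP-style branch features, attention weights computed across the scale dimension re-mix the branches at every spatial position, and a sigmoid gate then softly selects channels of the mixed feature. This only illustrates the idea; the published CSI and SA modules use scale-aware non-local operations and a different parameterization.

```python
import torch
import torch.nn as nn

class ScaleInteractionSketch(nn.Module):
    """Toy cross-scale interaction: attention over the scale dimension per
    spatial position, followed by a soft per-channel gate. Illustrative
    only; not the CSI/SA modules as published."""
    def __init__(self, ch=256):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, kernel_size=1)   # per-scale score map
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, kernel_size=1), nn.Sigmoid())

    def forward(self, branch_feats):                   # list of S tensors (B, C, H, W)
        stacked = torch.stack(branch_feats, dim=1)     # (B, S, C, H, W)
        scores = torch.stack([self.score(f) for f in branch_feats], dim=1)
        weights = torch.softmax(scores, dim=1)         # attention across scales
        mixed = (weights * stacked).sum(dim=1)         # (B, C, H, W)
        return mixed * self.gate(mixed)                # soft scale/channel selection
```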
- …