Learning Depth with Convolutional Spatial Propagation Network
Depth prediction is one of the fundamental problems in computer vision. In
this paper, we propose a simple yet effective convolutional spatial propagation
network (CSPN) to learn the affinity matrix for various depth estimation tasks.
Specifically, it is an efficient linear propagation model, in which the
propagation is performed with a manner of recurrent convolutional operation,
and the affinity among neighboring pixels is learned through a deep
convolutional neural network (CNN). This module can be appended to the output of any state-of-the-art (SOTA) depth estimation network to improve its performance. In practice, we further extend CSPN in two respects: 1) we take a
sparse depth map as additional input, which is useful for the task of depth
completion; 2) analogous to the 3D convolution operation commonly used in CNNs, we
propose 3D CSPN to handle features with one additional dimension, which is
effective in the task of stereo matching using a 3D cost volume. For the task of sparse-to-dense conversion, a.k.a. depth completion, we evaluated the proposed CSPN-based algorithms on the popular NYU v2 and KITTI datasets, where we show
that our proposed algorithms not only produce higher-quality results (e.g., a further 30% reduction in depth error), but also run faster (e.g., 2 to 5x faster) than the previous SOTA spatial propagation network. We also evaluated our stereo matching algorithm on the Scene Flow and KITTI Stereo datasets, ranking 1st on
both the KITTI Stereo 2012 and 2015 benchmarks, which demonstrates the
effectiveness of the proposed module. The code of CSPN proposed in this work
will be released at https://github.com/XinJCheng/CSPN.
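The propagation step the abstract describes can be made concrete. Below is a minimal NumPy sketch of one recurrent CSPN iteration for depth completion, assuming the common formulation in which a CNN predicts eight neighbour affinities per pixel, the weights are normalised to sum to one, and observed sparse depths are re-injected after each step; all names and values are illustrative, not from the released code.

```python
import numpy as np

def cspn_step(depth, affinity, k=3):
    """One CSPN propagation step: each pixel becomes an affinity-weighted
    average of its k*k neighbourhood, with the centre weight chosen so the
    normalised weights sum to one. `affinity` has shape (k*k-1, H, W)."""
    H, W = depth.shape
    # Normalise neighbour affinities by the sum of their absolute values.
    norm = np.abs(affinity).sum(axis=0, keepdims=True) + 1e-8
    w = affinity / norm                        # neighbour weights
    w_center = 1.0 - w.sum(axis=0)             # centre weight (stability)
    pad = k // 2
    padded = np.pad(depth, pad, mode="edge")
    out = w_center * depth
    idx = 0
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = padded[pad + dy: pad + dy + H, pad + dx: pad + dx + W]
            out += w[idx] * shifted
            idx += 1
    return out

# Toy usage: random affinities, a few recurrent steps, sparse re-injection.
rng = np.random.default_rng(0)
depth = rng.random((16, 16)).astype(np.float32)
affinity = rng.normal(size=(8, 16, 16)).astype(np.float32)
mask = rng.random(depth.shape) < 0.05          # 5% "observed" sparse depths
sparse = np.where(mask, depth, 0.0)
for _ in range(4):
    depth = cspn_step(depth, affinity)
    depth[mask] = sparse[mask]                 # keep observed depths fixed
```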
RealPoint3D: Point Cloud Generation from a Single Image with Complex Background
3D point cloud generation by the deep neural network from a single image has
been attracting growing attention from researchers. However, recently proposed methods require objects to be captured against relatively clean backgrounds and from fixed viewpoints, which greatly limits their applicability in real environments. To overcome these drawbacks, we propose to integrate
prior 3D shape knowledge into the network to guide the 3D generation. By taking this additional 3D information, the proposed network can generate a 3D object from a single real image captured from any viewpoint and against a complex background. Specifically, given a query image, we retrieve the nearest shape
model from a pre-prepared 3D model database. Then, the image together with the
retrieved shape model is fed into the proposed network to generate the
fine-grained 3D point cloud. The effectiveness of our proposed framework has
been verified on different kinds of datasets. Experimental results show that
the proposed framework achieves state-of-the-art accuracy compared to other
volumetric-based and point set generation methods. Furthermore, the proposed
framework works well for real images in complex backgrounds with various view
angles.
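As a rough illustration of the retrieval stage, the sketch below finds the nearest database shape for a query image by cosine similarity between precomputed embeddings. The abstract does not specify the retrieval metric or feature extractor, so both are assumptions here.

```python
import numpy as np

def retrieve_nearest_shape(query_feat, db_feats):
    """Return the index of the database shape whose (precomputed) embedding
    is closest to the query image embedding, by cosine similarity."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    d = db_feats / (np.linalg.norm(db_feats, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(d @ q))

# Toy usage: 100 database shapes with 128-d embeddings, one query.
rng = np.random.default_rng(1)
db = rng.normal(size=(100, 128))
query = rng.normal(size=128)
nearest = retrieve_nearest_shape(query, db)
# The retrieved point cloud would then be fed, with the image, into the
# generation network as a shape prior.
```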
Detailed Human Shape Estimation from a Single Image by Hierarchical Mesh Deformation
This paper presents a novel framework to recover detailed human body shapes
from a single image. It is a challenging task due to factors such as variations
in human shapes, body poses, and viewpoints. Prior methods typically attempt to
recover the human body shape using a parametric template that lacks surface details; as a result, the recovered body shape appears unclothed. In this paper, we propose a novel learning-based framework that combines the robustness of a parametric model with the flexibility of free-form 3D deformation. We use deep neural networks to refine the 3D shape in a
Hierarchical Mesh Deformation (HMD) framework, utilizing the constraints from
body joints, silhouettes, and per-pixel shading information. We are able to
restore detailed human body shapes beyond skinned models. Experiments
demonstrate that our method outperforms previous state-of-the-art approaches, achieving better accuracy in terms of both 2D IoU and 3D metric distance. The code is available at https://github.com/zhuhao-nju/hmd.git.
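To give a flavour of free-form deformation driven by sparse constraints, here is a toy NumPy sketch that propagates a few handle offsets to all mesh vertices with Gaussian distance weights. This is a crude stand-in for the learned, hierarchical deformation in HMD, not the authors' method; every name and parameter is illustrative.

```python
import numpy as np

def freeform_deform(verts, handles, offsets, sigma=0.1):
    """Propagate sparse per-handle 3D offsets to every mesh vertex using
    Gaussian distance weights: vertices near a handle move almost as much
    as the handle, distant vertices barely move."""
    # verts: (N, 3), handles: (M, 3) handle positions, offsets: (M, 3)
    d2 = ((verts[:, None, :] - handles[None, :, :]) ** 2).sum(-1)  # (N, M)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-8
    return verts + w @ offsets

# Toy usage: deform a random vertex set with three handles.
rng = np.random.default_rng(2)
verts = rng.random((500, 3))
handles = verts[:3]
offsets = rng.normal(scale=0.05, size=(3, 3))
deformed = freeform_deform(verts, handles, offsets)
```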
Safe Navigation with Human Instructions in Complex Scenes
In this paper, we present a robotic navigation algorithm with natural
language interfaces, which enables a robot to safely walk through a changing
environment with moving persons by following human instructions such as "go to
the restaurant and keep away from people". We first classify human instructions
into three types: the goal, the constraints, and uninformative phrases. Next,
we dynamically ground the extracted goal and constraint items during the navigation process, to handle target objects that lie beyond sensor range and the appearance of moving obstacles such as humans. In particular, for a goal phrase (e.g., "go to the restaurant"),
we ground it to a location in a predefined semantic map and treat it as a goal
for a global motion planner, which plans a collision-free path in the workspace
for the robot to follow. For a constraint phrase (e.g., "keep away from
people"), we dynamically add the corresponding constraint into a local planner
by adjusting the values of a local costmap according to the results returned by
the object detection module. The updated costmap is then used to compute a
local collision avoidance control for the safe navigation of the robot. By
combining natural language processing, motion planning, and computer vision, our system successfully follows natural language navigation instructions in both simulated and real-world scenarios. Videos are available at
https://sites.google.com/view/snh
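The costmap adjustment for constraint phrases can be illustrated with a short sketch: for each detected person, the local costmap receives a radial penalty so the planner keeps its distance. The grid resolution, penalty shape, and magnitudes below are assumed values, not taken from the paper.

```python
import numpy as np

def add_person_costs(costmap, people, resolution=0.05, radius=1.0, peak=100.0):
    """Raise local costmap values around detected people so the local
    planner keeps its distance ('keep away from people')."""
    H, W = costmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    for px, py in people:                       # person positions in metres
        d = np.hypot(xs * resolution - px, ys * resolution - py)
        costmap = np.maximum(costmap, peak * np.exp(-(d / radius) ** 2))
    return costmap

# Toy usage: a 10m x 10m costmap with two detected pedestrians.
costmap = np.zeros((200, 200))
costmap = add_person_costs(costmap, people=[(3.0, 4.0), (7.5, 2.0)])
```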
TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents
To safely and efficiently navigate in complex urban traffic, autonomous
vehicles must make responsible predictions in relation to surrounding
traffic-agents (vehicles, bicycles, pedestrians, etc.). A challenging and
critical task is to explore the movement patterns of different traffic-agents
and predict their future trajectories accurately to help the autonomous vehicle make reasonable navigation decisions. To solve this problem, we propose a long short-term memory (LSTM)-based real-time traffic prediction algorithm,
TrafficPredict. Our approach uses an instance layer to learn instances'
movements and interactions and has a category layer to learn the similarities
of instances belonging to the same type to refine the prediction. In order to
evaluate its performance, we collected trajectory datasets in a large city
consisting of varying conditions and traffic densities. The dataset includes
many challenging scenarios where vehicles, bicycles, and pedestrians move among
one another. We evaluate the performance of TrafficPredict on our new dataset
and highlight its higher accuracy for trajectory prediction by comparing with
prior prediction methods.
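A minimal PyTorch sketch of the two-level idea follows: an instance-layer LSTM encodes each agent's own motion, and a category-layer summary (here simply the mean hidden state of same-type agents) refines the per-agent prediction. The real model's category layer is richer; this is only a hedged illustration with invented dimensions.

```python
import torch
import torch.nn as nn

class TwoLayerTrafficModel(nn.Module):
    """Sketch of the instance/category idea: per-agent LSTM encoding plus
    a pooled same-category feature, combined to predict the next offset."""
    def __init__(self, hidden=64):
        super().__init__()
        self.instance_lstm = nn.LSTM(2, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)    # predict next (x, y) offset

    def forward(self, tracks, categories):
        # tracks: (N, T, 2) agent trajectories; categories: (N,) type ids
        _, (h, _) = self.instance_lstm(tracks)
        h = h[-1]                                # (N, hidden)
        cat_feat = torch.zeros_like(h)
        for c in categories.unique():            # category-layer pooling
            m = categories == c
            cat_feat[m] = h[m].mean(dim=0)
        return self.head(torch.cat([h, cat_feat], dim=-1))

# Toy usage: 5 agents (vehicle=0, bicycle=1, pedestrian=2), 8 timesteps.
model = TwoLayerTrafficModel()
pred = model(torch.randn(5, 8, 2), torch.tensor([0, 0, 1, 2, 2]))
```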
The ApolloScape Open Dataset for Autonomous Driving and its Application
Autonomous driving has attracted tremendous attention especially in the past
few years. The key techniques for a self-driving car include solving tasks like
3D map construction, self-localization, parsing the driving road and
understanding objects, which enable vehicles to reason and act. However, large-scale datasets for training and system evaluation remain a bottleneck for
developing robust perception models. In this paper, we present the ApolloScape
dataset [1] and its applications for autonomous driving. Compared with existing
public datasets from real scenes, e.g. KITTI [2] or Cityscapes [3], ApolloScape
contains much larger and richer labelling, including a holistic, semantically dense point cloud for each site, stereo imagery, per-pixel semantic labelling, lane-mark labelling, instance segmentation, 3D car instances, and highly accurate locations for every frame in driving videos from multiple sites, cities, and times of day. For each task, it contains at least 15x more images than SOTA
datasets. To label such a complete dataset, we develop various tools and
algorithms tailored to each task to accelerate the labelling process, such as 3D-2D segment labelling tools and active labelling in videos. Building on ApolloScape, we are able to develop algorithms that jointly consider the learning and inference of multiple tasks. In this paper, we provide a sensor fusion
scheme integrating camera videos, consumer-grade motion sensors (GPS/IMU), and
a 3D semantic map in order to achieve robust self-localization and semantic
segmentation for autonomous driving. We show that, in practice, sensor fusion
and joint learning of multiple tasks are beneficial to achieve a more robust
and accurate system. We expect our dataset and the proposed algorithms to support and motivate researchers toward further development of multi-sensor fusion and multi-task learning in the field of computer vision.
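The GPS/IMU part of such a fusion scheme can be sketched as a toy Kalman filter: dead-reckon with the IMU-integrated displacement, then correct with a noisy GPS fix. The paper's scheme additionally uses camera videos and the 3D semantic map; the sketch below covers only this simplified sub-problem, with made-up noise parameters.

```python
import numpy as np

def fuse_step(x, P, imu_delta, gps_pos, Q=1e-3, R=2.0):
    """One predict/update cycle of a toy 2D Kalman filter: predict with the
    IMU-integrated displacement, then blend in a noisy GPS position fix."""
    # Predict: apply IMU displacement, grow uncertainty.
    x = x + imu_delta
    P = P + Q * np.eye(2)
    # Update: correct with the GPS measurement.
    K = P @ np.linalg.inv(P + R * np.eye(2))    # Kalman gain
    x = x + K @ (gps_pos - x)
    P = (np.eye(2) - K) @ P
    return x, P

# Toy usage: drift-prone IMU steps corrected by a GPS fix every step.
x, P = np.zeros(2), np.eye(2)
for t in range(10):
    x, P = fuse_step(x, P, imu_delta=np.array([1.0, 0.0]),
                     gps_pos=np.array([t + 1.0, 0.1]))
```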
Salient Object Detection in the Deep Learning Era: An In-Depth Survey
As an essential problem in computer vision, salient object detection (SOD)
has attracted an increasing amount of research attention over the years. Recent
advances in SOD are predominantly led by deep learning-based solutions (named
deep SOD). To enable in-depth understanding of deep SOD, in this paper, we
provide a comprehensive survey covering various aspects, ranging from algorithm
taxonomy to unsolved issues. In particular, we first review deep SOD algorithms
from different perspectives, including network architecture, level of
supervision, learning paradigm, and object-/instance-level detection. Following
that, we summarize and analyze existing SOD datasets and evaluation metrics.
Then, we benchmark a large group of representative SOD models, and provide
detailed analyses of the comparison results. Moreover, we study the performance
of SOD algorithms under different attribute settings, which has not been
thoroughly explored previously, by constructing a novel SOD dataset with rich
attribute annotations covering various salient object types, challenging
factors, and scene categories. We further analyze, for the first time in the
field, the robustness of SOD models to random input perturbations and
adversarial attacks. We also look into the generalization and difficulty of
existing SOD datasets. Finally, we discuss several open issues of SOD and
outline future research directions. All the saliency prediction maps, our constructed dataset with annotations, and the evaluation code are publicly available at https://github.com/wenguanwang/SODsurvey.
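For readers unfamiliar with the standard SOD evaluation metrics such surveys benchmark against, here is a small NumPy sketch of two of the most common ones, MAE and the weighted F-measure (beta^2 = 0.3 is the customary setting); the constants and thresholding strategy are the usual conventions, not specific to this survey.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and binary ground truth."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """Weighted F-measure at one binarisation threshold."""
    binary = pred >= thresh
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# Toy usage on a random map against a square ground-truth mask.
rng = np.random.default_rng(3)
pred = rng.random((64, 64))
gt = np.zeros((64, 64), bool)
gt[16:48, 16:48] = True
print(mae(pred, gt.astype(float)), f_measure(pred, gt))
```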
Stereovision on GPU
Depth from stereo has traditionally been, and continues to be, one of the most actively researched topics in computer vision. Recent developments in this area have significantly advanced the state of the art in terms of quality. However, in terms of speed, these best stere
Getting Robots Unfrozen and Unlost in Dense Pedestrian Crowds
We aim to enable a mobile robot to navigate through environments with dense
crowds, e.g., shopping malls, canteens, train stations, or airport terminals.
In these challenging environments, existing approaches suffer from two common
problems: the robot may get frozen and cannot make any progress toward its
goal, or it may get lost due to severe occlusions inside a crowd. Here we
propose a navigation framework that handles the robot-freezing and navigation-lost problems simultaneously. First, we enhance the robot's mobility and unfreeze the robot in the crowd using a reinforcement-learning-based local navigation policy developed in our previous work [long2017towards], which naturally takes into account the coordination between the robot and humans. Second, the robot takes advantage of its excellent local mobility to recover
from its localization failure. In particular, it dynamically chooses to
approach a set of recovery positions with rich features. To the best of our
knowledge, our method is the first approach that simultaneously solves the
robot-freezing and navigation-lost problems in dense crowds. We evaluate
our method in both simulated and real-world environments and demonstrate that
it outperforms the state-of-the-art approaches. Videos are available at
https://sites.google.com/view/rlslam
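The recovery-position choice can be illustrated with a toy scoring rule: prefer candidates with many mapped features, discounted by travel distance. The trade-off weight and scoring form below are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def pick_recovery_position(robot_pos, candidates, feature_counts, alpha=0.5):
    """Score candidate recovery positions by feature richness minus a
    travel-distance penalty, and return the best one. The trade-off weight
    `alpha` is an illustrative choice, not from the paper."""
    candidates = np.asarray(candidates, float)
    dist = np.linalg.norm(candidates - robot_pos, axis=1)
    scores = np.asarray(feature_counts, float) - alpha * dist
    return candidates[int(np.argmax(scores))]

# Toy usage: three candidate positions with mapped feature counts.
best = pick_recovery_position(
    robot_pos=np.array([0.0, 0.0]),
    candidates=[(2.0, 1.0), (5.0, 5.0), (1.0, -2.0)],
    feature_counts=[120, 300, 80])
```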
Human Pose Estimation with Spatial Contextual Information
We explore the importance of spatial contextual information in human pose
estimation. Most state-of-the-art pose networks are trained in a multi-stage
manner and produce several auxiliary predictions for deep supervision. Following this principle, we present two conceptually simple yet computationally efficient modules, namely Cascade Prediction Fusion (CPF) and Pose Graph Neural
Network (PGNN), to exploit underlying contextual information. Cascade
prediction fusion accumulates prediction maps from previous stages to extract
informative signals. The resulting maps also function as a prior to guide
prediction at subsequent stages. To promote spatial correlation among joints,
our PGNN learns a structured representation of human pose as a graph. Direct
message passing between different joints is enabled and spatial relations are captured. These two modules add very little computational overhead. Experimental results demonstrate that our method consistently outperforms previous methods on the MPII and LSP benchmarks.
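A single round of message passing on the pose graph might look like the NumPy sketch below: each joint's feature vector is updated with a normalised, linearly transformed sum of its skeleton neighbours' features. The update rule and dimensions are generic graph-network conventions, assumed for illustration rather than taken from the PGNN paper.

```python
import numpy as np

def pose_graph_step(feats, adjacency, weight):
    """One round of message passing on the pose graph: each joint's feature
    is updated with a degree-normalised, linearly transformed sum of its
    skeleton neighbours' features, plus a residual connection and ReLU."""
    # feats: (J, D) per-joint features; adjacency: (J, J) 0/1 skeleton edges
    deg = adjacency.sum(axis=1, keepdims=True) + 1e-8
    messages = (adjacency / deg) @ feats @ weight
    return np.maximum(feats + messages, 0.0)

# Toy usage: a 4-joint chain (e.g., shoulder-elbow-wrist-hand).
J, D = 4, 8
adjacency = np.zeros((J, J))
for a, b in [(0, 1), (1, 2), (2, 3)]:
    adjacency[a, b] = adjacency[b, a] = 1
rng = np.random.default_rng(4)
feats = pose_graph_step(rng.random((J, D)), adjacency,
                        rng.normal(size=(D, D)) * 0.1)
```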