2,171 research outputs found
The Cityscapes Dataset for Semantic Urban Scene Understanding
Visual understanding of complex urban street scenes is an enabling factor for
a wide range of applications. Object detection has benefited enormously from
large-scale datasets, especially in the context of deep learning. For semantic
urban scene understanding, however, no current dataset adequately captures the
complexity of real-world urban scenes.
To address this, we introduce Cityscapes, a benchmark suite and large-scale
dataset to train and test approaches for pixel-level and instance-level
semantic labeling. Cityscapes is comprised of a large, diverse set of stereo
video sequences recorded in streets from 50 different cities. 5000 of these
images have high quality pixel-level annotations; 20000 additional images have
coarse annotations to enable methods that leverage large volumes of
weakly-labeled data. Crucially, our effort exceeds previous attempts in terms
of dataset size, annotation richness, scene variability, and complexity. Our
accompanying empirical study provides an in-depth analysis of the dataset
characteristics, as well as a performance evaluation of several
state-of-the-art approaches based on our benchmark.Comment: Includes supplemental materia
DeepSignals: Predicting Intent of Drivers Through Visual Signals
Detecting the intention of drivers is an essential task in self-driving,
necessary to anticipate sudden events like lane changes and stops. Turn signals
and emergency flashers communicate such intentions, providing seconds of
potentially critical reaction time. In this paper, we propose to detect these
signals in video sequences by using a deep neural network that reasons about
both spatial and temporal information. Our experiments on more than a million
frames show high per-frame accuracy in very challenging scenarios.Comment: To be presented at the IEEE International Conference on Robotics and
Automation (ICRA), 201
Joint Optical Flow and Temporally Consistent Semantic Segmentation
The importance and demands of visual scene understanding have been steadily
increasing along with the active development of autonomous systems.
Consequently, there has been a large amount of research dedicated to semantic
segmentation and dense motion estimation. In this paper, we propose a method
for jointly estimating optical flow and temporally consistent semantic
segmentation, which closely connects these two problem domains and leverages
each other. Semantic segmentation provides information on plausible physical
motion to its associated pixels, and accurate pixel-level temporal
correspondences enhance the accuracy of semantic segmentation in the temporal
domain. We demonstrate the benefits of our approach on the KITTI benchmark,
where we observe performance gains for flow and segmentation. We achieve
state-of-the-art optical flow results, and outperform all published algorithms
by a large margin on challenging, but crucial dynamic objects.Comment: 14 pages, Accepted for CVRSUAD workshop at ECCV 201
Geometry-Based Next Frame Prediction from Monocular Video
We consider the problem of next frame prediction from video input. A
recurrent convolutional neural network is trained to predict depth from
monocular video input, which, along with the current video image and the camera
trajectory, can then be used to compute the next frame. Unlike prior next-frame
prediction approaches, we take advantage of the scene geometry and use the
predicted depth for generating the next frame prediction. Our approach can
produce rich next frame predictions which include depth information attached to
each pixel. Another novel aspect of our approach is that it predicts depth from
a sequence of images (e.g. in a video), rather than from a single still image.
We evaluate the proposed approach on the KITTI dataset, a standard dataset for
benchmarking tasks relevant to autonomous driving. The proposed method produces
results which are visually and numerically superior to existing methods that
directly predict the next frame. We show that the accuracy of depth prediction
improves as more prior frames are considered.Comment: To appear in 2017 IEEE Intelligent Vehicles Symposiu
Layered Interpretation of Street View Images
We propose a layered street view model to encode both depth and semantic
information on street view images for autonomous driving. Recently, stixels,
stix-mantics, and tiered scene labeling methods have been proposed to model
street view images. We propose a 4-layer street view model, a compact
representation over the recently proposed stix-mantics model. Our layers encode
semantic classes like ground, pedestrians, vehicles, buildings, and sky in
addition to the depths. The only input to our algorithm is a pair of stereo
images. We use a deep neural network to extract the appearance features for
semantic classes. We use a simple and an efficient inference algorithm to
jointly estimate both semantic classes and layered depth values. Our method
outperforms other competing approaches in Daimler urban scene segmentation
dataset. Our algorithm is massively parallelizable, allowing a GPU
implementation with a processing speed about 9 fps.Comment: The paper will be presented in the 2015 Robotics: Science and Systems
Conference (RSS
- …