27 research outputs found

    Slanted Stixels: A way to represent steep streets

    This work presents and evaluates a novel compact scene representation based on Stixels that infers both geometric and semantic information. Our approach overcomes the previously rather restrictive geometric assumptions of Stixels by introducing a novel depth model that accounts for non-flat roads and slanted objects. Semantic and depth cues are used jointly to infer the scene representation in a sound global energy minimization formulation. Furthermore, a novel approximation scheme is introduced that significantly reduces the computational complexity of the Stixel algorithm, achieving real-time computation. The idea is to first perform an over-segmentation of the image, discard the unlikely Stixel cuts, and apply the algorithm only to the remaining ones. This work presents a novel over-segmentation strategy based on a Fully Convolutional Network (FCN), which outperforms an approach based on local extrema of the disparity map. We evaluate the proposed methods in terms of semantic and geometric accuracy as well as run-time on four publicly available benchmark datasets. Our approach maintains accuracy on flat-road scene datasets while improving substantially on a novel non-flat-road dataset. Comment: Journal preprint (published in IJCV 2019: https://link.springer.com/article/10.1007/s11263-019-01226-9). arXiv admin note: text overlap with arXiv:1707.0539
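The cut-restricted dynamic programming idea from the abstract can be sketched in a few lines. This is a hedged toy, not the paper's actual energy: it segments a single image column into Stixels, but only at a pre-selected set of candidate cuts (the over-segmentation step). The constant-disparity cost is an illustrative placeholder; the slanted model in the paper would fit a line per segment instead of a constant.

```python
def const_model_cost(disp, a, b, penalty=1.0):
    """Squared error of a constant-disparity fit over rows [a, b),
    plus a fixed per-segment penalty discouraging over-segmentation."""
    seg = disp[a:b]
    mean = sum(seg) / len(seg)
    return sum((d - mean) ** 2 for d in seg) + penalty

def segment_column(disp, cuts, seg_cost=const_model_cost):
    """Choose a subset of candidate `cuts` (sorted row indices, containing
    0 and len(disp)) that minimises the summed per-segment cost."""
    best = [(0.0, -1)] + [(float("inf"), -1)] * (len(cuts) - 1)
    for j in range(1, len(cuts)):
        for i in range(j):
            c = best[i][0] + seg_cost(disp, cuts[i], cuts[j])
            if c < best[j][0]:
                best[j] = (c, i)
    segments, j = [], len(cuts) - 1
    while j > 0:                       # backtrack through predecessors
        i = best[j][1]
        segments.append((cuts[i], cuts[j]))
        j = i
    return segments[::-1]

# Two constant-depth regions; only cuts {0, 3, 5, 7, 10} are considered,
# so the DP runs over 5 candidate cuts instead of all 11 row boundaries.
print(segment_column([5.0] * 5 + [1.0] * 5, [0, 3, 5, 7, 10]))
# -> [(0, 5), (5, 10)]
```

Discarding unlikely cuts up front shrinks the quadratic DP from the number of image rows to the (much smaller) number of surviving candidates, which is where the real-time gain comes from.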

    Effects of Ground Manifold Modeling on the Accuracy of Stixel Calculations

    This paper highlights the role of ground-manifold modeling in stixel calculations; stixels are medium-level data representations used in the development of computer vision modules for self-driving cars. When single-disparity maps and simplified ground-manifold models are used, the calculated stixels may suffer from noise, inconsistency, and increased false-detection rates for obstacles, especially on challenging datasets. Stixel calculations can be improved in accuracy and robustness by using more adaptive ground-manifold approximations. A comparative study of stixel results obtained for different ground-manifold models (e.g., plane fitting, line fitting in v-disparity space, polynomial approximation, and graph cut) forms the main part of this paper. The paper also considers the use of trinocular stereo vision and shows that it offers options for enhancing stixel results compared with binocular recording. Comprehensive experiments are performed on two publicly available challenging datasets. We also use a novel way of comparing calculated stixels with ground truth: we compare the depth information given by extracted stixels with ground-truth depth provided by a highly accurate LiDAR range sensor (as available in one of the public datasets). We evaluate the accuracy of four different ground-manifold methods. The experimental results also include quantitative evaluations of the trade-off between accuracy and run time. As a result, the proposed trinocular recording together with graph-cut estimation of ground manifolds appears to be the recommended configuration, also under challenging weather and lighting conditions.
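The simplest of the compared ground-manifold models, line fitting in v-disparity space, can be sketched as follows. A flat road projects to a straight line d = a·v + b in v-disparity, so this toy least-squares fits the per-row median disparity; the more adaptive models in the paper (polynomials, graph cuts) relax exactly this flat-road assumption. Function name and interface are illustrative, not the authors' code.

```python
def fit_ground_line(disp_map):
    """disp_map: rows of disparity values (0 = invalid pixel). Returns
    (a, b) of the least-squares line d = a*v + b through each image row's
    median disparity -- the flat-road model in v-disparity space."""
    samples = []
    for v, row in enumerate(disp_map):
        valid = sorted(d for d in row if d > 0)
        if valid:
            samples.append((v, valid[len(valid) // 2]))  # robust per-row value
    n = len(samples)
    sv = sum(v for v, _ in samples)
    sd = sum(d for _, d in samples)
    svv = sum(v * v for v, _ in samples)
    svd = sum(v * d for v, d in samples)
    a = (n * svd - sv * sd) / (n * svv - sv * sv)
    b = (sd - a * sv) / n
    return a, b

# Synthetic flat road: disparity grows linearly with image row v.
a, b = fit_ground_line([[0.5 * v + 2.0] * 4 for v in range(10)])
# a is close to 0.5 and b is close to 2.0
```

On a non-flat road the per-row medians no longer lie on a line, and the residual of this fit is one way to see why the polynomial and graph-cut models pay off.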

    Combining Appearance, Depth and Motion for Efficient Semantic Scene Understanding

    Computer vision plays a central role in autonomous vehicle technology, because cameras are comparatively cheap and capture rich information about the environment. In particular, object classes, i.e. whether a certain object is a pedestrian, cyclist, or vehicle, can be extracted very well from image data. Environment perception in urban city centers is a highly challenging computer vision problem, as the environment is very complex and cluttered: road boundaries and markings, traffic signs and lights, and many different kinds of objects that can mutually occlude each other need to be detected in real-time. Existing automotive vision systems do not easily scale to these requirements, because every problem or object class is treated independently. Scene labeling, on the other hand, which assigns object class information to every pixel in the image, is the most promising approach to avoid this overhead by sharing extracted features across multiple classes. Compared to bounding box detectors, scene labeling additionally provides richer and denser information about the environment. However, most existing scene labeling methods require a large amount of computational resources, which makes them infeasible for real-time in-vehicle applications. In addition, in terms of bandwidth, a dense pixel-level representation is not ideal for transmitting the perceived environment to other modules of an autonomous vehicle, such as localization or path planning. This dissertation addresses the scene labeling problem in an automotive context by constructing a scene labeling concept around the "Stixel World" model of Pfeiffer (2011), which compresses dense information about the environment into a set of small "sticks" that stand upright, perpendicular to the ground plane. This work provides the first extension of the existing Stixel formulation that takes into account learned dense pixel-level appearance features.
In a second step, Stixels are used as primitive scene elements to build a highly efficient region-level labeling scheme. The last part of this dissertation proposes a model that combines both pixel-level and region-level scene labeling into a single model that yields state-of-the-art or better labeling accuracy and can be executed in real-time at typical camera refresh rates. This work further investigates how existing depth information, e.g. from a stereo camera, can help to improve labeling accuracy and reduce runtime.
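The bandwidth argument above can be made concrete with a toy: run-length encoding one image column of per-pixel class labels into (top, bottom, class) "sticks". Real Stixels also carry depth and are inferred jointly with an energy model, but the compression effect is already visible here: ten pixel labels become three triples.

```python
def column_to_stixels(labels):
    """Run-length encode one image column of class labels into
    (top_row, bottom_row_exclusive, class) triples -- a label-only toy
    view of how Stixels compress dense output for transmission."""
    stixels, top = [], 0
    for row in range(1, len(labels) + 1):
        # close the current stick at the end of the column or on a label change
        if row == len(labels) or labels[row] != labels[top]:
            stixels.append((top, row, labels[top]))
            top = row
    return stixels

print(column_to_stixels(["sky"] * 3 + ["car"] * 2 + ["road"] * 5))
# -> [(0, 3, 'sky'), (3, 5, 'car'), (5, 10, 'road')]
```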

    Understanding Cityscapes: Efficient Urban Semantic Scene Understanding

    Semantic scene understanding plays a prominent role in the environment perception of autonomous vehicles. The car needs to be aware of the semantics of its surroundings. In particular, it needs to sense other vehicles, bicycles, or pedestrians in order to predict their behavior. Knowledge of the drivable space is required for safe navigation, and landmarks, such as poles, or static infrastructure, such as buildings, form the basis for precise localization. In this work, we focus on visual scene understanding, since cameras offer great potential for perceiving semantics while being comparatively cheap; we also focus on urban scenarios, as fully autonomous vehicles are expected to appear first in inner-city traffic. However, this task also comes with significant challenges. While images are rich in information, the semantics are not readily available and need to be extracted by means of computer vision, typically via machine learning methods. Furthermore, modern cameras have high-resolution sensors, as needed for long sensing ranges. As a consequence, large amounts of data need to be processed, while the processing simultaneously requires real-time speeds with low latency. In addition, the resulting semantic environment representation needs to be compressed to allow for fast transmission and downstream processing. Additional challenges for the perception system arise from the scene type, as urban scenes are typically highly cluttered, containing many objects at various scales that are often significantly occluded. In this dissertation, we address efficient urban semantic scene understanding for autonomous driving from three major perspectives. First, we start with an analysis of the potential of exploiting multiple input modalities, such as depth, motion, or object detectors, for semantic labeling, as these cues are typically available in autonomous vehicles.
Our goal is to integrate such data holistically throughout all processing stages, and we show that our system outperforms comparable baseline methods, which confirms the value of multiple input modalities. Second, we aim to leverage modern deep learning methods, which require large amounts of supervised training data, for street scene understanding. Therefore, we introduce Cityscapes, the first large-scale dataset and benchmark for urban scene understanding in terms of pixel- and instance-level semantic labeling. Based on this work, we compare various deep learning methods in terms of their performance on inner-city scenarios facing the challenges introduced above. Leveraging these insights, we combine suitable methods to obtain a real-time capable neural network for pixel-level semantic labeling with high classification accuracy. Third, we combine our previous results and aim for an integration of depth data from stereo vision and semantic information from deep learning methods by means of the Stixel World (Pfeiffer and Franke, 2011). To this end, we reformulate the Stixel World as a graphical model that provides a clear formalism, based on which we extend the formulation to multiple input modalities. We obtain a compact representation of the environment at real-time speeds that carries semantic as well as 3D information.
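A minimal sketch of the fusion direction described above: a Stixel spanning rows [top, bottom) of one column takes the class whose per-pixel semantic scores sum highest inside its extent. This is a simple stand-in for the semantic term of the joint graphical-model formulation, not the actual model; names and the score format are illustrative.

```python
def stixel_class(column_scores, top, bottom):
    """column_scores: one dict {class: score} per image row of a column.
    Returns the class maximising the summed score over rows [top, bottom),
    i.e. a region-level label aggregated from pixel-level evidence."""
    totals = {}
    for row in range(top, bottom):
        for cls, score in column_scores[row].items():
            totals[cls] = totals.get(cls, 0.0) + score
    return max(totals, key=totals.get)

# Three confident "road" rows outvote one "car"-leaning row.
scores = [{"road": 0.9, "car": 0.1}] * 3 + [{"road": 0.2, "car": 0.8}]
print(stixel_class(scores, 0, 4))   # -> road
```

Aggregating over a region this way is also what makes the representation robust to per-pixel noise while staying cheap to transmit.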

    Translating Images into Maps

    Full text link
    We approach instantaneous mapping, converting images to a top-down view of the world, as a translation problem. We show how a novel form of transformer network can be used to map from images and video directly to an overhead map or bird's-eye-view (BEV) of the world, in a single end-to-end network. We assume a 1-1 correspondence between a vertical scanline in the image and rays passing through the camera location in an overhead map. This lets us formulate map generation from an image as a set of sequence-to-sequence translations. Posing the problem as translation allows the network to use the context of the image when interpreting the role of each pixel. This constrained formulation, based upon a strong physical grounding of the problem, leads to a restricted transformer network that is convolutional in the horizontal direction only. The structure allows us to make efficient use of data when training, and to obtain state-of-the-art results for instantaneous mapping on three large-scale datasets, including a 15% and 30% relative gain against existing best performing methods on the nuScenes and Argoverse datasets, respectively. We make our code available at https://github.com/avishkarsaha/translating-images-into-maps. Comment: Accepted to ICRA 202
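The scanline-to-ray correspondence the translation formulation relies on can be sketched with basic pinhole geometry: every vertical image column maps to one ground-plane ray through the camera, so translating a column into a BEV ray is a 1-D sequence-to-sequence problem. This is an illustrative geometry helper under an undistorted pinhole assumption; parameter names (cx = principal point, focal length in pixels) are ours, not the paper's API.

```python
import math

def column_to_bev_ray(u, cx, focal, max_range, step):
    """Ground-plane samples (x forward, y lateral) along the ray that
    image column u projects to under a pinhole camera at the origin."""
    angle = math.atan2(u - cx, focal)   # lateral angle of the column's ray
    n = int(max_range / step)
    return [(r * step * math.cos(angle), r * step * math.sin(angle))
            for r in range(1, n + 1)]

# The centre column (u == cx) looks straight ahead, so y stays 0.
print(column_to_bev_ray(320, 320, 500.0, 3.0, 1.0))
# -> [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
```

Because this correspondence is fixed by geometry, the transformer only needs attention along each column and can stay convolutional across columns, which is the restriction the abstract describes.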