Real-Time Dense Stereo Matching With ELAS on FPGA Accelerated Embedded Devices
For many applications in low-power real-time robotics, stereo cameras are the
sensors of choice for depth perception as they are typically cheaper and more
versatile than their active counterparts. Their biggest drawback, however, is
that they do not directly sense depth maps; instead, these must be estimated
through data-intensive processes. Therefore, appropriate algorithm selection
plays an important role in achieving the desired performance characteristics.
Motivated by applications in space and mobile robotics, we implement and
evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering
one of the best trade-offs between efficiency and accuracy, ELAS has only been
shown to run at 1.5-3 fps on a high-end CPU. Our system preserves all
intriguing properties of the original algorithm, such as the slanted plane
priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of
power. Unlike previous FPGA-based designs, we take advantage of both components
on the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate
more complex and computationally diverse algorithms for such low-power,
real-time systems.
Comment: 8 pages, 7 figures, 2 tables
Optimizing Harris Corner Detection on GPGPUs Using CUDA
The objective of this thesis is to optimize the Harris corner detection algorithm implementation on NVIDIA GPGPUs using the CUDA software platform and to measure the performance benefit. The Harris corner detection algorithm, developed by C. Harris and M. Stephens, finds well-defined corner points within an image. Corner detection is computationally intensive, so real-time performance is difficult to achieve with a sequential software implementation. This thesis decomposes the Harris corner detection algorithm into a set of parallel stages, each of which is implemented and optimized on the CUDA platform. The performance results show that, by applying strategic CUDA optimizations to the Harris corner detection implementation, real-time performance is feasible. The optimized CUDA implementation showed significant speedup over implementations on several other platforms: standard C, MATLAB, and OpenCV. It was then applied to a feature-matching computer vision system, which also showed significant speedup over the other platforms.
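The stages of Harris corner detection mentioned above can be sketched sequentially as follows. This is a minimal pure-Python illustration, not the thesis's CUDA implementation: the function name `harris_response`, the central-difference gradients, the unweighted box window, and the default `k = 0.04` are all assumptions chosen for brevity. Each stage is independent per pixel, which is what makes the algorithm a natural fit for a GPU.

```python
def harris_response(img, k=0.04, radius=1):
    """Return the Harris response map for a 2D list of floats (hypothetical helper)."""
    h, w = len(img), len(img[0])

    # Stage 1: image gradients by central differences (left at zero on the border).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    # Stages 2 and 3: sum the gradient products over a square window, then
    # compute R = det(M) - k * trace(M)^2 from the structure tensor M.
    resp = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            sxx = syy = sxy = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    gx = ix[y + dy][x + dx]
                    gy = iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            resp[y][x] = det - k * trace * trace
    return resp


# Usage: a bright square on a dark background.
image = [[1.0 if 4 <= y <= 11 and 4 <= x <= 11 else 0.0 for x in range(16)]
         for y in range(16)]
response = harris_response(image)
print("corner:", response[4][4], "edge:", response[4][8], "flat:", response[8][8])
```

In this sketch a corner yields a positive response, a straight edge a negative one, and a flat region zero, so thresholding the map followed by non-maximum suppression would give the corner points.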
Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO
The recent introduction of powerful embedded graphics processing units (GPUs)
has allowed for unforeseen improvements in real-time computer vision
applications. It has enabled algorithms to run onboard, well above the standard
video rates, yielding not only higher information processing capability, but
also reduced latency. This work focuses on the applicability of efficient
low-level, GPU hardware-specific instructions to improve on existing computer
vision algorithms in the field of visual-inertial odometry (VIO). While most
steps of a VIO pipeline work on visual features, they rely on image data for
detection and tracking, both of which are well suited for
parallelization. Especially non-maxima suppression and the subsequent feature
selection are prominent contributors to the overall image processing latency.
Our work first revisits the problem of non-maxima suppression for feature
detection specifically on GPUs, and proposes a solution that selects local
response maxima, imposes spatial feature distribution, and extracts features
simultaneously. Our second contribution introduces an enhanced FAST feature
detector that applies the aforementioned non-maxima suppression method.
Finally, we compare our method to other state-of-the-art CPU and GPU
implementations and consistently outperform all of them in feature tracking and
detection, resulting in over 1000 fps throughput on an embedded Jetson TX2
platform. Additionally, we demonstrate our work integrated in a VIO pipeline
achieving a metric state estimation at ~200 fps.
Comment: IEEE International Conference on Intelligent Robots and Systems
(IROS), 2020. Open-source implementation available at
https://github.com/uzh-rpg/vili
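The idea of selecting local response maxima while also enforcing a spatial distribution of features can be illustrated with a small sequential sketch. This is not the paper's GPU method: the function name `grid_nms`, the 3x3 neighbourhood, and the fixed square cell size are assumptions made for illustration. The sketch keeps at most one feature per grid cell, namely the strongest pixel that is also a local maximum.

```python
def grid_nms(resp, cell=4, threshold=0.0):
    """Keep the strongest 3x3 local maximum in each cell x cell block (hypothetical helper).

    resp is a 2D list of response values; returns a sorted list of (y, x, response).
    """
    h, w = len(resp), len(resp[0])
    best = {}  # (cell_row, cell_col) -> (response, y, x)
    for y in range(h):
        for x in range(w):
            r = resp[y][x]
            if r <= threshold:
                continue
            # Reject the pixel if any 3x3 neighbour has a strictly larger response.
            is_max = True
            for ny in range(max(0, y - 1), min(h, y + 2)):
                for nx in range(max(0, x - 1), min(w, x + 2)):
                    if (ny, nx) != (y, x) and resp[ny][nx] > r:
                        is_max = False
            if not is_max:
                continue
            # Spatial distribution: only one winner per grid cell.
            key = (y // cell, x // cell)
            if key not in best or r > best[key][0]:
                best[key] = (r, y, x)
    return sorted((y, x, r) for r, y, x in best.values())


# Usage: three local peaks, two of which fall in the same 4x4 cell.
scores = [[0.0] * 8 for _ in range(8)]
scores[1][1], scores[1][2], scores[2][3], scores[6][6] = 5.0, 3.0, 4.0, 2.0
print(grid_nms(scores))
```

On a GPU, each cell could plausibly be handled by its own group of threads, since cells do not depend on one another once the local-maximum check has been made.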
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well known in image processing, much more work
remains to be done to fully exploit GPUs as an alternative computation engine.
This paper investigates computation-to-core mapping strategies to probe the
efficiency and scalability of the robust facet image modeling algorithm on
GPUs. Our fine-grained computation-to-core mapping scheme shows a significant
performance gain over the standard pixel-wise mapping scheme. With in-depth
performance comparisons across the two mapping schemes, we analyze the impact
of the level of parallelism on the GPU computation and suggest two principles
for optimizing future image processing applications on the GPU platform.
Massively Parallel Algorithms for Point Cloud Based Object Recognition on Heterogeneous Architecture
With the advent of new commodity depth sensors, point cloud data processing plays an increasingly important role in object recognition and perception. However, the computational cost of point cloud data processing is extremely high due to the large data size, high dimensionality, and algorithmic complexity. To address the computational challenges of real-time processing, this work investigates the possibilities of using modern heterogeneous computing platforms and their supporting ecosystem, such as massively parallel architecture (MPA), computing clusters, compute unified device architecture (CUDA), and multithreaded programming, to accelerate point cloud based object recognition. These computing platforms do not yield high performance unless their specific features are properly utilized; otherwise, the result can actually be inferior performance. To achieve high-speed image descriptor computing, indexing, and matching in point cloud based object recognition, this work explores both coarse- and fine-grained parallelism, identifies the acceptable levels of algorithmic approximation, and analyzes various factors that affect performance. A set of heterogeneous parallel algorithms is designed and implemented in this work. These algorithms include exact and approximate scalable massively parallel image descriptors for descriptor computing; parallel construction of the k-dimensional tree (KD-tree) and the forest of KD-trees for descriptor indexing; and parallel approximate nearest neighbor search (ANNS) and buffered ANNS (BANNS) on the KD-tree and the forest of KD-trees for descriptor matching. The results show that the proposed massively parallel algorithms on heterogeneous computing platforms can significantly improve the execution time of feature computing, indexing, and matching.
Meanwhile, this work demonstrates that heterogeneous computing architectures, with appropriate architecture-specific algorithm design and optimization, have distinct advantages in improving the performance of multimedia applications.
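For readers unfamiliar with KD-tree indexing and matching, the following is a minimal sequential sketch of building a KD-tree and running a nearest neighbor query. It is an exact, single-threaded search and not the parallel ANNS or BANNS algorithms described above; the names `build_kdtree`, `sq_dist`, and `nearest`, and the dictionary-based node layout, are assumptions made for illustration.

```python
def build_kdtree(points, depth=0):
    """Recursively build a KD-tree from a list of equal-length tuples."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                    # median point becomes the node
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared_distance, point) of the nearest neighbor of query."""
    if node is None:
        return best
    d = sq_dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


# Usage: index a few 2D points and query one location.
descriptors = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(descriptors)
print(nearest(tree, (9, 2)))
```

An approximate search would typically cap the number of far-side branches visited, trading some accuracy for speed; a forest of trees answers the same query on several trees and keeps the best result.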