Hardware-accelerated data decoding and reconstruction for automotive LiDAR sensors
The automotive industry is facing an unprecedented
technological transformation towards fully autonomous vehicles.
Optimists predict that, by 2030, cars will be sufficiently reliable,
affordable, and common to displace most current human driving
tasks. To cope with these trends, autonomous vehicles require
reliable perception systems to hear and see all their surroundings,
with light detection and ranging (LiDAR) sensors being a key instrument for recreating a 3D visualization of the world. However,
for reliable operation, such systems require LiDAR sensors to
provide high-resolution 3D representations of the car's vicinity,
which results in millions of data points to be processed in
real time. In this article we propose ALFA-Pi, a data
packet decoder and reconstruction system fully deployed on
an embedded reconfigurable hardware platform. By resorting
to field-programmable gate array (FPGA) technology, ALFA-Pi is able to interface with different LiDAR sensors at the same
time, while providing custom representation outputs to high-level
perception systems. By accelerating the LiDAR interface, the
proposed system outperforms current software-only approaches,
achieving lower latency in the data acquisition and data decoding
tasks while reaching high performance ratios.
DPTC -- an FPGA-based trace compression
Recording of flash-ADC traces is challenging from both the transmission
bandwidth and storage cost perspectives. This paper presents a
configuration-free lossless compression algorithm which addresses both
limitations, by compressing the data on-the-fly in the controlling
field-programmable gate array (FPGA). Thus the difference predicted trace
compression (DPTC) can easily be used directly in front-end electronics. The
method first computes the differences between consecutive samples in the
traces, thereby concentrating the most probable values around zero. The values
are then stored as groups of four, with only the necessary least-significant
bits in a variable-length code, packed in a stream of 32-bit words. To evaluate
the efficiency, the storage cost of compressed traces is modeled as a baseline
cost including the ADC noise, and a cost for pulses that depends on their
amplitude and width. The free parameters and the validity of the model are
determined by comparing it with the results of compressing a large set of
artificial traces with varying characteristics. The compression method was also
applied to actual data from different types of detectors, thereby demonstrating
its general applicability. The compression efficiency is found to be comparable
to that of popular general-purpose compression methods, while the algorithm remains
suitable for FPGA implementation using limited resources. A typical storage cost is around 4 to 5
bits per sample. Code for DPTC, both the FPGA implementation in VHDL and the CPU
decompression routine in C, is available as open-source software, with both
operating at multi-100 Msamples/s speeds.
ReS2tAC -- UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and CUDA Devices
With the emergence of low-cost robotic systems, such as unmanned aerial
vehicles, the importance of embedded high-performance image processing has
increased. For a long time, FPGAs were the only processing hardware
capable of high-performance computing while at the same time preserving the low
power consumption essential for embedded systems. However, the recently
increasing availability of embedded GPU-based systems, such as the NVIDIA
Jetson series, comprising an ARM CPU and an NVIDIA Tegra GPU, allows for
massively parallel embedded computing on graphics hardware. With this in mind,
we propose an approach for real-time embedded stereo processing on ARM and
CUDA-enabled devices, which is based on the popular and widely used Semi-Global
Matching algorithm. In this, we optimize the algorithm for
embedded CUDA GPUs by using massively parallel computing, and use
NEON intrinsics to optimize the algorithm for vectorized SIMD processing on
embedded ARM CPUs. We have evaluated our approach with different configurations
on two public stereo benchmark datasets to demonstrate that it can reach an
error rate as low as 3.3%. Furthermore, our experiments show that the fastest
configuration of our approach reaches up to 46 FPS at VGA image resolution.
Finally, in a use-case-specific qualitative evaluation, we have evaluated the
power consumption of our approach and deployed it on the DJI Manifold 2-G
attached to a DJI Matrice 210v2 RTK unmanned aerial vehicle (UAV), demonstrating
its suitability for real-time stereo processing onboard a UAV.
Embedded System Optimization of Radar Post-processing in an ARM CPU Core
Algorithms executed on the radar processor system contribute to a significant performance bottleneck of the overall radar system. One key performance concern is
the latency in target detection when dealing with hard-deadline systems. Research has shown software optimization to be one major contributor to radar system performance
improvements. This thesis aims at software optimization using manual and automatic approaches and analyzes the results to enable informed future decisions
when working with an ARM processor system. In order to ascertain an optimized implementation, a question put forward was whether the algorithms on the ARM
processor could work with a 6-antenna implementation without a decline in performance. An answer would also help project how many additional
algorithms can still be added without performance decline.
The manual optimization was based on a quantitative analysis of the software execution time. It applied a vectorization
strategy using the NEON vector registers of the ARM CPU to reimplement the initial Constant False Alarm Rate (CFAR) detection algorithm. An additional
optimization was the elimination of redundant loops while iterating through the range gates and Doppler filters. In order to determine the best compiler for automatic
code optimization of the radar algorithms on the ARM processor, the GCC and Clang compilers were used to compile both the initial algorithms and the optimized
implementation of the radar post-processing stage.
Analysis of the optimization results showed that it is possible to run the radar post-processing algorithms on the ARM processor with the 6-antenna implementation
without stressing the system load. In addition, the results show an excellent headroom margin for the defined scenario. The analysis further revealed that the
effect of dynamic memory allocation cannot be underrated in situations where performance is a significant concern. The results also demonstrated
that the GCC and Clang compilers each have their strengths and weaknesses. One limiting factor to note for the NEON-based optimization is the
effect of the sample size on the implementation: although it fits the test samples used in the defined scenario, varying window cell sizes might
yield varying results that do not necessarily improve the time constraints.