A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices for each
building block separately. Reported optimizations range from up to 10'000x
memory savings to 33x energy reductions, providing chip designers with an overview
of design choices for implementing efficient low power neural network
accelerators.
Optimizing SIMD execution in HW/SW co-designed processors
SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy-efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge since their inception. Compilers generate vector code conservatively to ensure correctness. As a result, they miss significant vectorization opportunities and fail to extract the maximum benefit from SIMD accelerators.
This thesis proposes to vectorize the program binary speculatively at runtime, in addition to static vectorization at compile time. Several environments provide the runtime profiling and optimization support required for dynamic vectorization, the two most prominent being 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) co-designed processors. A HW/SW co-designed environment offers several advantages over DBTOs, such as transparent incorporation of new hardware features and binary compatibility. Therefore, we use a HW/SW co-designed environment to assess the potential of speculative dynamic vectorization.
Furthermore, we analyze vector code generation for wider vector units and find that, even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector lengths is even more challenging. The two major factors impeding vectorization for wider SIMD units are 1) reduced dynamic instruction stream coverage for vectorization and 2) a large number of permutation instructions. To solve the first problem, we propose Variable Length Vectorization, which iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. To solve the second, we propose Selective Writing, which selectively writes to different parts of a vector register and thereby avoids permutations.
Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators occupy a significant amount of chip area, they become the principal source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce the leakage energy of functional units. However, power gating has its own energy and performance overhead. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for the maximum duration, reducing overall leakage energy.
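The trade-off between leakage savings and power-gating overhead can be illustrated with a simple break-even model. This is only a sketch under assumed parameters, not the algorithm used in the thesis: `leak_per_cycle` (leakage energy of the upper SIMD lanes per idle cycle), `gate_overhead` (energy cost of one power-down and wake-up pair) and the idle-gap trace are all hypothetical, and the extra cost of executing the devectorized code is not modeled.

```python
def gated_leakage(idle_gaps, leak_per_cycle, gate_overhead):
    """Leakage-related energy of the upper lanes over a list of idle gaps (cycles).

    Assumed model: a gap is power gated only when that saves energy, i.e. when
    leak_per_cycle * gap > gate_overhead (the break-even condition). A gated
    gap costs gate_overhead; an ungated gap costs its full leakage.
    """
    total = 0.0
    for gap in idle_gaps:
        ungated = leak_per_cycle * gap
        total += min(ungated, gate_overhead)
    return total


def merge_gaps(idle_gaps):
    """Hypothetical effect of devectorizing the intermittent wide operations
    between the gaps: the upper lanes are never woken, so the gaps fuse."""
    return [sum(idle_gaps)]


# Hypothetical trace: three idle gaps of 40 cycles, each below break-even.
gaps = [40, 40, 40]
before = gated_leakage(gaps, leak_per_cycle=1.0, gate_overhead=50.0)
after = gated_leakage(merge_gaps(gaps), leak_per_cycle=1.0, gate_overhead=50.0)
print(before, after)  # 120.0 50.0
```

In this toy trace, no single gap is long enough to justify gating, so the lanes leak throughout. Fusing the gaps produces one idle period that exceeds the break-even length, which is the situation the abstract describes as keeping the higher lanes power gated for the maximum duration.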
Real-time implementation of 3D LiDAR point cloud semantic segmentation in an FPGA
Master's dissertation in Informatics Engineering
In the last few years, the automotive industry has relied heavily on deep learning applications for
perception solutions. With data-heavy sensors, such as LiDAR, becoming a standard, the task of
developing low-power and real-time applications has become increasingly challenging. To obtain
the maximum computational efficiency, no longer can one focus solely on the software aspect of such
applications, while disregarding the underlying hardware.
In this thesis, a hardware-software co-design approach is used to implement an inference application
leveraging SqueezeSegV3, a LiDAR-based convolutional neural network, on the Versal ACAP VCK190
FPGA. Automotive requirements carefully drive the development of the proposed solution, with real-time
performance and low power consumption being the target metrics.
A first experiment validates the suitability of Xilinx’s Vitis-AI tool for the deployment of deep
convolutional neural networks on FPGAs. Both the ResNet-18 and SqueezeNet neural networks are
deployed to the Zynq UltraScale+ MPSoC ZCU104 and Versal ACAP VCK190 FPGAs. The results show
that both networks far exceed the real-time requirements while consuming low power.
Compared to an NVIDIA RTX 3090 GPU, the performance per watt during inference of the two networks is
12x and 47.8x higher on the Zynq UltraScale+ MPSoC ZCU104, and 15.1x and 26.6x higher on the Versal
ACAP VCK190 FPGA. These results are obtained with no drop in accuracy in the quantization
step.
A second experiment builds upon the results of the first by deploying a real-time application containing
the SqueezeSegV3 model using the Semantic-KITTI dataset. A framerate of 11 Hz is achieved with a peak
power consumption of 78 watts. The quantization step results in minimal accuracy and IoU degradation
of 0.7 and 1.5 points, respectively. A smaller version of the same model is also deployed, achieving a
framerate of 19 Hz and a peak power consumption of 76 watts. The application performs semantic
segmentation over the entire point cloud with a 360° field of view.
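The frame rates and peak power figures reported for the two SqueezeSegV3 deployments can be combined into an energy-per-frame estimate. The sketch below is not from the dissertation; it simply divides peak power by frame rate, so the result is an upper bound that assumes the board draws its peak power continuously. The function name is an assumption made for illustration.

```python
def energy_per_frame_joules(peak_power_w, frame_rate_hz):
    """Upper bound on energy per processed frame.

    Watts are joules per second and hertz are frames per second, so
    power / frame rate yields joules per frame. Using peak power (as
    reported in the abstract) makes this a pessimistic estimate.
    """
    return peak_power_w / frame_rate_hz


# Figures taken from the abstract: 11 Hz at 78 W (full model),
# 19 Hz at 76 W (smaller model).
full = energy_per_frame_joules(78.0, 11.0)
small = energy_per_frame_joules(76.0, 19.0)

print(f"full model:  {full:.2f} J/frame")
print(f"small model: {small:.2f} J/frame")
```

Under this assumption, the smaller model needs at most 4 J per frame, while the full model needs roughly 7 J per frame.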
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
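One way to see why the combined speedup sits closer to the Matrix-Solve figure than to the Model-Evaluation figure is an Amdahl-style calculation for two sequential phases. This is a sketch, not the thesis's methodology: it ignores the Iteration Control phase and any overlap from streaming, the baseline time fraction `frac_model_eval` is a hypothetical input, and mean speedups measured across different benchmark sets do not compose exactly this way.

```python
def combined_speedup(frac_model_eval, s_model_eval, s_matrix_solve):
    """Overall speedup of two sequential phases.

    frac_model_eval: share of baseline runtime spent in Model-Evaluation
                     (hypothetical; the rest is attributed to Matrix-Solve).
    s_model_eval, s_matrix_solve: per-phase speedups.
    """
    if not 0.0 <= frac_model_eval <= 1.0:
        raise ValueError("frac_model_eval must be in [0, 1]")
    # Accelerated runtime, normalized to a baseline runtime of 1.
    new_time = (frac_model_eval / s_model_eval
                + (1.0 - frac_model_eval) / s_matrix_solve)
    return 1.0 / new_time


# Per-phase mean speedups from the abstract, swept over assumed time splits.
for frac in (0.25, 0.5, 0.75):
    print(frac, round(combined_speedup(frac, 6.5, 2.4), 2))
```

The slower phase dominates the accelerated runtime, so the combined figure stays between the two per-phase speedups and is pulled toward the smaller one whenever Matrix-Solve accounts for much of the baseline time.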