22 research outputs found

    Hybrid Performance Prediction Models for Fully-Connected Neural Networks on MPSoC

    Predicting the performance of Artificial Neural Networks (ANNs) on embedded multi-core platforms is tedious. Concurrent accesses to shared resources are hard to model due to congestion effects on the shared communication medium, which affect the performance of the application. In this paper, we present a hybrid modeling environment that enables fast yet accurate timing prediction for fully-connected ANNs deployed on multi-core platforms. The modeling flow integrates an analytical computation time model with a communication time model, both calibrated through measurements inside a system-level SystemC simulation. The proposed flow predicts the end-to-end latency for different mappings of several fully-connected ANNs with an average accuracy of more than 99%.
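
    To make the idea concrete, the sketch below shows one way such a hybrid model could be structured: an analytical compute-time term per fully-connected layer plus a communication-time term whose coefficients would be calibrated against SystemC simulation measurements. It is not the authors' tool; all class names, constants and the simple contention factor are illustrative assumptions.

```python
# Hybrid latency sketch: analytical compute time + calibrated communication time.
# All parameters and the mapping format are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FcLayer:
    in_features: int
    out_features: int

@dataclass
class CoreModel:
    macs_per_cycle: float   # sustained MAC throughput of one core
    clock_hz: float

@dataclass
class BusModel:
    # would be calibrated from simulation: latency = setup + bytes / eff_bandwidth
    setup_s: float
    eff_bandwidth_bps: float

def compute_time(layer: FcLayer, core: CoreModel) -> float:
    macs = layer.in_features * layer.out_features
    return macs / (core.macs_per_cycle * core.clock_hz)

def comm_time(payload_bytes: int, bus: BusModel, contenders: int = 1) -> float:
    # crude congestion handling: effective bandwidth shared among contenders
    return bus.setup_s + payload_bytes * contenders / bus.eff_bandwidth_bps

def end_to_end_latency(layers, mapping, core, bus, bytes_per_act=1):
    """Sum per-layer compute time plus activation transfers whenever
    consecutive layers are mapped to different cores."""
    total = 0.0
    for i, layer in enumerate(layers):
        total += compute_time(layer, core)
        if i + 1 < len(layers) and mapping[i] != mapping[i + 1]:
            total += comm_time(layer.out_features * bytes_per_act, bus)
    return total

if __name__ == "__main__":
    net = [FcLayer(784, 128), FcLayer(128, 64), FcLayer(64, 10)]
    core = CoreModel(macs_per_cycle=2.0, clock_hz=600e6)
    bus = BusModel(setup_s=2e-6, eff_bandwidth_bps=400e6)
    print(end_to_end_latency(net, mapping=[0, 1, 1], core=core, bus=bus))
```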

    Setup of an Experimental Framework for Performance Modeling and Prediction of Embedded Multicore AI Architectures

    Evaluating the performance of complex applications such as Artificial Intelligence (AI) algorithms, and more specifically neural networks, on Multi-Processor Systems on a Chip (MPSoC) is tedious. Finding an optimized partitioning of the application while accurately predicting the latency induced by communication bus congestion is hard with traditional analysis methods. This document presents a performance prediction workflow based on SystemC simulation models for timing prediction of neural networks on MPSoC.

    FPGA based in-memory AI computing

    The advent of AI in vehicles of all kinds is creating a need for ever larger computing capacities. Depending on the type of vehicle, this gives rise to different problems: while overall hardware and engineering costs dominate for airplanes, in fully electric cars the cost of the computing hardware is the greater concern. Common to both domains are tight requirements on the size, weight and space of the hardware, especially for drones and satellites, where these constraints are most challenging. For airplanes, and especially for satellites, an additional challenge is the radiation resistance of the usually very memory-intensive AI systems. We therefore propose an FPGA-based in-memory AI computation methodology which is so far only applicable to small AI systems, but works exclusively with the local memory elements of FPGAs: lookup tables (LUTs) and registers. By avoiding external and thus slow, inefficient and radiation-sensitive DRAM in favor of local SRAM only, we can make AI systems faster, lighter and more efficient than is possible with conventional GPUs or AI accelerators. All known radiation hardening techniques for FPGAs also work for our systems.
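
    The core idea can be illustrated in software: for a sufficiently small (here binarized) neuron, every possible input pattern can be enumerated once and the result stored in a table, so inference reduces to a single lookup, which is the role FPGA LUTs and registers play on-chip. This is only a conceptual sketch, not the authors' FPGA flow; the weights, threshold and sizes are made up.

```python
# Conceptual LUT-based "in-memory" inference for one tiny binary neuron:
# precompute the output for all 2^n input patterns, then answer by lookup.
from itertools import product

def build_neuron_lut(weights, threshold):
    """Precompute the output of a binary neuron for every input pattern."""
    n = len(weights)
    lut = {}
    for bits in product((0, 1), repeat=n):
        acc = sum(w * b for w, b in zip(weights, bits))
        lut[bits] = 1 if acc >= threshold else 0
    return lut

if __name__ == "__main__":
    lut = build_neuron_lut(weights=(1, -1, 1, 1), threshold=1)
    print(lut[(1, 0, 1, 0)])  # inference = one table read, no multiplications
```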

    Accelerating and pruning CNNs for semantic segmentation on FPGA

    Semantic segmentation is one of the popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-based convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which cuts down the redundant zero multiplications. Furthermore, we propose a genetic-algorithm-based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (a 1.98 pp drop in mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design, baseline and pruned models can be found at github.com/pierpaolomori/SemanticSegmentationFPGA.
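
    As a rough illustration of the pruning search (not the paper's actual framework), the sketch below encodes one keep-ratio per layer in a chromosome and lets a small genetic algorithm trade an accuracy proxy against a latency estimate; evaluate_accuracy and estimate_latency are placeholders that a real flow would back with fine-tuning results and the accelerator model.

```python
# Toy genetic-algorithm channel pruning: chromosome = per-layer keep ratios.
# Fitness and latency/accuracy models are illustrative placeholders.
import random

LAYERS = 8                       # illustrative network depth
RATIOS = [0.25, 0.5, 0.75, 1.0]  # allowed per-layer keep ratios

def estimate_latency(chromosome):
    # placeholder for an accurate model of the custom accelerator
    return sum(chromosome)

def evaluate_accuracy(chromosome):
    # placeholder for a fast accuracy proxy (e.g., short fine-tuning run)
    return 1.0 - 0.05 * sum(1.0 - r for r in chromosome) / LAYERS

def fitness(chromosome, alpha=0.5):
    return evaluate_accuracy(chromosome) - alpha * estimate_latency(chromosome) / LAYERS

def crossover(a, b):
    cut = random.randrange(1, LAYERS)
    return a[:cut] + b[cut:]

def mutate(c, p=0.1):
    return [random.choice(RATIOS) if random.random() < p else g for g in c]

def search(pop_size=20, generations=30):
    pop = [[random.choice(RATIOS) for _ in range(LAYERS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    print(search())
```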

    Using Network Architecture Search for Optimizing Tensor Compression

    In this work we propose to use Network Architecture Search (NAS) for controlling the per-layer parameters of a Tensor Compression (TC) algorithm using Tucker decomposition, in order to optimize a given convolutional neural network for its parameter count and thus its inference performance on embedded systems. TC enables a quick generation of the next instance in the NAS process, avoiding the need for a time-consuming full training after each step. We show that this approach is more efficient than conventional NAS and can outperform all TC heuristics reported so far. Nevertheless, finding a good solution in the vast search space of layer-wise TC is still a very time-consuming process. We show that it is possible to reduce the parameter size by up to 85% at the cost of 0.1-1% of Top-1 accuracy on our vision processing benchmarks. Further, the compressed model occupies just 20% of the memory required for storing the entire uncompressed model, while increasing inference speed by up to 2.5 times without much loss in performance, indicating potential gains for embedded systems.
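
    The leverage that per-layer ranks give the search can be seen from the parameter count of a Tucker-2-compressed convolution: a K x K x Cin x Cout kernel decomposed with ranks (Rin, Rout) needs Cin*Rin + K*K*Rin*Rout + Rout*Cout parameters instead of K*K*Cin*Cout. The short sketch below (illustrative layer shapes and candidate ranks, not the paper's NAS flow) prints this trade-off.

```python
# Parameter-count model for Tucker-2 compression of a convolutional layer.
# Layer shape and candidate ranks are illustrative.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def tucker2_params(k, c_in, c_out, r_in, r_out):
    return c_in * r_in + k * k * r_in * r_out + r_out * c_out

if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256
    full = conv_params(k, c_in, c_out)
    for r in (32, 64, 128):  # candidate ranks a NAS step might try
        compressed = tucker2_params(k, c_in, c_out, r, r)
        print(f"ranks ({r}, {r}): {compressed} params "
              f"({1 - compressed / full:.0%} fewer than {full})")
```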

    A Comparison of Approaches for High-level Power Estimation of LUT-based DSP Components

    We compare two approaches for high-level power estimation of DSP components.

    Hardware Execution Time Prediction for Neural Network Layers

    We present an estimation methodology that accurately predicts the execution time for a given embedded Artificial Intelligence (AI) accelerator and a neural network (NN) under analysis. The timing prediction is implemented as a Python library called Model of Neural Network Execution Time (MONNET) and performs its predictions by analyzing the Keras description of an NN under test within milliseconds. This enables several techniques for designing NNs for embedded hardware. Designers can avoid training networks that would be functionally sufficient but would likely fail the timing requirements. The technique can also be included in automated network architecture search algorithms, enabling exact hardware execution times to become one contributor to the search's target function. In order to perform precise estimations for a target hardware platform, each new hardware needs to undergo an initial automatic characterization process using tens of thousands of different small NNs. This process may take several days, depending on the hardware. We tested our methodology on the Intel Neural Compute Stick 2, where we achieved a root mean squared percentage error (RMSPE) below 21% for a large range of industry-relevant NNs from vision processing.
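
    Conceptually, such a predictor walks the layers of a Keras model and sums per-layer time estimates taken from a previously characterized, hardware-specific model. The sketch below imitates that flow with a hand-written table of nanoseconds per parameter per layer type; it is not the MONNET library, and all coefficients are illustrative.

```python
# MONNET-style idea in miniature: predict execution time from a Keras
# description using a per-hardware, per-layer-type cost table.
from tensorflow import keras
from tensorflow.keras import layers

# stand-in for the result of the automatic hardware characterization
NS_PER_PARAM = {"Dense": 0.8, "Conv2D": 1.5}
DEFAULT_NS_PER_PARAM = 1.0

def predict_execution_time_ms(model: keras.Model) -> float:
    total_ns = 0.0
    for layer in model.layers:
        coeff = NS_PER_PARAM.get(layer.__class__.__name__, DEFAULT_NS_PER_PARAM)
        total_ns += coeff * layer.count_params()
    return total_ns / 1e6

if __name__ == "__main__":
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10),
    ])
    print(f"predicted: {predict_execution_time_ms(model):.3f} ms")
```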