22 research outputs found

    Hybrid Performance Prediction Models for Fully-Connected Neural Networks on MPSoC

    Predicting the performance of Artificial Neural Networks (ANNs) on embedded multi-core platforms is tedious. Concurrent accesses to shared resources are hard to model due to congestion effects on the shared communication medium, which affect the performance of the application. In this paper, we present a hybrid modeling environment that enables fast yet accurate timing prediction for fully-connected ANNs deployed on multi-core platforms. The modeling flow integrates an analytical computation time model with a communication time model, both calibrated through measurements inside a system-level SystemC simulation. The proposed flow predicts the end-to-end latency for different mappings of several fully-connected ANNs with an average accuracy of more than 99%.
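
    To make the idea concrete, the sketch below shows one way such a hybrid model could be structured: an analytical compute-time term per fully-connected layer plus a communication-time term whose coefficients would be calibrated against SystemC simulation measurements. It is not the authors' tool; all class names, constants and the simple contention factor are illustrative assumptions.

```python
# Hybrid latency sketch: analytical compute time + calibrated communication time.
# All parameters and the mapping format are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class FcLayer:
    in_features: int
    out_features: int

@dataclass
class CoreModel:
    macs_per_cycle: float   # sustained MAC throughput of one core
    clock_hz: float

@dataclass
class BusModel:
    # would be calibrated from simulation: latency = setup + bytes / eff_bandwidth
    setup_s: float
    eff_bandwidth_bps: float

def compute_time(layer: FcLayer, core: CoreModel) -> float:
    macs = layer.in_features * layer.out_features
    return macs / (core.macs_per_cycle * core.clock_hz)

def comm_time(payload_bytes: int, bus: BusModel, contenders: int = 1) -> float:
    # crude congestion handling: effective bandwidth shared among contenders
    return bus.setup_s + payload_bytes * contenders / bus.eff_bandwidth_bps

def end_to_end_latency(layers, mapping, core, bus, bytes_per_act=1):
    """Sum per-layer compute time plus activation transfers whenever
    consecutive layers are mapped to different cores."""
    total = 0.0
    for i, layer in enumerate(layers):
        total += compute_time(layer, core)
        if i + 1 < len(layers) and mapping[i] != mapping[i + 1]:
            total += comm_time(layer.out_features * bytes_per_act, bus)
    return total

if __name__ == "__main__":
    net = [FcLayer(784, 128), FcLayer(128, 64), FcLayer(64, 10)]
    core = CoreModel(macs_per_cycle=2.0, clock_hz=600e6)
    bus = BusModel(setup_s=2e-6, eff_bandwidth_bps=400e6)
    print(end_to_end_latency(net, mapping=[0, 1, 1], core=core, bus=bus))
```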

    Setup of an Experimental Framework for Performance Modeling and Prediction of Embedded Multicore AI Architectures

    Evaluating the performance of complex applications such as Artificial Intelligence (AI) algorithms, and more specifically neural networks, on Multi-Processor Systems on a Chip (MPSoC) is tedious. Finding an optimized partitioning of the application while accurately predicting the latency induced by communication bus congestion is hard with traditional analysis methods. This document presents a performance prediction workflow based on SystemC simulation models for timing prediction of neural networks on MPSoC.

    FPGA based in-memory AI computing

    The advent of AI in vehicles of all kinds is creating a need for ever larger computing capacities. Depending on the type of vehicle, this gives rise to different problems: while overall hardware and engineering costs dominate for airplanes, in fully electric cars the cost of the computing hardware is the greater concern. Common to both domains are tight requirements on the size, weight and space of the hardware, especially for drones and satellites, where these constraints are most challenging. For airplanes, and especially for satellites, an additional challenge is the radiation resistance of the usually very memory-intensive AI systems. We therefore propose an FPGA-based in-memory AI computation methodology which is so far only applicable to small AI systems, but works exclusively with the local memory elements of FPGAs: lookup tables (LUTs) and registers. By avoiding external and thus slow, inefficient and radiation-sensitive DRAM in favor of local SRAM only, we can make AI systems faster, lighter and more efficient than is possible with conventional GPUs or AI accelerators. All known radiation hardening techniques for FPGAs also work for our systems.
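
    The core idea can be illustrated in software: for a sufficiently small (here binarized) neuron, every possible input pattern can be enumerated once and the result stored in a table, so inference reduces to a single lookup, which is the role FPGA LUTs and registers play on-chip. This is only a conceptual sketch, not the authors' FPGA flow; the weights, threshold and sizes are made up.

```python
# Conceptual LUT-based "in-memory" inference for one tiny binary neuron:
# precompute the output for all 2^n input patterns, then answer by lookup.
from itertools import product

def build_neuron_lut(weights, threshold):
    """Precompute the output of a binary neuron for every input pattern."""
    n = len(weights)
    lut = {}
    for bits in product((0, 1), repeat=n):
        acc = sum(w * b for w, b in zip(weights, bits))
        lut[bits] = 1 if acc >= threshold else 0
    return lut

if __name__ == "__main__":
    lut = build_neuron_lut(weights=(1, -1, 1, 1), threshold=1)
    print(lut[(1, 0, 1, 0)])  # inference = one table read, no multiplications
```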

    Accelerating and pruning CNNs for semantic segmentation on FPGA

    Semantic segmentation is one of the popular tasks in computer vision, providing pixel-wise annotations for scene understanding. However, segmentation-based convolutional neural networks require tremendous computational power. In this work, a fully-pipelined hardware accelerator with support for dilated convolution is introduced, which cuts down the redundant zero multiplications. Furthermore, we propose a genetic-algorithm-based automated channel pruning technique to jointly optimize computational complexity and model accuracy. Finally, hardware heuristics and an accurate model of the custom accelerator design enable a hardware-aware pruning framework. We achieve 2.44X lower latency with minimal degradation in semantic prediction quality (a 1.98 pp drop in mean intersection over union) compared to the baseline DeepLabV3+ model, evaluated on an Arria-10 FPGA. The binary files of the FPGA design, baseline and pruned models can be found at github.com/pierpaolomori/SemanticSegmentationFPGA.
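
    As a rough illustration of the pruning search (not the paper's actual framework), the sketch below encodes one keep-ratio per layer in a chromosome and lets a small genetic algorithm trade an accuracy proxy against a latency estimate; evaluate_accuracy and estimate_latency are placeholders that a real flow would back with fine-tuning results and the accelerator model.

```python
# Toy genetic-algorithm channel pruning: chromosome = per-layer keep ratios.
# Fitness and latency/accuracy models are illustrative placeholders.
import random

LAYERS = 8                       # illustrative network depth
RATIOS = [0.25, 0.5, 0.75, 1.0]  # allowed per-layer keep ratios

def estimate_latency(chromosome):
    # placeholder for an accurate model of the custom accelerator
    return sum(chromosome)

def evaluate_accuracy(chromosome):
    # placeholder for a fast accuracy proxy (e.g., short fine-tuning run)
    return 1.0 - 0.05 * sum(1.0 - r for r in chromosome) / LAYERS

def fitness(chromosome, alpha=0.5):
    return evaluate_accuracy(chromosome) - alpha * estimate_latency(chromosome) / LAYERS

def crossover(a, b):
    cut = random.randrange(1, LAYERS)
    return a[:cut] + b[cut:]

def mutate(c, p=0.1):
    return [random.choice(RATIOS) if random.random() < p else g for g in c]

def search(pop_size=20, generations=30):
    pop = [[random.choice(RATIOS) for _ in range(LAYERS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    print(search())
```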

    Using Network Architecture Search for Optimizing Tensor Compression

    In this work we propose to use Network Architecture Search (NAS) for controlling the per-layer parameters of a Tensor Compression (TC) algorithm using Tucker decomposition, in order to optimize a given convolutional neural network for its parameter count and thus its inference performance on embedded systems. TC enables a quick generation of the next instance in the NAS process, avoiding the need for a time-consuming full training after each step. We show that this approach is more efficient than conventional NAS and can outperform all TC heuristics reported so far. Nevertheless, finding a good solution in the vast search space of layer-wise TC is still a very time-consuming process. We show that it is possible to reduce the parameter size by up to 85% at the cost of 0.1-1% of Top-1 accuracy on our vision processing benchmarks. Further, the compressed model occupies just 20% of the memory required for storing the entire uncompressed model, while increasing inference speed by up to 2.5 times without much loss in performance, indicating potential gains for embedded systems.
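
    The leverage that per-layer ranks give the search can be seen from the parameter count of a Tucker-2-compressed convolution: a K x K x Cin x Cout kernel decomposed with ranks (Rin, Rout) needs Cin*Rin + K*K*Rin*Rout + Rout*Cout parameters instead of K*K*Cin*Cout. The short sketch below (illustrative layer shapes and candidate ranks, not the paper's NAS flow) prints this trade-off.

```python
# Parameter-count model for Tucker-2 compression of a convolutional layer.
# Layer shape and candidate ranks are illustrative.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def tucker2_params(k, c_in, c_out, r_in, r_out):
    return c_in * r_in + k * k * r_in * r_out + r_out * c_out

if __name__ == "__main__":
    k, c_in, c_out = 3, 256, 256
    full = conv_params(k, c_in, c_out)
    for r in (32, 64, 128):  # candidate ranks a NAS step might try
        compressed = tucker2_params(k, c_in, c_out, r, r)
        print(f"ranks ({r}, {r}): {compressed} params "
              f"({1 - compressed / full:.0%} fewer than {full})")
```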

    A Comparison of Approaches for High-level Power Estimation of LUT-based DSP Components

    We compare two approaches for high-level power estimation of DSP components.

    Hardware Execution Time Prediction for Neural Network Layers

    We present an estimation methodology that accurately predicts the execution time for a given embedded Artificial Intelligence (AI) accelerator and a neural network (NN) under analysis. The timing prediction is implemented as a Python library called Model of Neural Network Execution Time (MONNET) and performs its predictions by analyzing the Keras description of an NN under test within milliseconds. This enables several techniques for designing NNs for embedded hardware. Designers can avoid training networks that would be functionally sufficient but would likely fail the timing requirements. The technique can also be included in automated network architecture search algorithms, enabling exact hardware execution times to become one contributor to the search's target function. In order to perform precise estimations for a target hardware platform, each new hardware needs to undergo an initial automatic characterization process using tens of thousands of different small NNs. This process may take several days, depending on the hardware. We tested our methodology on the Intel Neural Compute Stick 2, where we achieved a root mean squared percentage error (RMSPE) below 21% for a large range of industry-relevant NNs from vision processing.
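
    Conceptually, such a predictor walks the layers of a Keras model and sums per-layer time estimates taken from a previously characterized, hardware-specific model. The sketch below imitates that flow with a hand-written table of nanoseconds per parameter per layer type; it is not the MONNET library, and all coefficients are illustrative.

```python
# MONNET-style idea in miniature: predict execution time from a Keras
# description using a per-hardware, per-layer-type cost table.
from tensorflow import keras
from tensorflow.keras import layers

# stand-in for the result of the automatic hardware characterization
NS_PER_PARAM = {"Dense": 0.8, "Conv2D": 1.5}
DEFAULT_NS_PER_PARAM = 1.0

def predict_execution_time_ms(model: keras.Model) -> float:
    total_ns = 0.0
    for layer in model.layers:
        coeff = NS_PER_PARAM.get(layer.__class__.__name__, DEFAULT_NS_PER_PARAM)
        total_ns += coeff * layer.count_params()
    return total_ns / 1e6

if __name__ == "__main__":
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10),
    ])
    print(f"predicted: {predict_execution_time_ms(model):.3f} ms")
```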