    FLECSim-SoC: A Flexible End-to-End Co-Design Simulation Framework for System on Chips

    Hardware accelerators for deep neural networks (DNNs) have established themselves over the past decade. Most developments have targeted higher efficiency for an individual application, which highlights how closely the accelerator design is tied to the requirements of the application. However, a structured design flow currently lacks a tool to evaluate a DNN accelerator embedded in a System on Chip (SoC) platform. To address this gap in the state of the art, we introduce FLECSim, a tool framework that enables an end-to-end simulation of an SoC with dedicated accelerators, CPUs and memories. FLECSim offers flexible configuration of the system and straightforward integration of new accelerator models in both SystemC and RTL, which allows for early design verification. During the simulation, FLECSim provides metrics of the SoC, which can be used to explore the design space. Finally, we present the capabilities of FLECSim, perform an exemplary evaluation with a systolic array-based accelerator and explore the design parameters in terms of accelerator size, power and performance.
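
    As an illustration of the kind of design-space sweep such a framework supports, the short sketch below enumerates accelerator configurations, hands each one to a simulator callback and collects the returned metrics. The configuration fields and the run_simulation callback are hypothetical placeholders, not FLECSim's actual interface.

        # Hypothetical co-design sweep: SoCConfig fields and the simulator
        # callback stand in for whatever the real simulation framework provides.
        from dataclasses import dataclass
        from itertools import product
        from typing import Callable, Dict, List, Tuple

        @dataclass(frozen=True)
        class SoCConfig:
            pe_rows: int   # systolic-array height
            pe_cols: int   # systolic-array width
            sram_kib: int  # on-chip scratchpad size in KiB

        def explore(run_simulation: Callable[[SoCConfig], Dict[str, float]],
                    sizes=(8, 16, 32), srams=(128, 256, 512)) -> List[Tuple[SoCConfig, Dict[str, float]]]:
            """Sweep configurations and collect per-run metrics such as latency, power and area."""
            results = []
            for rows, cols, sram in product(sizes, sizes, srams):
                cfg = SoCConfig(rows, cols, sram)
                results.append((cfg, run_simulation(cfg)))
            return results

        # Toy stand-in for the simulator, only to make the sweep executable:
        dummy = lambda cfg: {"latency_ms": 1e3 / (cfg.pe_rows * cfg.pe_cols),
                             "area_mm2": 0.01 * cfg.sram_kib}
        print(min(explore(dummy), key=lambda r: r[1]["latency_ms"]))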

    Data Movement Reduction for DNN Accelerators: Enabling Dynamic Quantization Through an eFPGA

    Computational requirements for deep neural networks (DNNs) have been on a rising trend for years. Moreover, network dataflows and topologies are becoming more sophisticated to address more challenging applications. DNN accelerators cannot adapt quickly to constantly changing DNNs. In this paper, we describe our approach to making a static accelerator more versatile by adding an embedded FPGA (eFPGA). The eFPGA is tightly coupled to the on-chip network, which allows us to pass data through the eFPGA before and after it is processed by the DNN accelerator. Hence, the proposed solution is able to quickly address changing requirements. To show the benefits of this approach, we propose an eFPGA application that enables dynamic quantization of data. We can fit four number converters on a 1.5 mm² eFPGA, which can process 400M data elements per second. We will practically validate our work in the near future, with an SoC tapeout in the ongoing EPI project.
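
    As a purely functional reference for what such a number converter does, the sketch below performs dynamic, symmetric per-tensor quantization of floating-point data to int8, deriving the scale from the observed value range at run time. It only illustrates the arithmetic and says nothing about the eFPGA implementation described above.

        import numpy as np

        def dynamic_quantize_int8(x: np.ndarray):
            """Quantize a float tensor to int8 with a scale chosen from the data
            itself (symmetric, per-tensor); returns the values and the scale
            needed to dequantize them again."""
            max_abs = float(np.max(np.abs(x))) or 1.0   # guard against all-zero input
            scale = max_abs / 127.0
            q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
            return q, scale

        def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
            return q.astype(np.float32) * scale

        # Example: quantize activations before the accelerator, restore them afterwards.
        acts = np.random.randn(4, 1024).astype(np.float32)
        q, s = dynamic_quantize_int8(acts)
        print("max quantization error:", np.max(np.abs(acts - dequantize(q, s))))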

    Embedded Image Processing the European Way: A new platform for the future automotive market

    An Analytical Model of Configurable Systolic Arrays to find the Best-Fitting Accelerator for a given DNN Workload

    Since their breakthrough, the complexity of Deep Neural Networks (DNNs) has been rising steadily. As a result, accelerators for DNNs are now used in many domains. However, designing and configuring an accelerator that perfectly meets the requirements of a given application is a challenging task. In this paper, we therefore present our approach to supporting the accelerator design process. With an analytical model of a systolic array we can estimate performance, energy consumption and area for each design option. To determine these metrics, usually a cycle-accurate simulation is performed, which is a time-consuming task. Hence, the design space has to be restricted heavily. Analytical modelling, however, allows for fast evaluation of a design using a mathematical abstraction of the accelerator. For DNNs, this works especially well since the dataflow and memory accesses have high regularity. To show the correctness of our model, we perform an exemplary realization with the state-of-the-art systolic array generator Gemmini and compare it with a cycle-accurate simulation and state-of-the-art modelling tools, showing less than 1% deviation. We also conducted a design space exploration, showing the analytical model's capabilities to support an accelerator design. In a case study on ResNet-34, we demonstrate that our model and DSE tool reduce the time to find the best-fitting solution by four or two orders of magnitude compared to a cycle-accurate simulation or state-of-the-art modelling tools, respectively.
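
    To make the idea of analytical modelling concrete, the following simplified first-order estimate counts cycles for a tiled matrix multiplication on a rows x cols systolic array: each output tile streams K partial sums plus a pipeline fill/drain overhead. The formulas and the energy constant are illustrative assumptions; the model in the paper and Gemmini's actual dataflow are more detailed.

        import math

        def systolic_matmul_cycles(M, K, N, rows, cols):
            """First-order cycle count for C = A(MxK) @ B(KxN) on a rows x cols array:
            K partial sums per output tile plus roughly rows+cols fill/drain cycles."""
            tiles = math.ceil(M / rows) * math.ceil(N / cols)
            return tiles * (K + rows + cols)

        def estimate(M, K, N, rows, cols, freq_hz=1e9, energy_per_mac_pj=0.5):
            cycles = systolic_matmul_cycles(M, K, N, rows, cols)
            macs = M * K * N
            return {"latency_s": cycles / freq_hz,
                    "utilization": macs / (cycles * rows * cols),
                    "energy_j": macs * energy_per_mac_pj * 1e-12}  # compute only, no memory traffic

        # Example: a convolution lowered to GEMM, mapped onto a 16x16 array
        print(estimate(M=3136, K=576, N=64, rows=16, cols=16))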

    EFFECT: An End-to-End Framework for Evaluating Strategies for Parallel AI Anomaly Detection

    Neural networks achieve high accuracy in tasks like image recognition or segmentation. However, their application in safety-critical domains is limited due to their black-box nature and vulnerability to specific types of attacks. To mitigate this, methods that detect out-of-distribution inputs or adversarial attacks in parallel to the network inference were introduced. These methods are hard to compare because they were developed for different use cases, datasets, and networks. To fill this gap, we introduce EFFECT, an end-to-end framework to evaluate and compare new methods for anomaly detection, without the need for retraining and by using traces of intermediate inference results. The presented workflow works with every preexisting neural network architecture and evaluates the considered anomaly detection methods in terms of accuracy and computational complexity. We demonstrate EFFECT's capabilities by creating new detectors for ShuffleNet and MobileNetV2 for anomaly detection as well as fault origin detection. EFFECT allows us to design an anomaly detector based on the Mahalanobis distance as well as CNN-based detectors. For both use cases, we achieve accuracies of over 85% when classifying inferences as normal or abnormal, thus beating existing methods.
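
    The Mahalanobis-distance detector mentioned above boils down to fitting a Gaussian to intermediate feature vectors of normal data and flagging inputs whose features lie too far from it. The sketch below is a generic illustration of that idea, not EFFECT's implementation.

        import numpy as np

        class MahalanobisDetector:
            """Fit mean and covariance of in-distribution intermediate features,
            then score new features by their squared Mahalanobis distance."""
            def fit(self, feats: np.ndarray):               # feats: (n_samples, dim)
                self.mu = feats.mean(axis=0)
                cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
                self.prec = np.linalg.inv(cov)              # precision matrix
                return self

            def score(self, f: np.ndarray) -> float:
                d = f - self.mu
                return float(d @ self.prec @ d)

            def is_anomalous(self, f: np.ndarray, threshold: float) -> bool:
                return self.score(f) > threshold

        # Example: threshold chosen as a high percentile of in-distribution scores.
        train_feats = np.random.randn(1000, 64)
        det = MahalanobisDetector().fit(train_feats)
        thr = np.percentile([det.score(f) for f in train_feats], 95)
        print(det.is_anomalous(np.random.randn(64) * 3.0, thr))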

    CNNParted: An open source framework for efficient Convolutional Neural Network inference partitioning in embedded systems

    Applications such as autonomous driving or assistive robotics heavily rely on the usage of Deep Neural Networks. In particular, Convolutional Neural Networks (CNNs) provide precise and reliable results in image processing tasks like camera-based object detection or semantic segmentation. However, to achieve even better results, CNNs are becoming more and more complex. Deploying these networks in distributed embedded systems thereby imposes new challenges, due to additional constraints regarding performance and energy consumption in the near-sensor compute platforms, i.e. the sensor nodes. Processing all data in the central node, however, is disadvantageous, since raw camera data consumes large bandwidth and running CNN inference for multiple tasks requires considerable compute performance. Moreover, sending raw data over the interconnect is not advisable for privacy reasons. Hence, offloading CNN workload to the sensor nodes in the system can lead to reduced traffic on the link and a higher level of data security. However, due to the limited hardware resources on the sensor nodes, partitioning CNNs has to be done carefully to meet overall latency requirements and energy constraints. Therefore, we present CNNParted, an open-source framework for efficient, hardware-aware CNN inference partitioning targeting embedded AI applications. It automatically searches for potential partitioning points in the CNN to find a beneficial workload distribution between sensor nodes and a central edge node. Thereby, CNNParted not only analyzes the CNN architecture but also takes hardware components, such as dedicated hardware accelerators and memories, into consideration to evaluate inference partitioning regarding latency and energy consumption. As an example, we apply CNNParted to three feed-forward CNNs commonly used in embedded systems. The framework first searches for several potential partitioning points and then evaluates them regarding inference latency and energy consumption. Based on the results, beneficial partitioning points can be identified depending on the system constraints. Using the framework, we are able to find and evaluate 10 potential partitioning points for FCN ResNet-50, 13 partitioning points for GoogLeNet, and 8 partitioning points for SqueezeNet V1.1 within 520 s, 330 s, and 140 s, respectively, on an AMD EPYC 7702P running 8 concurrent threads. For GoogLeNet, we determine two partitioning points that provide a good trade-off between required bandwidth, latency and energy consumption. We also provide insights into further interesting findings that can be derived from the evaluation results.
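
    The trade-off the framework evaluates can be illustrated with a simple model: for a cut after layer i, end-to-end latency is the sensor-side compute for layers up to i, plus the time to transmit that layer's feature map over the link, plus the edge-side compute for the remaining layers. The per-layer numbers and link parameters in the sketch below are made up; CNNParted derives such figures from hardware models instead.

        def evaluate_partitions(layer_latency_sensor_ms, layer_latency_edge_ms,
                                layer_output_bytes, link_mbps, link_energy_nj_per_byte):
            """For each cut after layer i, return latency and link cost of running
            layers [0..i] on the sensor node and layers [i+1..] on the edge node."""
            n = len(layer_latency_sensor_ms)
            results = []
            for cut in range(n):
                sensor = sum(layer_latency_sensor_ms[: cut + 1])
                edge = sum(layer_latency_edge_ms[cut + 1:])
                tx_bytes = layer_output_bytes[cut]
                tx_ms = tx_bytes * 8 / (link_mbps * 1e3)      # Mbit/s -> bits per ms
                results.append({"cut_after_layer": cut,
                                "latency_ms": sensor + tx_ms + edge,
                                "tx_bytes": tx_bytes,
                                "link_energy_uj": tx_bytes * link_energy_nj_per_byte / 1e3})
            return results

        # Toy example with made-up per-layer numbers:
        res = evaluate_partitions(
            layer_latency_sensor_ms=[4.0, 6.0, 8.0, 10.0],
            layer_latency_edge_ms=[0.5, 0.8, 1.0, 1.2],
            layer_output_bytes=[200_000, 100_000, 50_000, 10_000],
            link_mbps=100, link_energy_nj_per_byte=4.0)
        print(min(res, key=lambda r: r["latency_ms"]))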

    QUA³CK - A Machine Learning Development Process

    Machine learning and data processing are trending topics at the moment. However, there is still a lack of a standard process to support a fast, simple, and effective development of machine learning models for academia and industry combined. Processes such as KDD or CRISP-DM are highly specialized in data mining and business cases. Therefore, engineers often refer to individual approaches to solve a machine learning problem. Especially in teaching, the lack of a standard process is a challenge. Students typically get a better understanding if a systematic approach to solve problems is given to them. A challenge when formulating a machine learning development process is to provide standard actions that work on different use cases. At the same time, it has to be simple. Complex processes often lead to the wrong approach. The QUA³CK process was created at the Karlsruhe Institute of Technology to fill the gap in research and industry for a machine learning development process. However, the main focus was to reach engineering students with an easy-to-remember, didactic way to solve machine learning problems. This five-stage process starts with a machine learning question (Q), a problem that has to be solved. Understanding the data (U) comes next. Then, the loop between selecting an Algorithm (A), Adapting the features (A), and Adjusting the hyperparameters (A) is executed until the system is ready for Conclude and compare (C). At last, the Knowledge transfer (K) of the given solution can be realized as deployment in hardware or as documentation. This paper describes the process and all individual steps in detail. In addition, we present several use cases of QUA³CK in academia and research projects.

    Towards reconfigurable accelerators in HPC: Designing a multipurpose eFPGA tile for heterogeneous SoCs

    The goal of modern high performance computing platforms is to combine low power consumption and high throughput. Within the European Processor Initiative (EPI), such an SoC platform to meet the novel exascale requirements is built and investigated. As part of this project, we introduce an embedded Field Programmable Gate Array (eFPGA), adding flexibility to accelerate various workloads. In this article, we show our approach to designing the eFPGA tile that supports the EPI SoC. While eFPGAs are inherently reconfigurable, their initial design has to be determined for tape-out. The design space of the eFPGA is explored and evaluated with different configurations of two HPC workloads, covering control- and dataflow-heavy applications. As a result, we present a well-balanced eFPGA design that can host several use cases and potential future ones by allocating 1% of the total EPI SoC area. Finally, our simulation results of the architectures on the eFPGA show great performance improvements over their software counterparts. The European Processor Initiative (EPI) project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826647, from the Spanish Government (PID2019-107255GB-C21/AEI/10.13039/501100011033), and from the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328). M. Moreto is partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.

    Embedded Face Recognition for Personalized Services in the Assistive Robotics

    Recently, the field of assistive robotics has drawn much attention in the health care sector. In combination with modern machine learning-supported person recognition systems, assistive robots can deliver highly personalized services. However, common algorithms for person recognition, such as convolutional neural networks (CNNs), consume high amounts of power and show low energy efficiency when executed on general-purpose computing platforms. In this paper, we present our hardware architecture and field programmable gate array (FPGA) accelerator to enable on-device person recognition in the context of assistive robotics. To this end, we optimize a neural network based on the SqueezeNet topology and implement it on an FPGA for a high degree of flexibility and reconfigurability. By pruning redundant filters and quantizing weights and activations, we find a well-fitting neural network that achieves a high identification accuracy of 84%. On a Xilinx Zynq Ultra96v2, we achieve a power consumption of 4.8 W, a latency of 31 ms and an efficiency of 6.738 FPS/W. Compared to recent person recognition systems in assistive robots, our results improve latency by 1.6x, and compared to recent embedded face recognition systems, energy efficiency by 1.7x.
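
    The reported efficiency follows directly from the measured latency and power, assuming one frame per inference and no pipelining: throughput is the inverse of latency, and efficiency is throughput divided by power, as the short check below shows.

        latency_s = 0.031          # 31 ms per frame
        power_w = 4.8              # measured power consumption

        fps = 1.0 / latency_s      # ~32.3 frames per second
        fps_per_watt = fps / power_w
        print(f"{fps:.1f} FPS, {fps_per_watt:.2f} FPS/W")  # ~6.7 FPS/W, consistent with the reported 6.738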