1,448 research outputs found

    A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems

    Full text link
    Recent technological advances have greatly improved the performance and features of embedded systems. With the number of just mobile devices now reaching nearly equal to the population of earth, embedded systems have truly become ubiquitous. These trends, however, have also made the task of managing their power consumption extremely challenging. In recent years, several techniques have been proposed to address this issue. In this paper, we survey the techniques for managing power consumption of embedded systems. We discuss the need of power management and provide a classification of the techniques on several important parameters to highlight their similarities and differences. This paper is intended to help the researchers and application-developers in gaining insights into the working of power management techniques and designing even more efficient high-performance embedded systems of tomorrow

    Maximizing CNN Accelerator Efficiency Through Resource Partitioning

    Full text link
    Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x

    Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles

    Get PDF
    We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The proposed methodology is applied on a number of use cases to illustrate the methodology. We show the interaction between the three components of efficiency and show how bottlenecks are revealed

    LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

    Full text link
    Machine learning has recently gained traction as a way to overcome the slow accelerator generation and implementation process on an FPGA. It can be used to build performance and resource usage models that enable fast early-stage design space exploration. First, training requires large amounts of data (features extracted from design synthesis and implementation tools), which is cost-inefficient because of the time-consuming accelerator design and implementation process. Second, a model trained for a specific environment cannot predict performance or resource usage for a new, unknown environment. In a cloud system, renting a platform for data collection to build an ML model can significantly increase the total-cost-ownership (TCO) of a system. Third, ML-based models trained using a limited number of samples are prone to overfitting. To overcome these limitations, we propose LEAPER, a transfer learning-based approach for prediction of performance and resource usage in FPGA-based systems. The key idea of LEAPER is to transfer an ML-based performance and resource usage model trained for a low-end edge environment to a new, high-end cloud environment to provide fast and accurate predictions for accelerator implementation. Experimental results show that LEAPER (1) provides, on average across six workloads and five FPGAs, 85% accuracy when we use our transferred model for prediction in a cloud environment with 5-shot learning and (2) reduces design-space exploration time for accelerator implementation on an FPGA by 10x, from days to only a few hours

    Efficient FPSoC Prototyping of FCS-MPC for Three-Phase Voltage Source Inverters

    Get PDF
    This work describes an efficient implementation in terms of computation time and resource usage in a Field-Programmable System-On-Chip (FPSoC) of a Finite Control Set Model Predictive Control (FCS-MPC) algorithm. As an example, the FCS-MPC implementation is used for the current reference tracking of a two-level three-phase power converter. The proposed solution is an enabler for using both complex control algorithms and digital controllers for high switching frequency semiconductor technologies. An original HW/SW (hardware and software) system architecture for an FPSoC is designed to take advantage of a modern operating system, while removing time uncertainty in real-time software tasks, and exploiting dedicated FPGA fabric for the most complex computations. In addition, two different architectures for the FPGA-implemented functionality are proposed and compared in order to study the area-speed trade-off. Experimental results show the feasibility of the proposed implementation, which achieves a speed hundreds of times faster than the conventional Digital Signal Processor (DSP)-based control platform.Ministerio de Economía y Competitividad TEC2016-78430-RFondo Nacional de Investigación de Qatar NPRP 9-310-2-13

    High-Level Design for Ultra-Fast Software Defined Radio Prototyping on Multi-Processors Heterogeneous Platforms

    Get PDF
    International audienceThe design of Software Defined Radio (SDR) equipments (terminals, base stations, etc.) is still very challenging. We propose here a design methodology for ultra-fast prototyping on heterogeneous platforms made of GPPs (General Purpose Processors), DSPs (Digital Signal Processors) and FPGAs (Field Programmable Gate Array). Lying on a component-based approach, the methodology mainly aims at automating as much as possible the design from an algorithmic validation to a multi-processing heterogeneous implementation. The proposed methodology is based on the SynDEx CAD design approach, which was originally dedicated to multi-GPPs networks. We show how this was changed so that it is made appropriate with an embedded context of DSP. The implication of FPGAs is then addressed and integrated in the design approach with very little restrictions. Apart from a manual HW/SW partitioning, all other operations may be kept automatic in a heterogeneous processing context. The targeted granularity of the components, which are to be assembled in the design flow, is roughly the same size as that of a FFT, a filter or a Viterbi decoder for instance. The re-use of third party or pre-developed IPs is a basis for this design approach. Thanks to the proposed design methodology it is possible to port "ultra" fast a radio application over several platforms. In addition, the proposed design methodology is not restricted to SDR equipment design, and can be useful for any real-time embedded heterogeneous design in a prototyping context

    FPGA design methodology for industrial control systems—a review

    Get PDF
    This paper reviews the state of the art of fieldprogrammable gate array (FPGA) design methodologies with a focus on industrial control system applications. This paper starts with an overview of FPGA technology development, followed by a presentation of design methodologies, development tools and relevant CAD environments, including the use of portable hardware description languages and system level programming/design tools. They enable a holistic functional approach with the major advantage of setting up a unique modeling and evaluation environment for complete industrial electronics systems. Three main design rules are then presented. These are algorithm refinement, modularity, and systematic search for the best compromise between the control performance and the architectural constraints. An overview of contributions and limits of FPGAs is also given, followed by a short survey of FPGA-based intelligent controllers for modern industrial systems. Finally, two complete and timely case studies are presented to illustrate the benefits of an FPGA implementation when using the proposed system modeling and design methodology. These consist of the direct torque control for induction motor drives and the control of a diesel-driven synchronous stand-alone generator with the help of fuzzy logic

    HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis

    Full text link
    Machine Learning (ML) has been widely adopted in design exploration using high level synthesis (HLS) to give a better and faster performance, and resource and power estimation at very early stages for FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models.This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated from widely used HLS C benchmarks including Polybench, Machsuite, CHStone and Rossetta. The Verilog samples are generated with a variety of directives including loop unroll, loop pipeline and array partition to make sure optimized and realistic designs are covered. The total number of generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the effectiveness of our dataset, we undertake case studies to perform power estimation and resource usage estimation with ML models trained with our dataset. All the codes and dataset are public at the github repo.We believe that HLSDataset can save valuable time for researchers by avoiding the tedious process of running tools, scripting and parsing files to generate the dataset, and enable them to spend more time where it counts, that is, in training ML models.Comment: 8 pages, 5 figure
    corecore