35 research outputs found

    A Survey and Evaluation of FPGA High-Level Synthesis Tools

    Get PDF
    High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources

    Code transformations based on speculative SDC scheduling

    Get PDF
    Code motion and speculations are usually exploited in the High Level Synthesis of control dominated applications to improve the performances of the synthesized designs. Selecting the transformations to be applied is not a trivial task: their effects can indeed indirectly spread across the whole design, potentially worsening the quality of the results. In this paper we propose a code transformation flow, based on a new extension of the System of Difference Constraints (SDC) scheduling algorithm, which introduces a large number of transformations, whose profitability is guaranteed by SDC formulation. Experimental results show that the proposed technique in average reduces the execution time of control dominated applications by 37% with respect to a commercial tool without increasing the area usage

    Optimal Checkpointing for Secure Intermittently-Powered IoT Devices

    Full text link
    Energy harvesting is a promising solution to power Internet of Things (IoT) devices. Due to the intermittent nature of these energy sources, one cannot guarantee forward progress of program execution. Prior work has advocated for checkpointing the intermediate state to off-chip non-volatile memory (NVM). Encrypting checkpoints addresses the security concern, but significantly increases the checkpointing overheads. In this paper, we propose a new online checkpointing policy that judiciously determines when to checkpoint so as to minimize application time to completion while guaranteeing security. Compared to state-of-the-art checkpointing schemes that do not account for the overheads of encrypted checkpoints we improve execution time up to 1.4x.Comment: ICCAD 201

    Flot de conception automatique pour circuits commutables

    No full text
    National audienceLes FPGA, ou puces reconfigurables, n’ont pas cessé d’évoluer depuis leur création et sont désormais utilisés dans des systèmes complets (Xilinx Zynq ou Altera Stratix). Malgré tout, il reste de nombreux champs applicatifs desquels ils sont absents, et à tort. Utiliser les FPGA de manière plus intense au sein de systèmes complets est possible, mais il faut pour cela développer les capacités multi-utilisateurs de ces plateformes. Donner la capacité à une application s’exécutant sur un FPGA de se stopper pour, par exemple, laisser s’exécuter d’autres applications jugées prioritaires est particulièrement intéressant. Une telle action est qualifiée de « changement de contexte » (en anglais context-switch).Dans cet article, nous présentons une méthode et un outil permettant de donner cette capacité à des circuits fonctionnant sur cible reconfigurable. Le flot de conception présenté s’appuie sur un logiciel de synthèse de haut niveau et offre automatiquement la capacité de commutation aux circuits synthétisés. Les expériences menées sur un panel de circuits classiques montrent que l’ajout de cette capacité à un coût relativement faible ainsi qu’une rapidité de commutation sans égale dans la littérature

    System-Level Leakage Power Estimation Model for ASIC Designs

    Get PDF
    With advances in CMOS- technology and sub-micron process, leakage power dissipation has become a critical design metric. To incorporate more functions, designs are getting complex, thereby increases leakage power dissipation. Low power design objective requires early exploration and estimation. In this paper, we present the power estimation models for ASIC (Application Specific Integrated Circuit) based designs at the C-level of abstraction. The method includes analysis and extraction of the application specific information from the LLVM (Low-Level Virtual Machine) bit-code; which further applies to train the neural network. The trained model is applied in the estimation of the leakage power. Estimation of design power using our models is compared to the implemented measurement, which demonstrates its accuracy. In addition, the proposed methodology is significantly quicker and abolishes the need of synthesis based exploration

    Performance and area evaluations of processor-based benchmarks on FPGA devices

    Get PDF
    The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology. A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area. For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs

    An empirical evaluation of High-Level Synthesis languages and tools for database acceleration

    Get PDF
    High Level Synthesis (HLS) languages and tools are emerging as the most promising technique to make FPGAs more accessible to software developers. Nevertheless, picking the most suitable HLS for a certain class of algorithms depends on requirements such as area and throughput, as well as on programmer experience. In this paper, we explore the different trade-offs present when using a representative set of HLS tools in the context of Database Management Systems (DBMS) acceleration. More specifically, we conduct an empirical analysis of four representative frameworks (Bluespec SystemVerilog, Altera OpenCL, LegUp and Chisel) that we utilize to accelerate commonly-used database algorithms such as sorting, the median operator, and hash joins. Through our implementation experience and empirical results for database acceleration, we conclude that the selection of the most suitable HLS depends on a set of orthogonal characteristics, which we highlight for each HLS framework.Peer ReviewedPostprint (author’s final draft

    Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis

    Get PDF
    High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are only observable on-chip. Debugging support for circuits generated with HLS is receiving a considerable attention. Among the data that can be collected on chip for debugging, one of the most important is the state of the Finite State Machines (FSM) controlling the components of the circuit. However, this usually requires a large amount of memory to trace the behavior during the execution. This work proposes an approach that takes advantage of the HLS information and of the structure of the FSM to compress control flow traces and to integrate optimized components for on-chip debugging. The generated checkers analyze the FSM execution on-fly, automatically notifying when a bug is detected, localizing it and providing data about its cause. The traces are compressed using a software profiling technique, called Efficient Path Profiling (EPP), adapted for the debugging of hardware accelerators generated with HLS. With this technique, the size of the memory used to store control flow traces can be reduced up to 2 orders of magnitude, compared to state-of-the-art

    Siirtoliipaisuarkkitehtuurin muuttuvanmittaisten käskyjen pakkaus

    Get PDF
    The Static Random-Access Memory (SRAM) modules used for embedded microprocessor devices consume a large portion of the whole system’s power. The memory module consumes static power on keeping awake and dynamic power on memory accesses. The power dissipation of the instruction memory can be limited by using code compression methods, which reduce the memory size. The compression may require the use of variable length instruction formats in the processor. The power-efficient design of variable length instruction fetch and decode units is challenging for static multiple-issue processors, because such architectures have simple hardware to begin with, as they aim for very low power consumption on embedded platforms. The power saved by using these compression approaches, which necessitate more complex logic, is easily lost on inefficient processor design. This thesis proposes an implementation for instruction template-based compression, its decompression and two instruction fetch design alternatives for variable length instruction encoding on Transport Triggered Architecture (TTA), a static multiple-issue exposed data path architecture. Both of the new fetch and decode units are integrated into the TTA-based Co-design Environment (TCE), which is a toolset for rapid designing and prototyping of processors based on TTA. The hardware description of the fetch units is verified on a register transfer level and benchmarked using the CHStone test suite. Furthermore, the fetch units are synthesized on a 40 nm standard cell Application Specific Integrated Circuit (ASIC) technology library for area, performance and power consumption measurements. The power cost of the variable length instruction support is compared to the power savings from memory reduction, which is evaluated using HP Labs’ CACTI tool. The compression approach reaches an average program size reduction of 44% at best with a set of test programs, and the total power consumption of the system is reduced. The thesis shows that the proposed variable length fetch designs are sufficiently low-power oriented for TTA processors to benefit from the code compression
    corecore