Search CORE

435 research outputs found

C++ coding principles for high-level synthesis

Author: Vuopio H. (Heikki)
Publication venue: University of Oulu
Publication date: 18/08/2021
Field of study

Abstract. High-level synthesis (HLS) raises the level of abstraction on digital integrated circuit design from traditional register transfer level (RTL) to behavioural system description level. This methodology offers great advantages such as increased designer productivity. The adoption of HLS, however, has been slowed down by the RTL code mistakenly generated with HLS which potentially results in poor quality compared to the traditional hand-written RTL. This thesis aims to solve this problem by finding the best programming practices for hardware-oriented C++. A digital downconverter and decimator are designed and implemented with Catapult HLS as a case study, where different coding practises are experimented with, and the best ones are generalized and presented. The quality of results of this case study is compared against a hand-written RTL design of the same intellectual property created by other designers. A few examples are presented as well demonstrating that small changes in the source code might have a major effect on the generated RTL. It is found that understanding how the HLS tool analyses the source code and executes operations in parallel greatly helps to improve the quality of results in the generated hardware. Also, by having a clear target architecture it is a simple task to verify the hardware in Catapult analysis views such as schedule and schematic view. By optimizing the source code, it is possible to generate similar quality hardware compared to traditional RTL flow. In this case, the area of the HLS design is about 19 % smaller than the RTL design with the same throughput, slightly lower latency, and roughly the same power consumption. C++ ohjelmointikäytännöt korkean tason synteesiin. Tiivistelmä. Korkean tason synteesi (HLS) nostaa digitaalisten integroitujen piirien suunnittelun abstraktiotason perinteiseltä rekisterinsiirtotasolta (RTL) systeemikuvaustasolle. Tämä metodologia tuo suuria etuja, kuten suunnittelijan korkeampi tuotteliaisuus. HLS:n laajempaa käyttöönottoa on kuitenkin hidastanut erheellisesti HLS:llä generoitu RTL-koodi, josta usein seuraa heikohko laatu käsin kirjoitettuun RTL-koodiin verrattuna. Tämän tutkimuksen tavoite on ratkaista tämä ongelma löytämällä parhaat ohjelmointikäytännöt korkean tason synteesiin suunnattuun C++-ohjelmointiin. Digitaalinen alasmuunnin ja desimaattori suunnitellaan ja implementoidaan käyttäen Catapult HLS-työkalua. Eri ohjelmointikäytäntöjä testataan ja parhaat yleistetään ja esitellään, minkä jälkeen tulosten laatua verrataan samaan lohkoon, jonka on ohjelmoinut eri suunnittelijat rekisterinsiirtotasolla. Tutkimus sisältää myös koodiesimerkkejä siitä, miten pienet muutokset lähdekoodissa voivat vaikuttaa merkittävästi lopputulokseen. Tutkimuksessa todetaan, että synteesityökalun toiminnan ymmärtäminen on kriittistä hyvien tulosten saavuttamisen kannalta. Suunnittelijalla tulisi olla selvä tavoitearkkitehtuuri generoitavasta RTL-koodista, jolloin sen varmentaminen synteesin jälkeen olisi helppoa Catapultin analyysinäkymissä. Optimoimalla lähdekoodia generoidun RTL-koodin tulosten laatu saadaan samaksi kuin käsin kirjoitetun RTL-koodin. Tässä tapauksessa generoidun RTL-koodin pinta-ala on 19 % pienempi kuin käsin kirjoitetun mallin samalla siirtonopeudella. Latenssi on hieman pienempi ja tehonkulutus samaa suuruusluokkaa

University of Oulu Repository - Jultika

Design of testbed and emulation tools

Author: Flynn M. J.
Lundstrom S. F.
Publication venue
Publication date
Field of study

The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems

NASA Technical Reports Server

Recommended from our members

Memory-Based High-Level Synthesis Optimizations Security Exploration on the Power Side-Channel

Author: Blackstone Jeremy
Hu Wei
Kastner Ryan
Mu Dejun
Tai Yu
Zhang Lu
Publication venue: eScholarship, University of California
Publication date: 01/10/2020
Field of study

High-level synthesis (HLS) allows hardware designers to think algorithmically and not worry about low-level, cycle-by-cycle details. This provides the ability to quickly explore the architectural design space and tradeoffs between resource utilization and performance. Unfortunately, security evaluation is not a standard part of the HLS design flow. In this article, we aim to understand the effects of memory-based HLS optimizations on power side-channel leakage. We use Xilinx Vivado HLS to develop different cryptographic cores, implement them on a Spartan-6 FPGA, and collect power traces. We evaluate the designs with respect to resource utilization, performance, and information leakage through power consumption. We have two important observations and contributions. First, the choice of resource optimization directive results in different levels of side-channel vulnerabilities. Second, the partitioning optimization directive can greatly compromise the hardware cryptographic system through power side-channel leakage due to the deployment of memory control logic. We describe an evaluation procedure for power side-channel leakage and use it to make best-effort recommendations about how to design more secure architectures in the cryptographic domain

eScholarship - University of California

Design of data buffers in field programmablr gate arrays

Author: Molchanova Anastasiia
Mozghovyi Ivan
Sergiyenko Anatoliy
Serhiienko Pavlo
Publication venue: 'Kyiv Politechnic Institute'
Publication date: 01/01/2022
Field of study

The design of the data buffers for the field programable gate array (FPGA) projects is considered. A new method of buffer design is proposed, which is based on the representation of the synchronous dataflow graph in the three-dimensional space, optimization of them, and description in VHDL. The method gives the optimized buffers which are based either on RAM or on the register pipeline. The derived pipeline buffer can be mapped into the shift register primitive of FPGA. The method is built in the experimental SDFCAD framework intended for the pipelined datapath synthesis

Electronic Archive of Kyiv Polytechnic Institute

Compiling for an Heterogeneous Vector Image Processor

Author: Coelho Fabien
Irigoin François
Publication venue: HAL CCSD
Publication date: 01/01/2011
Field of study

International audienceWe present a new compilation strategy, implemented at a small cost, to optimize image applications developed on top of a high level image processing library for an heterogeneous processor with a vector image processing accelerator. The library provides the semantics of the image computations. The pipelined structure of the accelerator allows to compute whole expressions with dozens of elementary image instructions, but is constrained as intermediate image values cannot be extracted. We adapted standard compilation techniques to perform this task automatically. Our strategy is implemented in PIPS, a source-to-source compiler which greatly reduces the development cost as standard phases are reused and parameterized for the target. Experiments were run on the hardware functional simulator. We compile 1217 cases, from elementary tests to full applications. All are optimal but a few which are mostly within a mere accelerator call of optimality. Our contribu- tions include: 1) a general low cost compilation strategy for image processing applications, based on the semantics provided by library calls, which improves locality by an order of magnitude; 2) a specific heuristic to minimize execution time on the target vector accelerator; 3) numerous experiments that show the effectiveness of our strategy

CiteSeerX

HAL-MINES ParisTech

Implementation and Performance Modeling of Deterministic Particle Transport (Sweep3D) on the IBM Cell/B.E.

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2009
Field of study

Crossref

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

AGAMOS: A graph-based approach to modulo scheduling for clustered microarchitectures

Author: Aleta Ortega Alexandre
Codina Viñas Josep M.
González Colás Antonio María
Kaeli D
Sánchez Navarro F. Jesús
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009
Field of study

This paper presents AGAMOS, a technique to modulo schedule loops on clustered microarchitectures. The proposed scheme uses a multilevel graph partitioning strategy to distribute the workload among clusters and reduces the number of intercluster communications at the same time. Partitioning is guided by approximate schedules (i.e., pseudoschedules), which take into account all of the constraints that influence the final schedule. To further reduce the number of intercluster communications, heuristics for instruction replication are included. The proposed scheme is evaluated using the SPECfp95 programs. The described scheme outperforms a state-of-the-art scheduler for all programs and different cluster configurations. For some configurations, the speedup obtained when using this new scheme is greater than 40 percent, and for selected programs, performance can be more than doubled.Peer ReviewedPostprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

Author: H Rong
KD Cooper
KD Cooper
KD Cooper
KD Cooper
L Almagor
L Renganarayanan
L Wang
M Stephenson
MD Smith
P Kulkarni
PA Kulkarni
PMW Knijnenburg
RC Whaley
S Hack
S Kulkarni
S Rubin
Vasilios Kelefouras
VI Kelefouras
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/01/2017
Field of study

Today’s compilers have a plethora of optimizations-transformations to choose from, and the correct choice, order as well parameters of transformations have a significant/large impact on performance; choosing the correct order and parameters of optimizations has been a long standing problem in compilation research, which until now remains unsolved; the separate sub-problems optimization gives a different schedule/binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution is now capable to be found. The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and gcc/icc compilers, on both embedded and general purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time

Crossref

Sheffield Hallam University Research Archive

Plymouth Electronic Archive and Research Library

White Rose Research Online