Search CORE

499 research outputs found

Exploiting Outer Loops Vectorization in High Level Synthesis

Author: A Cilardo
A Morvan
A Papakonstantinou
J Cong
SA Mahlke
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015
Field of study

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Exploiting Vectorization in High Level Synthesis of Nested Irregular Loops

Author: Ferrandi Fabrizio
Lattuada Marco
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017
Field of study

Synthesis of DoAll loops is a key aspect of High Level Synthesis since they allow to easily exploit the potential parallelism provided by programmable devices. This type of parallelism can be implemented in several ways: by duplicating the implementation of body loop, by exploiting loop pipelining or by applying vectorization. In this paper a methodology for the synthesis of nested irregular DoAll loops based on outer vectorization is proposed. The methodology transforms the intermediate representation of the DoAll loop to introduce vectorization and it can be easily integrated in existing state of the art High Level Synthesis flows since does not require any modification in the rest of the flow. Vectorization is not limited to perfectly nested countable loops: conditional constructs and loops with variable number of iterations are supported. Experimental results on parallel benchmarks show that the generated parallel accelerators have significant speed-up with limited penalties in terms of resource usage and frequency decrement

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Transformations of High-Level Synthesis Codes for High-Performance Computing

Author: Besta Maciej
Hoefler Torsten
Licht Johannes de Fine
Meierhans Simon
Publication venue
Publication date: 29/10/2019
Field of study

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

arXiv.org e-Print Archive

Repository for Publications and Research Data

Three-dimensional memory vectorization for high bandwidth media memory systems

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2002
Field of study

Vector processors have good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses the problem of the memory bandwidth. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization at the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in the effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide a high robustness to memory latency and a significant reduction of the cache activity, thus reducing power and energy requirements. At the investment of a 50% more area than a regular SIMD register file, we have measured and average speed-up of 13% and the potential for power savings in the L2 cache of a 30%.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

FBLAS: Streaming Linear Algebra on FPGA

Author: De Matteis Tiziano
Hoefler Torsten
Licht Johannes de Fine
Publication venue: 'Test accounts'
Publication date: 01/01/2020
Field of study

Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity and lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures requires factoring in new transformations and resources/performance trade-offs. We present FBLAS, an open-source HLS implementation of BLAS for FPGAs, that enables reusability, portability and easy integration with existing software and hardware codes. FBLAS' implementation allows scaling hardware modules to exploit on-chip resources, and module interfaces are designed to natively support streaming on-chip communications, allowing them to be composed to reduce off-chip communication. With FBLAS, we set a precedent for FPGA library design, and contribute to the toolbox of customizable hardware components necessary for HPC codes to start productively targeting reconfigurable platforms

arXiv.org e-Print Archive

Repository for Publications and Research Data

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Feedback Driven Annotation and Refactoring of Parallel Programs

Author: Larsen Per
Publication venue: Technical University of Denmark
Publication date: 01/01/2011
Field of study

Online Research Database In Technology