Search CORE

81 research outputs found

Transformations of High-Level Synthesis Codes for High-Performance Computing

Author: Besta Maciej
Hoefler Torsten
Licht Johannes de Fine
Meierhans Simon
Publication venue
Publication date: 29/10/2019
Field of study

Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

arXiv.org e-Print Archive

Repository for Publications and Research Data

Source-to-Source transformations for efficient SIMD code generation

Author: Berna Juan Alejandro
Jiménez Castells Marta
Llaberia Griñó José M.
Publication venue
Publication date: 01/01/2011
Field of study

In the last years, there has been much effort in commercial compilers to generate efficient SIMD instructions-based code sequences from conventional sequential programs. However, the small numbers of compilers that can automatically use these instructions achieve in most cases unsatisfactory results. Therefore, the code often has to be written manually in assembly language or using compiler built-in functions to achieve high performance. In this work, we present source-to-source transformations that help commercial vectorizing compilers to generate efficient SIMD code. Experimental results show that excellent performance can be achieved. In particular, for the problem of matrix product (SGEMM) we almost achieve as high performance as hand-optimized numerical libraries. Our source-tosource transformations are based on the scalar replacement and unroll and jam transformations presented by Callahan et all. In particular, we extend the use of scalar replacement to vectorial replacement and combine this transformation with unroll and jam and outer loop vectorization to fully exploit the vector register level and thus to help the compiler to generate efficient SIMD code. We will show experimentally the effectiveness of our proposal.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Automatically Optimizing Tree Traversal Algorithms

Author: Jo Youngjoon
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2013
Field of study

Many domains in computer science, from data-mining to graphics to computational astrophysics, focus heavily on irregular applications. In contrast to regular applications, which operate over dense matrices and arrays, irregular programs manipulate and traverse complex data structures like trees and graphs. As irregular applications operate on ever larger datasets, their performance suffers from poor locality and parallelism. Programmers are burdened with the arduous task of manually tuning such applications for better performance. Generally applicable techniques to optimize irregular applications are highly desired, yet scarce. In this dissertation, we argue that, for an important subset of irregular programs which arises in many domains, namely, tree traversal algorithms like Barnes-Hut, nearest neighbor and ray tracing, there exist general techniques to enhance performance. We investigate two sources of performance improvement: locality enhancement and vectorization. Furthermore we demonstrate that these techniques can be automatically applied by an optimizing compiler, relieving programmers of manual, error-prone, application-specific effort. Achieving high performance in many applications requires achieving good locality of reference. We propose two novel transformations called point blocking and traversal splicing, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in tree traversals. We then present a transformation framework called TreeSplicer, that automatically applies these transformations, and uses autotuning techniques to determine appropriate parameters for the transformations. For six benchmark algorithms, we show that a combination of point blocking and traversal splicing can deliver single-thread speedups of up to 8.71 (geometric mean: 2.48), just from better locality. Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately tree algorithms often feature highly diverging traversals which inhibit efficient SIMD utilization, to the point that other, less profitable sources of vectorization must be exploited instead. We propose a dynamic reordering of traversals based on previous behavior, based on the insight that traversals which have behaved similarly so far are likely to behave similarly in the future, and show that this reordering can dramatically improve the SIMD utilization of diverging traversals, close to ideal utilization. We present a transformation framework, SIMTree, which facilitates vectorization of tree algorithms, and demonstrate speedups of up to 6.59 (geometric mean: 2.78). Furthermore our techniques can effectively SIMDize algorithms that prior, manual vectorization attempts could not

Purdue E-Pubs

Scaling and optimizing the Gysela code on a cluster of many-core processors

Author: Asahi Yuuichi
Bigot Julien
Fehér Tamás
Grandgirard Virginie
Latu Guillaume
Publication venue: HAL CCSD
Publication date: 24/09/2018
Field of study

International audienceThe current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which regular programming models such as MPI/OpenMP can be used. This specific hardware offers both large memory bandwidth and large computing resources and is currently available on computing facilities. Many factors impact the performance achieved by applications, one of the key points is the efficient exploitation of SIMD vector units, another one is the memory access pattern. Thus, vectorization and optimization works have been conducted on a plasma turbulence application, namely Gysela. A set of different techniques have been used: loop splitting, inlining, grouping a set of LU solve operations, removing conditionals and some loop nests, auto-tuning of one computation kernel, changing a key numerical scheme – Lagrange interpolation instead of cubic splines. As a result, KNL execution times have been reduced by up to a factor 3 in some configurations. This effort has also permitted to gain a speedup of 2x on Broadwell architecture and 3x on Skylake. Nice scalability curves up to a few thousands cores have been obtained on a strong scaling experiment. Incremental work for vectorizing the Gysela code meant a large payoff without resorting to writing assembly code or using low-level intrinsics

INRIA a CCSD electronic archive server

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Vectorization system for unstructured codes with a Data-parallel Compiler IR

Author: Moll Simon
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2021
Field of study

With Dennard Scaling coming to an end, Single Instruction Multiple Data (SIMD) oﬀers itself as a way to improve the compute throughput of CPUs. One fundamental technique in SIMD code generators is the vectorization of data-parallel code regions. This has applications in outer-loop vectorization, whole-function vectorization and vectorization of explicitly data-parallel languages. This thesis makes contributions to the reliable vectorization of data-parallel code regions with unstructured, reducible control ﬂow. Reducibility is the case in practice where all control-ﬂow loops have exactly one entry point. We present P-LLVM, a novel, full-featured, intermediate representation for vectorizers that provides a semantics for the code region at every stage of the vectorization pipeline. Partial control-ﬂow linearization is a novel partial if-conversion scheme, an essential technique to vectorize divergent control ﬂow. Diﬀerent to prior techniques, partial linearization has linear running time, does not insert additional branches or blocks and gives proved guarantees on the control ﬂow retained. Divergence of control induces value divergence at join points in the control-ﬂow graph (CFG). We present a novel control-divergence analysis for directed acyclic graphs with optimal running time and prove that it is correct and precise under common static assumptions. We extend this technique to obtain a quadratic-time, control-divergence analysis for arbitrary reducible CFGs. For this analysis, we show on a range of realistic examples how earlier approaches are either less precise or incorrect. We present a feature-complete divergence analysis for P-LLVM programs. The analysis is the ﬁrst to analyze stack-allocated objects in an unstructured control setting. Finally, we generalize single-dimensional vectorization of outer loops to multi-dimensional tensorization of loop nests. SIMD targets beneﬁt from tensorization through more opportunities for re-use of loaded values and more eﬃcient memory access behavior. The techniques were implemented in the Region Vectorizer (RV) for vectorization and TensorRV for loop-nest tensorization. Our evaluation validates that the general-purpose RV vectorization system matches the performance of more specialized approaches. RV performs on par with the ISPC compiler, which only supports its structured domain-speciﬁc language, on a range of tree traversal codes with complex control ﬂow. RV is able to outperform the loop vectorizers of state-of-the-art compilers, as we show for the SPEC2017 nab_s benchmark and the XSBench proxy application.Mit dem Ausreizen des Dennard Scalings erreichen die gewohnten Zuwächse in der skalaren Rechenleistung zusehends ihr Ende. Moderne Prozessoren setzen verstärkt auf parallele Berechnung, um den Rechendurchsatz zu erhöhen. Hierbei spielen SIMD Instruktionen (Single Instruction Multiple Data), die eine Operation gleichzeitig auf mehrere Eingaben anwenden, eine zentrale Rolle. Eine fundamentale Technik, um SIMD Programmcode zu erzeugen, ist der Einsatz datenparalleler Vektorisierung. Diese unterliegt populären Verfahren, wie der Vektorisierung äußerer Schleifen, der Vektorisierung gesamter Funktionen bis hin zu explizit datenparallelen Programmiersprachen. Der Beitrag der vorliegenden Arbeit besteht darin, ein zuverlässiges Vektorisierungssystem für datenparallelen Code mit reduziblem Steuerﬂuss zu entwickeln. Diese Anforderung ist für alle Steuerﬂussgraphen erfüllt, deren Schleifen nur einen Eingang haben, was in der Praxis der Fall ist. Wir präsentieren P-LLVM, eine ausdrucksstarke Zwischendarstellung für Vektorisierer, welche dem Programm in jedem Stadium der Transformation von datenparallelem Code zu SIMD Code eine deﬁnierte Semantik verleiht. Partielle Steuerﬂuss-Linearisierung ist ein neuer Algorithmus zur If-Conversion, welcher Sprünge erhalten kann. Anders als existierende Verfahren hat Partielle Linearisierung eine lineare Laufzeit und fügt keine neuen Sprünge oder Blöcke ein. Wir zeigen Kriterien, unter denen der Algorithmus Steuerﬂuss erhält, und beweisen diese. Steuerﬂussdivergenz induziert Divergenz an Punkten zusammenﬂießenden Steuerﬂusses. Wir stellen eine neue Steuerﬂussdivergenzanalyse für azyklische Graphen mit optimaler Laufzeit vor und beweisen deren Korrektheit und Präzision. Wir verallgemeinern die Technik zu einem Algorithmus mit quadratischer Laufzeit für beliebiege, reduzible Steuerﬂussgraphen. Eine Studie auf realistischen Beispielgraphen zeigt, dass vergleichbare Techniken entweder weniger präsize sind oder falsche Ergebnisse liefern. Ebenfalls präsentieren wir eine Divergenzanalyse für P-LLVM Programme. Diese Analyse ist die erste Divergenzanalyse, welche Divergenz in stapelallokierten Objekten unter unstrukturiertem Steuerﬂuss analysiert. Schließlich generalisieren wir die eindimensionale Vektorisierung von äußeren Schleifen zur multidimensionalen Tensorisierung von Schleifennestern. Tensorisierung eröﬀnet für SIMD Prozessoren mehr Möglichkeiten, bereits geladene Werte wiederzuverwenden und das Speicherzugriﬀsverhalten des Programms zu optimieren, als dies mit Vektorisierung der Fall ist. Die vorgestellten Techniken wurden in den Region Vectorizer (RV) für Vektorisierung und TensorRV für die Tensorisierung von Schleifennestern implementiert. Wir zeigen auf einer Reihe von steuerﬂusslastigen Programmen für die Traversierung von Baumdatenstrukturen, dass RV das gleiche Niveau erreicht wie der ISPC Compiler, welcher nur seine strukturierte Eingabesprache verarbeiten kann. RV kann schnellere SIMD-Programme erzeugen als die Schleifenvektorisierer in aktuellen Industriecompilern. Dies demonstrieren wir mit dem nab_s benchmark aus der SPEC2017 Benchmarksuite und der XSBench Proxy-Anwendung

Universaar

Acronym