9 research outputs found
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
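As an illustration of the kind of data-reuse transformation the abstract alludes to, the following is a minimal sketch in plain C++ and is not taken from the paper. It computes a 3-point averaging stencil through a small sliding window, so each input element is loaded once instead of three times. The function name `stencil3_window` and the choice to leave the boundary elements at zero are assumptions made for this example; an HLS tool would typically map the window onto registers and pipeline the loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: 3-point averaging stencil with a sliding window.
// Each input element is read from `in` exactly once; the two neighbours
// needed by the stencil are kept in the local `window` array.
// Boundary outputs (first and last element) are left at zero by assumption.
std::vector<float> stencil3_window(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> out(n, 0.0f);
    if (n < 3) {
        return out;  // no interior points to compute
    }
    // Pre-load the window so that the first shift yields {in[0], in[1], in[2]}.
    float window[3] = {0.0f, in[0], in[1]};
    for (std::size_t i = 2; i < n; ++i) {
        // Shift the window by one position and load a single new element.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];
        // The window is now centred on element i - 1.
        out[i - 1] = (window[0] + window[1] + window[2]) / 3.0f;
    }
    return out;
}
```

A naive version would read `in[i - 1]`, `in[i]` and `in[i + 1]` in every iteration; the window trades those repeated reads for a few registers, which is the reuse-versus-resources trade-off the abstract mentions.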
FPGA acceleration of structured-mesh-based explicit and implicit numerical solvers using SYCL
We explore the design and development of structured-mesh-based solvers on current Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) multidimensional tridiagonal solvers based on implicit methods. Both classes of solvers appear as core modules in a wide range of real-world applications, from CFD to financial computing. A general, unified workflow is formulated for synthesizing them on Intel FPGAs, together with predictive analytic models for exploring the design space to obtain near-optimal performance. The performance of designs synthesized with these techniques is benchmarked for two non-trivial applications on an Intel PAC D5005 FPGA card. Results are compared to the performance of optimized parallel implementations of the same applications on an Nvidia V100 GPU. Observed runtimes indicate that the FPGA matches or outperforms the V100 GPU. More importantly, the FPGA solutions consume 59%-76% less energy for their largest configurations, making them highly attractive for solving workloads based on these applications in production settings. The performance model predicts the runtime of designs with less than 5% error for all cases tested, demonstrating its significant utility for design space exploration. With these tools and techniques, we discuss the determinants for a given structured-mesh code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design, how such designs can be codified using SYCL, and the resulting performance.
Developing a support for FPGAs in the Controller parallel programming model
Heterogeneous computing appears to be the solution to achieve ever faster computers capable of solving
bigger and more complex problems in different fields of knowledge. To that end, it integrates accelerators
with different architectures capable of exploiting the features of problems from different perspectives, thus
achieving higher performance.
FPGAs are reconfigurable hardware, i.e., it is possible to modify them after manufacture. This allows
great flexibility and maximum adaptability to the given problem. In addition, they have low power
consumption. These advantages come with the major drawback of more difficult programming with error-prone
HDLs (Hardware Description Languages), such as Verilog or VHDL, and the requirement of advanced
knowledge of digital electronics. In recent years, the main FPGA vendors have concentrated on developing HLS (High
Level Synthesis) tools that allow FPGAs to be programmed with C-like high-level programming languages. This
favoured their adoption by the HPC community and their integration in new supercomputers. However,
the programmer still has to take care of aspects such as management of command queues, launching
parameters or data transfers.
The Controller model is a library to easily manage the coordination, communication and kernel launching
details on hardware accelerators. It transparently exploits their native or vendor-specific programming
models, namely OpenCL and CUDA, and thus achieves high performance
regardless of the compiler. It is intended to enable the programmer to make use of the different available
hardware resources in combination in heterogeneous environments.
This work extends the Controller model through the development of a backend that allows the integration
of FPGAs, keeping the changes over the user-facing interface to the minimum. The experimental
results validate that a significant decrease in programming effort compared to the native OpenCL implementation
is achieved. Similarly, high overlap of computation and communication and a negligible
overhead due to the use of the library are attained.
Grado en Ingeniería Informátic
Evaluation of SLAMBench on a fully heterogeneous system: CPU, GPU and FPGA
In recent years there has been a trend towards using heterogeneous systems to implement applications with very strict performance or energy-consumption requirements that homogeneous systems cannot satisfy. This project presents a study of KinectFusion, a significant computer vision workload, running on a 3-way heterogeneous system composed of a CPU, a GPU and an FPGA. Using the SLAMBench benchmark, we evaluate the impact of several types of optimizations applied to the original KinectFusion implementation and targeted at the FPGA, since it is the device that in theory offers the best performance-to-energy-consumption ratio. Overall, the results show that the main bottleneck on the FPGA is the transfer of the input and output buffers, owing to hardware limitations. However, although the GPU achieves 3.5 times more bandwidth, the FPGA is capable of delivering competitive performance and could be used as an accelerator in a more balanced system. This is thanks to the various optimization techniques that can be applied, with which a speedup of 16.54 over the original implementation was obtained for the case studied. The optimizations with the greatest impact are related to memory alignment and memory access patterns. In contrast, the most important technique that did not pay off is the use of fixed-point representation, since the precision requirements and the overhead of the conversions severely degrade the application's performance. This work lays the groundwork for more detailed future research on the role an FPGA could play in heterogeneous environments and, specifically, on its possible role in a complex algorithm such as KFusion.
High-level FPGA accelerator design for structured-mesh-based numerical solvers
Field Programmable Gate Arrays (FPGAs) have become highly attractive as accelerators due to their low power consumption and re-programmability. However, a key limitation is the time and know-how required to program them. Even with high-level synthesis tools, they still require significant hand-tuned/low-level customizations and design space exploration to gain good performance. The need to program FPGAs using the dataflow programming model, much less well known and practised by the high-performance computing (HPC) community, is a major barrier for adoption for HPC. The underlying motivation of this work is to bridge this gap - attaining near-optimal performance vs the ease of programming. To this end, we target the important class of applications based on structured meshes, focusing on numerical algorithms based on explicit and implicit techniques. We leverage the main characteristics of the application class, its computation-communication pattern and the hardware features. For explicit schemes, characterized by stencil computations, we unify the state-of-the-art techniques such as vectorization and unrolling with a number of new high-gain optimizations such as creating perfect data reuse data-paths, batching and tiling. A key new feature is their applicability to multiple stencil loops enabling the development of real-world workloads. For implicit schemes, we re-evaluate the characteristics of the tridiagonal system solver algorithms for FPGAs and develop a new high throughput batched multi-dimensional tridiagonal system solver library with orders of magnitude better performance than the state-of-the-art. New analytic models are developed to support the solvers, elucidating and modelling the critical path of execution and parameterizing the design. This together with the optimal designs and new library lead to a unified design work-flow for synthesis on FPGAs. 
The new workflow is used to implement a range of applications, from simple single-stencil designs and multiple stencil loops to solvers with real-world utility. They are synthesized on the currently dominant Xilinx and Intel FPGAs. Benchmarking indicates that the FPGAs match or outperform the best GPU implementations, the current best traditional-architecture device solution. Energy savings of over 30% can also be observed. The performance model demonstrates over 85% accuracy. The thesis discusses the determinants for these applications to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design. Finally, we propose initial steps in automating the workflow to be used through a DSL.
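Design and Implementation of Hardware Accelerators Using Multi-Level Parallelization and Application-Oriented Data Layout
Degree type: Doctorate (course-based). Examination committee: (Chair) Professor 稲葉 雅幸, University of Tokyo; Professor 須田 礼仁, University of Tokyo; Professor 五十嵐 健夫, University of Tokyo; Professor 山西 健司, University of Tokyo; Associate Professor 稲葉 真理, University of Tokyo; Lecturer 中山 英樹, University of Tokyo.
University of Tokyo(東京大学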