Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large-scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
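One transformation from the classes the abstract describes is accumulation interleaving, which resolves a loop-carried dependency so a reduction can be pipelined. The Python below is a minimal software model of the principle, not HLS code; the function names and the interleaving factor are our own illustration:

```python
def reduce_naive(data):
    # Loop-carried dependency: each iteration must wait for the previous
    # accumulation to finish, which in hardware forces an initiation
    # interval equal to the adder latency.
    acc = 0.0
    for x in data:
        acc += x
    return acc

def reduce_interleaved(data, k=4):
    # Transformation: split the accumulation across k independent partial
    # sums, so consecutive iterations update different registers and a
    # pipelined adder with up-to-k-cycle latency can accept one input
    # per cycle.
    partial = [0.0] * k
    for i, x in enumerate(data):
        partial[i % k] += x
    return sum(partial)  # short final combine stage
```

The interleaved version trades a small amount of extra state (k registers) for a fully pipelined main loop, which is the kind of resource/performance trade-off the transformations in this work make explicit.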
FBLAS: Streaming Linear Algebra on FPGA
Spatial computing architectures pose an attractive alternative to mitigate
control and data movement overheads typical of load-store architectures. In
practice, these devices are rarely considered in the HPC community due to the
steep learning curve, low productivity and lack of available libraries for
fundamental operations. High-level synthesis (HLS) tools are facilitating
hardware programming, but optimizing for these architectures requires factoring
in new transformations and resources/performance trade-offs. We present FBLAS,
an open-source HLS implementation of BLAS for FPGAs that enables reusability,
portability and easy integration with existing software and hardware codes.
FBLAS' implementation allows scaling hardware modules to exploit on-chip
resources, and module interfaces are designed to natively support streaming
on-chip communications, allowing them to be composed to reduce off-chip
communication. With FBLAS, we set a precedent for FPGA library design, and
contribute to the toolbox of customizable hardware components necessary for HPC
codes to start productively targeting reconfigurable platforms.
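The streaming composition that FBLAS's module interfaces enable can be modeled with Python generators standing in for on-chip FIFO channels; the module names below are illustrative, not FBLAS's actual API:

```python
def stream_scal(alpha, xs):
    # Module: scales an incoming stream element by element (BLAS SCAL),
    # forwarding each result over an on-chip channel instead of writing
    # the intermediate vector back to off-chip memory.
    for x in xs:
        yield alpha * x

def stream_dot(xs, ys):
    # Module: consumes two streams and reduces them to a scalar (BLAS DOT).
    return sum(x * y for x, y in zip(xs, ys))

# Composition: scal feeds dot directly, so the scaled vector never touches
# off-chip memory -- the point of designing streaming module interfaces.
result = stream_dot(stream_scal(2.0, [1.0, 2.0, 3.0]), [1.0, 1.0, 1.0])
```

In hardware, each generator would correspond to a replicable module and each `yield` to a FIFO write, so chaining modules removes entire off-chip round trips.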
Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
The multi-pumping resource sharing technique can overcome the limitations
commonly found in single-clocked FPGA designs by allowing hardware components
to operate at a higher clock frequency than the surrounding system. However,
this optimization cannot be expressed in high levels of abstraction, such as
HLS, requiring the use of hand-optimized RTL. In this paper we show how to
leverage multiple clock domains for computational subdomains on reconfigurable
devices through data movement analysis on high-level programs. We offer a novel
view on multi-pumping as a compiler optimization: a superclass of traditional
vectorization. As multiple data elements are fed and consumed, the computations
are packed temporally rather than spatially. The optimization is applied
automatically using an intermediate representation that maps high-level code to
HLS. Internally, the optimization injects modules into the generated designs,
incorporating RTL for fine-grained control over the clock domains. We obtain a
reduction of resource consumption by up to 50% on critical components and 23%
on average. For scalable designs, this can enable further parallelism,
increasing overall performance.
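The "packed temporally rather than spatially" idea can be sketched with a toy software model. Both functions below compute the same result; they differ only in whether a chunk of elements maps to parallel units in one slow cycle or to repeated fast cycles on one unit. This is our conceptual illustration, not the compiler's intermediate representation:

```python
def spatial_vectorize(data, width=2):
    # Spatial vectorization: 'width' parallel multipliers each consume one
    # element of the chunk in the same (slow) clock cycle.
    out = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        out.extend(x * x for x in chunk)  # width physical units
    return out

def temporal_vectorize(data, width=2):
    # Temporal vectorization (multi-pumping): a single multiplier clocked
    # 'width' times faster processes the same chunk over 'width' fast
    # sub-cycles, matching throughput with a fraction of the units.
    out = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        for x in chunk:  # width fast cycles, one physical unit
            out.append(x * x)
    return out
```

The resource savings reported in the paper come from exactly this substitution on critical components, applied automatically where the data-movement analysis shows the fast clock domain can be fed.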
Python FPGA Programming with Data-Centric Multi-Level Design
Although high-level synthesis (HLS) tools have significantly improved
programmer productivity over hardware description languages, developing for
FPGAs remains tedious and error prone. Programmers must learn and implement a
large set of vendor-specific syntax, patterns, and tricks to optimize (or even
successfully compile) their applications, while dealing with ever-changing
toolflows from the FPGA vendors. We propose a new way to develop, optimize, and
compile FPGA programs. The Data-Centric parallel programming (DaCe) framework
allows applications to be defined by their dataflow and control flow through
the Stateful DataFlow multiGraph (SDFG) representation, capturing the abstract
program characteristics, and exposing a plethora of optimization opportunities.
In this work, we show how extending SDFGs with multi-level Library Nodes
incorporates both domain-specific and platform-specific optimizations into the
design flow, enabling knowledge transfer across application domains and FPGA
vendors. We present the HLS-based FPGA code generation backend of DaCe, and
show how SDFGs are code generated for either FPGA vendor, emitting efficient
HLS code that is structured and annotated to implement the desired
architecture.
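A toy stand-in for the dataflow idea behind SDFGs (a hand-rolled graph executor, not DaCe's actual API, which additionally has states, memlets, and maps) might look like:

```python
# Each node is a named operation with explicit data dependencies; a
# backend can schedule or code-generate the graph purely from this
# structure, since all data movement is explicit.
graph = {
    "load_a":   (lambda env: 3.0, []),
    "load_b":   (lambda env: 4.0, []),
    "multiply": (lambda env: env["load_a"] * env["load_b"],
                 ["load_a", "load_b"]),
}

def execute(graph):
    # Repeatedly fire any node whose dependencies have produced values --
    # a software analogue of scheduling a dataflow graph.
    env = {}
    remaining = dict(graph)
    while remaining:
        for name, (op, deps) in list(remaining.items()):
            if all(d in env for d in deps):
                env[name] = op(env)
                del remaining[name]
    return env

env = execute(graph)
```

Because dependencies are explicit rather than implied by statement order, an FPGA backend can map independent nodes to concurrently executing hardware modules, which is what the SDFG representation exposes.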
Co-design Hardware and Algorithm for Vector Search
Vector search has emerged as the foundation for large-scale information
retrieval and machine learning systems, with search engines like Google and
Bing processing tens of thousands of queries per second on petabyte-scale
document datasets by evaluating vector similarities between encoded query texts
and web documents. As performance demands for vector search systems surge,
accelerated hardware offers a promising solution in the post-Moore's Law era.
We introduce FANNS, an end-to-end and scalable vector search framework on
FPGAs. Given a user-provided recall requirement on a dataset and a hardware
resource budget, FANNS automatically co-designs hardware and algorithm,
subsequently generating the corresponding accelerator. The framework also
supports scale-out by incorporating a hardware TCP/IP stack in the
accelerator. FANNS attains up to 23.0× and 37.2× speedup compared to FPGA
and CPU baselines, respectively, and demonstrates superior scalability to
GPUs, achieving 5.5× and 7.6× speedup in median and 95th percentile (P95)
latency in an eight-accelerator configuration. The remarkable performance of
FANNS lays a robust groundwork for future FPGA integration in data centers
and AI supercomputers.
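As a rough sketch of the algorithm family such accelerators target, here is a minimal inverted-file (IVF) style approximate search in plain Python: rank coarse centroids, then scan only the few closest buckets instead of the whole dataset. The function names and parameters are our illustrative assumptions, not the FANNS interface:

```python
def dot(u, v):
    # Inner-product similarity between two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def ivf_search(query, centroids, buckets, nprobe=1):
    # Coarse step: rank cluster centroids by similarity to the query.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: -dot(query, centroids[c]))
    # Fine step: scan only the nprobe most promising buckets, trading
    # recall for work -- the knob a co-design framework tunes against a
    # recall requirement and a hardware budget.
    best_vec, best_score = None, float("-inf")
    for c in ranked[:nprobe]:
        for vec in buckets[c]:
            s = dot(query, vec)
            if s > best_score:
                best_vec, best_score = vec, s
    return best_vec
```

Raising `nprobe` increases recall but also the number of distance evaluations, which is why fixing a recall target lets hardware and algorithm parameters be chosen together.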
AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
Stencil computation is one of the most widely-used compute patterns in high
performance computing applications. Spatial and temporal blocking have been
proposed to overcome the memory-bound nature of this type of computation by
moving memory pressure from external memory to on-chip memory on GPUs. However,
correctly implementing those optimizations while considering the complexity of
the architecture and memory hierarchy of GPUs to achieve high performance is
difficult. We propose AN5D, an automated stencil framework which is capable of
automatically transforming and optimizing stencil patterns in a given C source
code, and generating corresponding CUDA code. Parameter tuning in our framework
is guided by our performance model. Our novel optimization strategy reduces
shared memory and register pressure in comparison to existing implementations,
allowing performance scaling up to a temporal blocking degree of 10. We achieve
the highest performance reported so far for all evaluated stencil benchmarks on
the state-of-the-art Tesla V100 GPU.
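The temporal-blocking idea can be sketched in software for a 1-D stencil: fuse several sweeps per pass so the intermediate grid stays in fast memory rather than round-tripping through external memory after every time step. This is a conceptual model with illustrative names, not AN5D's generated CUDA:

```python
def step(a):
    # One 3-point averaging sweep with fixed boundary values.
    return ([a[0]]
            + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
               for i in range(1, len(a) - 1)]
            + [a[-1]])

def naive(a, t):
    # Baseline: every sweep reads and writes the full grid in slow memory.
    for _ in range(t):
        a = step(a)
    return a

def temporally_blocked(a, t, degree=2):
    # Degree-d temporal blocking: apply d fused sweeps per pass, so the
    # intermediate grids live only in a local variable (standing in for
    # registers / shared memory) between external-memory accesses.
    full, rem = divmod(t, degree)
    for _ in range(full):
        tmp = a
        for _ in range(degree):  # d fused steps, intermediates stay local
            tmp = step(tmp)
        a = tmp
    for _ in range(rem):
        a = step(a)
    return a
```

In a real GPU implementation the fused steps must also carry halo regions per tile, which is where register and shared-memory pressure comes from and what the framework's optimization strategy reduces.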
Productive FPGA Programming for High-Performance Computing
For decades, the computational performance of processors has grown at a faster rate than the available memory bandwidth. As a result, most transistors in modern processors are spent on managing data movement via caches and registers. Spatial computing architectures can omit general-purpose caches, registers, and control logic by implementing application-specific dataflow, where computations are laid out spatially. Programmable spatial architectures, such as FPGAs, can realize such application-specific dataflow, but the steep learning curve of hardware programming prevents widespread adoption in high-performance computing (HPC). In this dissertation, we address this programmability gap. High-level synthesis (HLS) has increased productivity when designing FPGA architectures, but traditional software optimizations are insufficient to implement high-performance hardware architectures. To alleviate this, we present a set of key transformations for HLS, targeting scalable architectures for HPC applications, identifying classes of transformations and their effects in hardware, and boosting the productivity of HLS developers with hlslib, an open-source project of productivity tools. Using these techniques, we present a model-based, end-to-end example of optimizing matrix multiplication for FPGAs, which yields competitive performance in practice and is published as an open-source project. Venturing beyond HLS, we propose a new way to develop, optimize, and compile FPGA programs. The Data-Centric parallel programming (DaCe) framework allows applications to be defined by their dataflow and control flow through the Stateful DataFlow multiGraph (SDFG) representation, exposing a plethora of optimization opportunities. We unify general, domain-specific, and platform-specific optimizations in this flow, and present the FPGA backends of DaCe, emitting efficient HLS code for both Xilinx and Intel devices.
Building on this infrastructure, we present StencilFlow, an end-to-end framework that maps general directed acyclic graphs of heterogeneous stencil operators to distributed FPGA architectures, maximizing temporal locality and ensuring deadlock freedom. We show the highest performance recorded for stencil programs for either FPGA vendor to date, and study a complex stencil program from a production weather simulation application. With the toolbox of transformations, open-source software, and programming abstractions provided in this dissertation, we contribute to the productivity of HLS developers, performance engineers, domain scientists, and compiler engineers alike, bridging the gap for bringing spatial computing systems into the mainstream of HPC.
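The blocking that underlies the matrix-multiplication case study can be sketched in plain Python: each tile of the output is accumulated from tiles of the inputs small enough to fit in on-chip memory, maximizing data reuse per off-chip access. This is a software model with illustrative names, not the FPGA design itself:

```python
def matmul_tiled(A, B, tile=2):
    # Tiled matrix multiply: iterate over tile-sized blocks of C, A, and
    # B so that each block (standing in for on-chip buffers) is reused
    # across many arithmetic operations before being evicted.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        acc = C[i][j]
                        for p in range(kk, min(kk + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

On an FPGA, the tile size becomes a first-class design parameter: it trades on-chip memory consumption against off-chip bandwidth, which is exactly the kind of trade-off a performance model can resolve before building the hardware.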