4,168 research outputs found
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Recommended from our members
Chippe : a system for constraint driven behavioral synthesis
This report describes the Chippe system, gives some background previous work and describes several sample design runs of the system. Also presented are the sources of the design tradeoffs used by Chippe, and overview of the internal design model, and experiences using the system
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements
Enhancing the performance of Decoupled Software Pipeline through Backward Slicing
The rapidly increasing number of cores available in multicore processors does
not necessarily lead directly to a commensurate increase in performance:
programs written in conventional languages, such as C, need careful
restructuring, preferably automatically, before the benefits can be observed in
improved run-times. Even then, much depends upon the intrinsic capacity of the
original program for concurrent execution. The subject of this paper is the
performance gains from the combined effect of the complementary techniques of
the Decoupled Software Pipeline (DSWP) and (backward) slicing. DSWP extracts
threadlevel parallelism from the body of a loop by breaking it into stages
which are then executed pipeline style: in effect cutting across the control
chain. Slicing, on the other hand, cuts the program along the control chain,
teasing out finer threads that depend on different variables (or locations).
parts that depend on different variables. The main contribution of this paper
is to demonstrate that the application of DSWP, followed by slicing offers
notable improvements over DSWP alone, especially when there is a loop-carried
dependence that prevents the application of the simpler DOALL optimization.
Experimental results show an improvement of a factor of ?1.6 for DSWP + slicing
over DSWP alone and a factor of ?2.4 for DSWP + slicing over the original
sequential code
A Standalone FPGA-based Miner for Lyra2REv2 Cryptocurrencies
Lyra2REv2 is a hashing algorithm that consists of a chain of individual
hashing algorithms, and it is used as a proof-of-work function in several
cryptocurrencies. The most crucial and exotic hashing algorithm in the
Lyra2REv2 chain is a specific instance of the general Lyra2 algorithm. This
work presents the first hardware implementation of the specific instance of
Lyra2 that is used in Lyra2REv2. Several properties of the aforementioned
algorithm are exploited in order to optimize the design. In addition, an
FPGA-based hardware implementation of a standalone miner for Lyra2REv2 on a
Xilinx Multi-Processor System on Chip is presented. The proposed Lyra2REv2
miner is shown to be significantly more energy efficient than both a GPU and a
commercially available FPGA-based miner. Finally, we also explain how the
simplified Lyra2 and Lyra2REv2 architectures can be modified with minimal
effort to also support the recent Lyra2REv3 chained hashing algorithm.Comment: 13 pages, accepted for publication in IEEE Trans. Circuits Syst. I.
arXiv admin note: substantial text overlap with arXiv:1807.0576
Recommended from our members
Percolation-based compiling for evaluation of parallelism and hardware design trade-offs
This thesis investigates parallelism and hardware design trade-offs of parallel and pipelined architectures. To explore these trade-offs we developed a retargetable compiler based on a set of powerful code transformations called Percolation Scheduling (PS) that map programs with real-time constraints and/or massive time requirements onto synchronous, parallel, high-performance or semi-custom architectures.High-performance is achieved through extraction of application inherent fine-grain parallelism and the use of a suitable architecture. Exploiting fine-grain parallelism is a critical part of exploiting all of the parallelism available in a given program, particularly since highly irregular forms of parallelism are often not visible at coarser levels and since the use of low-level parallelism has a multiplicative effect on the overall performance.To extract substantial parallelism from both the hardware and the compiler, we use a clean, highly parallel VLIW-like architecture that is synchronous, has multiple functional units and has a single program counter. The use of a hazard-free and homogeneous architecture does not result only in a better VLSI design but also considerably increases the compiler's ability to produce better code. To further enhance parallelism we modified the uni-cycle VLIW model and extended the transformations such that pipelined units that provide extra parallelism are used.Another approach presented is of resource constrained scheduling (RCS). Since the RCS problem is known to be NP-hard, in practice it may be solved only by a heuristic approach. We argue that using the heuristic after extraction of the unlimited-resources schedule may yield better results than if the heuristic has been applied at the beginning of the scheduling process.Through a series of benchmarks we evaluate hardware design trade-offs and show that speed-ups on average of one order of magnitude are feasible with sufficient functional units. However, when resources are limited we show that the number of functional units needed may be optimized for a particular suite of application programs
Recommended from our members
Pipelining of register transfer netlists
This paper describes a method for pipelining of register-to-register netlists. We define algorithms for inserting latches in a data path, both inside each unit and between the units as well as between control logic and the data path and for readjusting the state transition table. Experimental results on several benchmarks show 30%-40% improvement in performance
- …