1,970 research outputs found
High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP
This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high-performance, efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM's Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on the performance-per-watt criterion and perform better than all other platforms on the performance-per-dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both the performance-per-watt and performance-per-dollar criteria. In general, in order to outperform other technologies on the performance-per-dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
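For orientation, a minimal sketch of the Smith-Waterman recurrence the abstract refers to is shown below; the scoring parameters (match 2, mismatch -1, gap -2) are illustrative rather than taken from the paper. The anti-diagonal independence of the score matrix is what makes the algorithm so amenable to systolic-array implementations on FPGAs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local-alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best local alignment score ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a score never drops below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

Cells on the same anti-diagonal depend only on earlier anti-diagonals, so a hardware pipeline can compute one full anti-diagonal per clock cycle.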
ParaFPGA15: exploring threads and trends in programmable hardware
The ParaFPGA symposium focuses on parallel techniques using FPGAs as accelerators in high performance computing. The green computing appeal of low power consumption at high performance has long been tempered by long design cycles and hard programmability issues. However, in recent years FPGAs have become new contenders as versatile compute accelerators because of growing market interest, extended application domains and maturing high-level synthesis tools. The keynote paper highlights the historical and modern approaches to high-level FPGA programming, and the contributions cover applications such as NP-complete satisfiability problems and convex hull image processing, as well as performance evaluation, partial reconfiguration and systematic design exploration.
Weighing up the new kid on the block: Impressions of using Vitis for HPC software development
The use of reconfigurable computing, and FPGAs in particular, has strong potential in the field of High Performance Computing (HPC). However, the traditionally high barrier to entry when it comes to programming this technology has, until now, precluded widespread adoption. To popularise reconfigurable computing with communities such as HPC, Xilinx have recently released the first version of Vitis, a platform aimed at making the programming of FPGAs much more a question of software development rather than hardware design. However, a key question is how well this technology fulfils that aim, and whether the tooling is mature enough that software developers using FPGAs to accelerate their codes is now a realistic proposition, or whether it simply increases the convenience for existing experts. To examine this question we use the Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing and optimising HPC codes, describing the different steps and potential pitfalls of the technology. The outcome of this exploration is a demonstration that, whilst Vitis is an excellent step forward and significantly lowers the barrier to entry in developing codes for FPGAs, it is not a silver bullet, and an underlying understanding of dataflow-style algorithmic design and an appreciation of the architecture are still key to obtaining good performance on reconfigurable architectures.
Comment: Pre-print of Weighing up the new kid on the block: Impressions of using Vitis for HPC software development, paper in the 30th International Conference on Field Programmable Logic and Applications
It's all about data movement: Optimising FPGA data access to boost performance
The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance the entire algorithm must be rethought and recast in a dataflow style. Reducing the cost of data movement is critically important for all computing devices, and in this paper we explore the most appropriate techniques for FPGAs. We do this by describing the optimisation of an existing FPGA implementation of an atmospheric model's advection scheme. Taking an FPGA code that was over four times slower than running on the CPU, mainly due to data movement overhead, we describe the profiling and optimisation strategies adopted to significantly reduce the runtime and bring the performance of our FPGA kernels to a much more practical level for real-world use. The result of this work is a set of techniques, steps, and lessons learnt that we have found significantly improve the performance of FPGA-based HPC codes, and that others can adopt in their own codes to achieve similar results.
Comment: Preprint of article in 2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)
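One of the standard remedies for the data movement overhead this abstract describes is double buffering: transferring the next chunk of data while the device computes on the current one. The host-side pattern can be sketched as below; `transfer_to_device` and `run_kernel` are hypothetical stand-ins for the real DMA and kernel-launch calls, not APIs from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def transfer_to_device(chunk):
    # Hypothetical stand-in for a host-to-FPGA DMA transfer.
    return chunk  # pretend the data now resides in device memory

def run_kernel(device_data):
    # Hypothetical stand-in for launching the FPGA compute kernel.
    return [x * 2 for x in device_data]

def process_overlapped(chunks):
    """Double buffering: transfer chunk i+1 while computing on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer_to_device, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()
            # Kick off the next transfer before computing on the current chunk,
            # so transfer and compute proceed concurrently.
            pending = pool.submit(transfer_to_device, nxt)
            results.append(run_kernel(ready))
        results.append(run_kernel(pending.result()))
    return results

print(process_overlapped([[1, 2], [3, 4], [5, 6]]))
```

When transfer time and compute time are comparable, this pattern can hide nearly all of the transfer cost behind useful work.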
Application specific dataflow machine construction for programming FPGAs via Lucent
Field Programmable Gate Arrays (FPGAs) have the potential to accelerate specific HPC codes. However, even with the advent of High Level Synthesis (HLS), which enables FPGA programmers to write code in C or C++, programming such devices still requires considerable expertise. Much of this is due to the fact that these architectures are founded on dataflow rather than the Von Neumann abstraction of CPUs or GPUs. Thus programming FPGAs via imperative languages is not optimal and can result in very significant performance differences between the first and final versions of algorithms on dataflow architectures, with the steps in between often not obvious and requiring considerable expertise.
In this position paper we argue that languages built upon dataflow principles should be exploited to enable fast-by-construction codes for FPGAs, which is akin to the programmer adopting the abstraction of developing a bespoke dataflow machine specialised for their application. It is our belief that much can be learnt from the generation of dataflow languages that gained popularity in the 1970s and 1980s around programming general purpose dataflow machines, and we introduce Lucent, a modern derivative of Lucid, used as a vehicle to explore this hypothesis. The idea behind Lucent is to provide high programmer productivity and performance for FPGAs by giving developers the most suitable language-level abstractions. The focus of Lucent is very much on supporting the acceleration of HPC kernels, rather than the embedded electronics and circuit level, and we provide a brief overview of the language driven by examples.
Comment: Accepted at the LATTE (Languages, Tools, and Techniques for Accelerator Design) ASPLOS workshop
Proxy Circuits for Fault-Tolerant Primitive Interfacing in Reconfigurable Devices Targeting Extreme Environments
Continuous interface access to device-level primitives in reconfigurable devices operating in extreme environments is key to reliable operation. However, it is possible for a primitive's interface controller, which is static, to be rendered non-operational by permanent damage to the controller's circuitry. To mitigate this, this paper proposes the use of relocatable proxy circuits to provide remote interfacing capability to primitives from anywhere on a reconfigurable device. A demonstration with a device register read controller shows that an improvement in fault tolerance can be achieved.
Accelerating advection for atmospheric modelling on Xilinx and Intel FPGAs
Reconfigurable architectures, such as FPGAs, enable the execution of code at the electronics level, avoiding the assumptions imposed by the general purpose black-box micro-architectures of CPUs and GPUs. Such tailored execution can result in increased performance and power efficiency, and as the HPC community moves towards exascale an important question is the role such hardware technologies can play in future supercomputers.
In this paper we explore the porting of the PW advection kernel, an important code component used in a variety of atmospheric simulations and accounting for around 40% of the runtime of the popular Met Office NERC Cloud model (MONC). Building upon previous work which ported this kernel to an older generation of Xilinx FPGA, we target the latest generation Xilinx Alveo U280 and Intel Stratix 10 FPGAs. Exploring the development of a dataflow design which is performance portable between vendors, we then describe implementation differences between the tool chains and compare kernel performance between FPGA hardware. This is followed by a more general performance comparison, scaling up the number of kernels on the Xilinx Alveo and Intel Stratix 10, against a 24-core Xeon Platinum Cascade Lake CPU and an NVIDIA Tesla V100 GPU. When overlapping the transfer of data to and from the boards with compute, the FPGA solutions considerably outperform the CPU and, whilst falling short of the GPU in terms of performance, demonstrate power usage benefits, with the Alveo being especially power efficient. The result of this work is a comparison and set of design techniques that apply both to this specific atmospheric advection kernel on Xilinx and Intel FPGAs, and that are also of interest more widely when looking to accelerate HPC codes on a variety of reconfigurable architectures.
Comment: Preprint of article in the IEEE Cluster FPGA for HPC Workshop 2021 (HPC FPGA 2021)
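To give a flavour of the kind of kernel being ported, the sketch below shows a much-simplified 1D upwind advection step; the actual PW advection scheme in MONC is 3D and considerably more involved, so this is an illustration of the stencil structure only, not the paper's kernel. Each output point reads a fixed, small neighbourhood of the input, which is precisely the access pattern that dataflow pipelines on FPGAs exploit.

```python
def upwind_advect(field, velocity, dt, dx):
    """One explicit time step of 1D upwind advection with periodic boundaries.

    Each output element depends only on a fixed neighbourhood of the input,
    so the loop maps naturally onto a streaming dataflow pipeline.
    """
    n = len(field)
    c = velocity * dt / dx  # Courant number; stability requires |c| <= 1
    out = [0.0] * n
    for i in range(n):
        if velocity >= 0.0:
            # Upwind difference uses the neighbour the flow arrives from;
            # field[i - 1] wraps periodically via Python's negative indexing.
            out[i] = field[i] - c * (field[i] - field[i - 1])
        else:
            out[i] = field[i] - c * (field[(i + 1) % n] - field[i])
    return out

print(upwind_advect([1.0, 0.0, 0.0, 0.0], velocity=1.0, dt=0.5, dx=1.0))
```

With periodic boundaries the scheme conserves the total of the field, which makes a convenient sanity check when re-validating an optimised FPGA port against the reference CPU implementation.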