Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis
Data movement is the dominating factor affecting performance and energy in
modern computing systems. Consequently, many algorithms have been developed to
minimize the number of I/O operations for common computing patterns. Matrix
multiplication is no exception, and lower bounds have been proven and
implemented both for shared and distributed memory systems. Reconfigurable
hardware platforms are a lucrative target for I/O minimizing algorithms, as
they offer full control of memory accesses to the programmer. While bounds
developed in the context of fixed architectures still apply to these platforms,
the spatially distributed nature of their computational and memory resources
requires a decentralized approach to optimize algorithms for maximum hardware
utilization. We present a model to optimize matrix multiplication for FPGA
platforms, simultaneously targeting maximum performance and minimum off-chip
data movement, within constraints set by the hardware. We map the model to a
concrete architecture using a high-level synthesis tool, maintaining a high
level of abstraction that allows us to support arbitrary data types and enables
maintainability and portability across FPGA devices. Kernels generated from our
architecture are shown to offer competitive performance in practice, scaling
with both compute and memory resources. We offer our design as an open source
project to encourage the open development of linear algebra and I/O minimizing
algorithms on reconfigurable hardware platforms.
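The I/O trade-off the abstract targets can be illustrated with a toy model of tiled matrix multiplication: each b x b output tile streams in a row panel of A and a column panel of B, so larger tiles (bounded by on-chip memory) reduce off-chip traffic. This is a minimal sketch of the general idea, not the paper's actual performance model; the function name and formula are illustrative.

```python
# Illustrative off-chip traffic model for tiled C = A @ B
# (hypothetical model, not the paper's exact cost function).

def offchip_traffic(n: int, b: int) -> int:
    """Words moved off-chip for an n x n matmul with b x b output tiles.

    Each of the (n/b)^2 output tiles streams in a b x n row panel of A
    and an n x b column panel of B, and C is written back exactly once.
    """
    tiles = (n // b) ** 2
    return tiles * (2 * b * n) + n * n  # panel reads + one write of C

n = 1024
for b in (32, 64, 128, 256):
    print(b, offchip_traffic(n, b))
```

The traffic behaves as 2*n^3/b + n^2, so doubling the tile size roughly halves the dominant read term, which is why on-chip memory capacity directly bounds achievable data-movement savings.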
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
Fully realizing the potential of acceleration for Deep Neural Networks (DNNs)
requires understanding and leveraging algorithmic properties. This paper builds
upon the algorithmic insight that bitwidth of operations in DNNs can be reduced
without compromising their classification accuracy. However, to prevent
accuracy loss, the bitwidth varies significantly across DNNs and it may even be
adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer
limited benefits to accommodate the worst-case bitwidth requirements, or lead
to a degradation in final accuracy. To alleviate these deficiencies, this work
introduces dynamic bit-level fusion/decomposition as a new dimension in the
design of DNN accelerators. We explore this dimension by designing Bit Fusion,
a bit-flexible accelerator, that constitutes an array of bit-level processing
elements that dynamically fuse to match the bitwidth of individual DNN layers.
This flexibility in the architecture enables minimizing the computation and the
communication at the finest granularity possible with no loss in accuracy. We
evaluate the benefits of Bit Fusion using eight real-world feed-forward and
recurrent DNNs. The proposed microarchitecture is implemented in Verilog and
synthesized in 45 nm technology. Using the synthesis results and cycle-accurate
simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN
accelerators, Eyeriss and Stripes. In the same area, frequency, and process
technology, Bit Fusion offers 3.9x speedup and 5.1x energy savings over Eyeriss.
Compared to Stripes, Bit Fusion provides 2.6x speedup and 3.9x energy reduction
at the 45 nm node when its area and frequency are set to those of Stripes.
Scaled to the 16 nm GPU technology node, Bit Fusion almost matches the
performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while
consuming merely 895 milliwatts of power.
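The core idea of bit-level fusion, decomposing a wide multiply into narrow partial products that can also serve narrower operands directly, can be sketched in software: a k-bit multiply is built entirely from 2-bit x 2-bit products combined with shift-adds. The function names are hypothetical and this models only the arithmetic, not the accelerator's datapath.

```python
def split2(x: int, n_chunks: int):
    """Split x into 2-bit digits, least-significant first
    (analogous to the bit-level processing elements)."""
    return [(x >> (2 * i)) & 0b11 for i in range(n_chunks)]

def fused_mul(a: int, b: int, bits: int) -> int:
    """Multiply two unsigned `bits`-wide operands using only
    2-bit x 2-bit partial products, fused by shift-adds."""
    n = bits // 2
    total = 0
    for i, ai in enumerate(split2(a, n)):
        for j, bj in enumerate(split2(b, n)):
            total += (ai * bj) << (2 * (i + j))  # place partial product
    return total

print(fused_mul(13, 11, 8))  # 143, same as 13 * 11
```

A layer that only needs 2-bit or 4-bit operands can use the same narrow multipliers without fusing them, which is where the dynamic efficiency gain over a fixed-bitwidth datapath comes from.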
FBLAS: Streaming Linear Algebra on FPGA
Spatial computing architectures pose an attractive alternative to mitigate
control and data movement overheads typical of load-store architectures. In
practice, these devices are rarely considered in the HPC community due to the
steep learning curve, low productivity and lack of available libraries for
fundamental operations. High-level synthesis (HLS) tools are facilitating
hardware programming, but optimizing for these architectures requires factoring
in new transformations and resources/performance trade-offs. We present FBLAS,
an open-source HLS implementation of BLAS for FPGAs that enables reusability,
portability and easy integration with existing software and hardware codes.
FBLAS' implementation allows scaling hardware modules to exploit on-chip
resources, and module interfaces are designed to natively support streaming
on-chip communications, allowing them to be composed to reduce off-chip
communication. With FBLAS, we set a precedent for FPGA library design, and
contribute to the toolbox of customizable hardware components necessary for HPC
codes to start productively targeting reconfigurable platforms.
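The streaming composition of modules described above can be mimicked in software with generators: the output stream of one routine feeds directly into the next, so no intermediate array ever materializes. This is only an analogy for how FBLAS modules compose over on-chip channels; the routine names follow BLAS conventions but the code is illustrative, not FBLAS's API.

```python
# Stream-composable BLAS-like modules sketched as Python generators
# (illustrative analogy for on-chip streaming; not the FBLAS API).

def scal(alpha, xs):
    """BLAS-style SCAL: stream alpha * x element by element."""
    for x in xs:
        yield alpha * x

def dot(xs, ys):
    """BLAS-style DOT: reduce two element streams to a scalar."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
# scal streams directly into dot: no intermediate buffer for alpha*x.
print(dot(scal(2.0, iter(x)), iter(y)))  # 2*(4 + 10 + 18) = 64.0
```

On an FPGA the same composition avoids a round trip to off-chip memory between the two kernels, which is the communication reduction the abstract refers to.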
Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
The multi-pumping resource sharing technique can overcome the limitations
commonly found in single-clocked FPGA designs by allowing hardware components
to operate at a higher clock frequency than the surrounding system. However,
this optimization cannot be expressed at high levels of abstraction, such as
HLS, requiring the use of hand-optimized RTL. In this paper we show how to
leverage multiple clock domains for computational subdomains on reconfigurable
devices through data movement analysis on high-level programs. We offer a novel
view on multi-pumping as a compiler optimization - a superclass of traditional
vectorization. As multiple data elements are fed and consumed, the computations
are packed temporally rather than spatially. The optimization is applied
automatically using an intermediate representation that maps high-level code to
HLS. Internally, the optimization injects modules into the generated designs,
incorporating RTL for fine-grained control over the clock domains. We obtain a
reduction of resource consumption by up to 50% on critical components and 23%
on average. For scalable designs, this can enable further parallelism,
increasing overall performance.
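The "temporal rather than spatial" packing can be modeled in software as time-multiplexing: one functional unit clocked twice as fast alternates between two lanes, halving the number of units needed for the same throughput. This is a behavioral sketch of the idea with hypothetical names, not the compiler's actual intermediate representation.

```python
# Behavioral model of multi-pumping: a single multiplier running at
# 2x the system clock services two spatial lanes, one per fast cycle.
# (Illustrative sketch only.)

def multipump(lane_a, lane_b):
    """Interleave two operand streams through one shared multiplier."""
    out_a, out_b = [], []
    for (xa, ya), (xb, yb) in zip(lane_a, lane_b):
        out_a.append(xa * ya)  # fast-clock cycle 0: serve lane A
        out_b.append(xb * yb)  # fast-clock cycle 1: serve lane B
    return out_a, out_b

a = [(1, 2), (3, 4)]
b = [(5, 6), (7, 8)]
print(multipump(a, b))  # ([2, 12], [30, 56])
```

Seen this way, multi-pumping is the temporal dual of vectorization: instead of replicating the unit across space, consecutive fast-clock cycles play the role of vector lanes.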
A scalable matrix computing unit architecture for FPGA and SCUMO user design interface
High-dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix computing unit designed on the basis of circulant matrices. It optimizes the data flow for the computation of any sequence of matrix operations, removing the need to move intermediate results, and performs individual matrix operations in either direct or transposed form (the transpose operation only requires a modification of the data addressing). The supported matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited only by the available resources. In addition, a design environment is developed that assists the user, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N x N matrices, the architecture requires N ALU-RAM blocks and runs in O(N^2) clock cycles, requiring N*N + 7 and N + 7 cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex7 FPGA device, computation on 500 x 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
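The abstract's headline figures are internally consistent: with N ALU-RAM blocks each completing one operation per cycle, peak throughput is N times the clock frequency. A short arithmetic check, using only numbers stated in the abstract:

```python
# Checking the reported figures: N ALU-RAM blocks, one op per cycle each,
# so peak throughput = N * f_clk.

N = 500
f_clk = 346e6           # Hz, reported maximum on the tested Virtex7
gops = N * f_clk / 1e9
print(gops)             # 173.0 GOPS, matching the reported figure

mm_cycles = N * N + 7   # matrix-matrix operation latency
mv_cycles = N + 7       # matrix-vector operation latency
print(mm_cycles, mv_cycles)  # 250007 507
```

Note that a matrix-matrix operation finishing in roughly N^2 cycles implies each of the N blocks contributes about N operations per cycle's worth of work across the run, which is exactly the circulant-based dataflow the abstract describes.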