High throughput spatial convolution filters on FPGAs
Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because the inherent parallelism in these computations can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.
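As a rough illustration of the workload this abstract describes, the sketch below computes a direct-form 2-D spatial filter in software and estimates the multiply-accumulate (MAC) rate needed to keep up with a given frame size and frame rate. It is a minimal sketch and not the paper's architecture: the function names `filter2d_valid` and `macs_per_second`, the "valid" (no padding) boundary handling, and the use of the correlation form (kernel not flipped) are all assumptions made for the example.

```python
def filter2d_valid(image, kernel):
    """Direct-form 2-D spatial filter over the 'valid' region (no padding).

    Assumption: correlation form, i.e. the kernel is not flipped. Flip the
    kernel in both dimensions beforehand for strict convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # One multiply-accumulate per kernel coefficient per output pixel.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def macs_per_second(width, height, fps, kh, kw):
    """MACs per second needed to filter every pixel of every frame."""
    return width * height * fps * kh * kw


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(filter2d_valid(image, kernel))
    # Hypothetical 7x7 filter on 3840x2160 frames at 60 FPS.
    print(macs_per_second(3840, 2160, 60, 7, 7))
```

The inner two loops are the part a hardware design would unroll across parallel multipliers; the MAC-rate estimate shows why the required parallelism grows with both kernel size and frame rate.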
Exact and heuristic allocation of multi-kernel applications to multi-FPGA platforms
FPGA-based accelerators have demonstrated high energy efficiency compared to GPUs and CPUs. However, single-FPGA designs may not achieve sufficient task parallelism. In this work, we optimize the mapping of high-performance multi-kernel applications, like Convolutional Neural Networks, to multi-FPGA platforms. First, we formulate the system-level optimization problem, choosing within a huge design space the parallelism and number of compute units for each kernel in the pipeline. Then we solve it using a combination of Geometric Programming, producing the optimum performance solution given resource and DRAM bandwidth constraints, and a heuristic allocator of the compute units on the FPGA cluster. Peer Reviewed. Postprint (author's final draft)
Document Classification Systems in Heterogeneous Computing Environments
Datacenter workloads demand high-throughput, low-cost and power-efficient solutions. In most data centers the operating costs dominate the infrastructure cost. The ever-growing amounts of data and the critical need for higher-throughput, more energy-efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU-based implementations of document classification systems. Several heterogeneous systems were investigated in the past where CPUs were combined with GPUs and FPGAs as system accelerators. The increasing complexity of FPGAs has made them an interesting device in heterogeneous computing environments but, on the other hand, difficult to program using hardware description languages. We explore the trade-offs between high-level synthesis (HLS) and low-level synthesis when programming FPGAs. Low-level synthesis results in less hardware resource usage on FPGAs and also offers higher throughput than an HLS tool. With an HLS tool, however, different heterogeneous computing devices such as multicore CPUs and GPUs can also be targeted. Through our implementation experience and empirical results for data-centric applications, we conclude that we can achieve power-efficient results for this set of applications by using either low-level synthesis or high-level synthesis for programming FPGAs.
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach on
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
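To make the inefficiency argument in this abstract concrete, the sketch below estimates how well a fixed grid of multipliers is used by layers of different shapes. It is a toy model under stated assumptions, not the paper's methodology: it assumes a processor that unrolls `tn` input channels by `tm` output channels per cycle, and the helper name `utilization` and all layer dimensions in the example are hypothetical.

```python
import math


def utilization(n_in, m_out, tn, tm):
    """Fraction of a tn x tm multiplier grid doing useful work on one layer.

    Assumed model: the layer's n_in input channels and m_out output channels
    are tiled onto the grid, and partially filled tiles still cost a full cycle.
    """
    cycles = math.ceil(n_in / tn) * math.ceil(m_out / tm)
    return (n_in * m_out) / (cycles * tn * tm)


if __name__ == "__main__":
    # Hypothetical layers as (input channels, output channels).
    layers = [(3, 48), (48, 128), (256, 192)]

    # One large processor shared by every layer.
    for n_in, m_out in layers:
        print((n_in, m_out), round(utilization(n_in, m_out, 7, 64), 3))

    # A processor whose grid matches the first layer exactly.
    print(utilization(3, 48, 3, 48))
```

In this model, a grid sized for wide layers leaves most multipliers idle on a layer with only three input channels, while a smaller grid matched to that layer stays fully busy. That mismatch is the kind of waste that motivates tailoring separate processors to subsets of layers.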
A Reconfigurable Vector Instruction Processor for Accelerating a Convection Parametrization Model on FPGAs
High Performance Computing (HPC) platforms allow scientists to model
computationally intensive algorithms. HPC clusters increasingly use
General-Purpose Graphics Processing Units (GPGPUs) as accelerators; FPGAs
provide an attractive alternative to GPGPUs for use as co-processors, but they
are still far from being mainstream due to a number of challenges faced when
using FPGA-based platforms. Our research aims to make FPGA-based high
performance computing more accessible to the scientific community. In this work
we present the results of investigating the acceleration of a particular
atmospheric model, Flexpart, on FPGAs. We focus on accelerating the most
computationally intensive kernel from this model. The key contribution of our
work is the architectural exploration we undertook to arrive at a solution that
best exploits the parallelism available in the legacy code, and is also
convenient to program, so that eventually the compilation of high-level legacy
code to our architecture can be fully automated. We present three different
types of architecture, comparing their resource utilization and performance,
and propose that an architecture where there are a number of computational
cores, each built along the lines of a vector instruction processor, works best
in this particular scenario, and is a promising candidate for a generic
FPGA-based platform for scientific computation. We also present the results of
experiments done with various configuration parameters of the proposed
architecture, to show its utility in adapting to a range of scientific
applications.
Comment: This is an extended pre-print version of work that was presented at
the International Symposium on Highly Efficient Accelerators and
Reconfigurable Technologies (HEART2014), Sendai, Japan, June 9-11, 2014.