1,081 research outputs found
Automatic Nested Loop Acceleration on FPGAs Using Soft CGRA Overlay
Session 1: HLS Toolingpostprin
High performance computing with FPGAs
Field-programmable gate arrays represent an army of logical units which can be organized in a highly parallel or pipelined fashion to implement an algorithm in hardware. The flexibility of this new medium creates new challenges to find the right processing paradigm which takes into account of the natural constraints of FPGAs: clock frequency, memory footprint and communication bandwidth. In this paper first use of FPGAs as a multiprocessor on a chip or its use as a highly functional coprocessor are compared, and the programming tools for hardware/software codesign are discussed. Next a number of techniques are presented to maximize the parallelism and optimize the data locality in nested loops. This includes unimodular transformations, data locality improving loop transformations and use of smart buffers. Finally, the use of these techniques on a number of examples is demonstrated.
The results in the paper and in the literature show that, with the proper programming tool set, FPGAs can speedup computation kernels significantly with respect to traditional processors
Application-Level Performance Improvement for Stream Program on CGRA-based systems
Department of Computer EngineeringCoarse-Grained Reconfigurable Architectures (CGRAs), often used as coprocessors for DSP and multimedia kernels, can deliver highly energy-effcient execution for compute-intensive kernels. Simultaneously, stream applications, which consist of many actors and channels connecting them, can provide natural representations for DSP applications, and therefore be a good match for CGRAs. We present our results of mapping DSP applications written in StreamIt language to CGRAs, along with our mapping flow. One important challenge in mapping is how to manage the multitude of kernels in the application for the limited local memory of a CGRA, for which we present a novel integer linear programming-based solution. Our evaluation results demonstrate that our software and hardware optimizations can help generate highly effcient mapping of stream applications to CGRAs, enabling far more energy-effcient executions (7x worse to 50x better) compared to using state-of-theart GP-GPUs. Further, we eliminate communication overhead and reduce computation overhead using combination of sychronous/asynchronous processors and DMA. This optimization also improve performance by 17.1% on average comparing to baseline system.ope
Data locality and parallelism optimization using a constraint-based approach
Cataloged from PDF version of article.Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In
the context of data-intensive embedded applications, there have been two complementary approaches to
enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism.
Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher
levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require
a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip
memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This
can be achieved by considering multiple loop nests simultaneously. Although compilers address these two
problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously.
Therefore, an integrated approach that combines these two can generate much better results than each
individual approach. Based on these observations, this paper proposes a constraint network (CN)-based
formulation for data locality optimization and code parallelization. The paper also presents experimental
evidence, demonstrating the success of the proposed approach, and compares our results with those
obtained through previously proposed approaches. The experiments from our implementation indicate
that the proposed approach is very effective in enhancing data locality and parallelization.
© 2010 Elsevier Inc. All rights reserved
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach on
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x
Scaling Kernel Speedup to Application-Level Performance with CGRAS: Stream Program
Department of Electrical EngineeringWhile accelerators often generate impressive speedup at the kernel level, the speedup often do not scale to the application-level performance improvement due to several reasons.
In this paper we identify key factors impacting the application-level performance of CGRA (Coarse-Grained Recon???gurable Architecture) accelerators using stream
programs as the target application.
As a practical remedy, we also propose a low-cost architecture extension focusing on the nested loops appearing very frequently in stream programs.
We also present detailed application-level performance evaluation for the full StreamIt benchmark applications, which suggests that CGRAs can realistically accelerate
stream applications by 3.6???4.0 times on average, compared to software-only execution on a typical mobile processor.ope
An embedded system supporting dynamic partial reconfiguration of hardware resources for morphological image processing
Processors for high-performance computing applications are generally designed with a focus on high clock rates, parallelism of operations and high communication bandwidth, often at the expense of large power consumption. However, the emphasis of many embedded systems and untethered devices is on minimal hardware requirements and reduced power consumption. With the incessant growth of computational needs for embedded applications, which contradict chip power and area needs, the burden is put on the hardware designers to come up with designs that optimize power and area requirements.
This thesis investigates the efficient design of an embedded system for morphological image processing applications on Xilinx FPGAs (Field Programmable Gate Array) by optimizing both area and power usage while delivering high performance. The design leverages a unique capability of FPGAs called dynamic partial reconfiguration (DPR) which allows changing the hardware configuration of silicon pieces at runtime. DPR allows regions of the FPGA to be reprogrammed with new functionality while applications are still running in the remainder of the device.
The main aim of this thesis is to design an embedded system for morphological image processing by accounting for real time and area constraints as compared to a statically configured FPGA. IP (Intellectual Property) cores are synthesized for both static and dynamic time. DPR enables instantiation of more hardware logic over a period of time on an existing device by time-multiplexing the hardware realization of functions. A comparison of power consumption is presented for the statically and dynamically reconfigured designs. Finally, a performance comparison is included for the implementation of the respective algorithms on a hardwired ARM processor as well as on another general-purpose processor. The results prove the viability of DPR for morphological image processing applications
- …