Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for the manual fine-tuning of source code.
Generic Connectivity-Based CGRA Mapping via Integer Linear Programming
Coarse-grained reconfigurable architectures (CGRAs) are programmable logic
devices with large coarse-grained ALU-like logic blocks, and multi-bit
datapath-style routing. CGRAs often have relatively restricted data routing
networks, so they attract CAD mapping tools that use exact methods, such as
Integer Linear Programming (ILP). However, tools that target general
architectures must use large constraint systems to fully describe an
architecture's flexibility, resulting in lengthy run-times. In this paper, we
propose to derive connectivity information from an otherwise generic device
model, and use this to create simpler ILPs, which we combine in an iterative
schedule and retain most of the exactness of a fully-generic ILP approach. This
new approach has a speed-up geometric mean of 5.88x when considering benchmarks
that do not hit a time-limit of 7.5 hours on the fully-generic ILP, and 37.6x
otherwise. This was measured using the set of benchmarks used to originally
evaluate the fully-generic approach and several more benchmarks representing
computation tasks, over three different CGRA architectures. All run-times of
the new approach are less than 20 minutes, with 90th percentile time of 410
seconds. The proposed mapping techniques are integrated into, and evaluated
using the open-source CGRA-ME architecture modelling and exploration framework.
Comment: 8 pages of content; 8 figures; 3 tables; to appear in FCCM 2019; uses
the CGRA-ME framework at http://cgra-me.ece.utoronto.ca
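As a rough illustration of what connectivity-based mapping means, the sketch below places the operations of a small dataflow graph onto functional units so that every data dependence lands on a direct link. It is not the paper's ILP formulation: it uses brute-force enumeration as a stand-in for an exact solver, and the names `map_dfg`, `fus`, `links`, and the 2x2 array are hypothetical, chosen only for this example.

```python
from itertools import permutations


def map_dfg(ops, dfg_edges, fus, links):
    """Return a dict op -> functional unit such that every DFG edge
    (producer, consumer) is placed on a direct link, or None if no
    such placement exists. Exhaustive search, so only for tiny inputs."""
    link_set = set(links)
    # Each permutation assigns distinct functional units to the operations.
    for placement in permutations(fus, len(ops)):
        assign = dict(zip(ops, placement))
        if all((assign[u], assign[v]) in link_set for u, v in dfg_edges):
            return assign
    return None


# Hypothetical 2x2 array with bidirectional nearest-neighbour links.
fus = ["f00", "f01", "f10", "f11"]
undirected = [("f00", "f01"), ("f00", "f10"), ("f01", "f11"), ("f10", "f11")]
links = undirected + [(b, a) for a, b in undirected]

# A three-operation chain: load -> mul -> add.
ops = ["load", "mul", "add"]
dfg_edges = [("load", "mul"), ("mul", "add")]
mapping = map_dfg(ops, dfg_edges, fus, links)

# A triangle of dependences cannot be placed on this mesh without routing
# through intermediate units, because the mesh contains no 3-cycle.
triangle = map_dfg(["a", "b", "c"],
                   [("a", "b"), ("b", "c"), ("a", "c")], fus, links)
```

An exact method such as ILP replaces the enumeration with constraints over binary placement variables; the sketch only shows the feasibility condition those constraints would have to encode in this simplified setting.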
Achieving Efficient Realization of Kalman Filter on CGRA through Algorithm-Architecture Co-design
In this paper, we present an efficient realization of the Kalman Filter (KF) that
can achieve up to 65% of the theoretical peak performance of the underlying
architecture platform. KF is realized using the Modified Faddeeva Algorithm (MFA)
as a basic building block due to its versatility, and the REDEFINE Coarse-Grained
Reconfigurable Architecture (CGRA) is used as the platform for experiments, since
REDEFINE is capable of supporting realization of a set of algorithmic compute
structures at run-time on a Reconfigurable Data-path (RDP). We perform several
hardware- and software-based optimizations in the realization of KF to achieve a
116% improvement in terms of Gflops over the first realization of KF. Overall,
with the presented approach for KF, a 4-105x performance improvement in terms of
Gflops/watt over several academically and commercially available realizations
of KF is attained. In REDEFINE, we show that our implementation is scalable and
the performance attained is commensurate with the underlying hardware resources.
Comment: Accepted in ARC 201
High throughput accelerator interface framework for a linear time-multiplexed FPGA overlay
Coarse-grained FPGA overlays improve design productivity through software-like programmability and fast compilation. However, the effectiveness of overlays as accelerators depends on suitable interface and programming integration into a typically processor-based computing system, an aspect which has often been neglected in evaluations of overlays. We explore the integration of a time-multiplexed FPGA overlay over a server-class PCI Express interface. We show how this integration can be optimised to maximise performance, and evaluate the area overhead. We also propose a user-friendly programming model for such an overlay accelerator system.
Mapping the CgrA regulon of Rhodospirillum centenum reveals a hierarchal network controlling Gram-negative cyst development
Table S2. A table of all called CgrA ChIP-seq peaks and their locations on the genome. (PDF 286 kb)
Scaling Kernel Speedup to Application-Level Performance with CGRAS: Stream Program
Department of Electrical Engineering
While accelerators often generate impressive speedup at the kernel level, the speedup often does not scale to application-level performance improvement, for several reasons.
In this paper we identify key factors impacting the application-level performance of CGRA (Coarse-Grained Reconfigurable Architecture) accelerators using stream
programs as the target application.
As a practical remedy, we also propose a low-cost architecture extension focusing on the nested loops that appear very frequently in stream programs.
We also present a detailed application-level performance evaluation for the full StreamIt benchmark applications, which suggests that CGRAs can realistically accelerate
stream applications by 3.6–4.0 times on average, compared to software-only execution on a typical mobile processor.
