130 research outputs found
Scalable Register File Architecture for CGRA Accelerators
Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable
of accelerating even non-parallel loops and loops with low trip-counts. One challenge
in compiling for CGRAs is to manage both recurring and nonrecurring variables in
the register file (RF) of the CGRA. Although prior works have managed recurring
variables via rotating RF, they access the nonrecurring variables through either a
global RF or from a constant memory. The former does not scale well, and the latter
degrades the mapping quality. This work proposes a hardware-software codesign
approach to manage all the variables in a local nonrotating RF. The hardware
provides a modulo-addition-based indexing mechanism to enable correct addressing
of recurring variables in a nonrotating RF. The compiler determines the number of
registers required for each recurring variable and configures the boundary between the
registers used for recurring and nonrecurring variables. The compiler also pre-loads
the read-only variables and constants into the local registers in the prologue of the
schedule. Synthesis and place-and-route results for the previous and the proposed RF
designs show that the proposed solution achieves a 17% better cycle time. Experiments
mapping several important, performance-critical loops collected from MiBench show
that the proposed approach improves performance (through better mapping) by 18%
compared to using constant memory.
Masters Thesis, Computer Science
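The modulo-addition indexing scheme described above can be sketched in a few lines. This is an illustrative model, not the paper's hardware design: `base`, `offset`, `iteration`, and `window` are assumed names for the register-region base, a recurring variable's logical offset, the current loop iteration, and the per-variable register count chosen by the compiler.

```python
def rf_index(base: int, offset: int, iteration: int, window: int) -> int:
    """Map a recurring variable's logical register to a physical register.

    Modulo addition over a compiler-sized window emulates a rotating
    register file on ordinary (nonrotating) local registers: each loop
    iteration shifts which physical register a logical offset names.
    """
    return base + (offset + iteration) % window
```

Under this convention, the physical register named by offset `k` in iteration `i` is the same one named by offset `k - 1` in iteration `i + 1`; the direction of rotation is merely a sign choice in the sketch.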
Flip: Data-Centric Edge CGRA Accelerator
Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators
due to the outstanding balance in flexibility, performance, and energy
efficiency. Classic CGRAs statically map compute operations onto the processing
elements (PEs) and route the data dependencies among the operations through the
Network-on-Chip. However, CGRAs are designed for fine-grained static
instruction-level parallelism and struggle to accelerate applications with
dynamic and irregular data-level parallelism, such as graph processing. To
address this limitation, we present Flip, a novel accelerator that enhances
traditional CGRA architectures to boost the performance of graph applications.
Flip retains the classic CGRA execution model while introducing a special
data-centric mode for efficient graph processing. Specifically, it exploits the
natural data parallelism of graph algorithms by mapping graph vertices onto
processing elements rather than operations, and by supporting dynamic
routing of temporary data according to the runtime evolution of the graph
frontier. Experimental results demonstrate that Flip achieves up to 36×
speedup with merely 19% more area compared to classic CGRAs. Compared to
state-of-the-art large-scale graph processors, Flip has similar energy
efficiency and 2.2× better area efficiency at a much-reduced power/area
budget.
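Flip's data-centric mode can be illustrated, at the software level, with a frontier-driven relaxation loop: vertices, not instructions, are the unit of work, and only vertices in the current frontier generate traffic each round. This is a minimal sketch of the execution model, not Flip's hardware; the graph encoding (`adj` as adjacency lists of `(neighbor, weight)` pairs) is an assumption for illustration.

```python
def frontier_relax(adj, src):
    """Frontier-driven shortest-path relaxation (Bellman-Ford style).

    Each round processes only the active frontier, mirroring how a
    data-centric accelerator would activate only the PEs whose vertices
    received updates in the previous round.
    """
    dist = {v: float("inf") for v in adj}
    dist[src] = 0
    frontier = {src}
    while frontier:                       # frontier evolves at runtime
        nxt = set()
        for u in frontier:
            for v, w in adj[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    nxt.add(v)            # v joins the next frontier
        frontier = nxt
    return dist
```

The work per round is proportional to the frontier size rather than the whole graph, which is exactly the irregular, data-dependent parallelism that a statically scheduled CGRA struggles to express.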
Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey
In the modern era of technology, a paradigm shift has been witnessed in areas involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). Specifically, Deep Neural Networks (DNNs) have emerged as a popular field of interest in most AI applications, such as computer vision, image and video processing, and robotics. Given mature digital technologies and the availability of authentic data and data-handling infrastructure, DNNs have become a credible choice for solving complex real-life problems. In certain situations, the performance and accuracy of a DNN can exceed those of human intelligence. However, DNNs are computationally cumbersome in terms of the resources and time required, and general-purpose architectures like CPUs struggle with such computationally intensive algorithms. Therefore, the research community has invested considerable effort in specialized hardware architectures such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), and Coarse-Grained Reconfigurable Array (CGRA) for the effective implementation of these algorithms. This paper surveys the research carried out on the development and deployment of DNNs using the aforementioned specialized hardware architectures and embedded AI accelerators. The review describes in detail the specialized hardware-based accelerators used in the training and/or inference of DNNs, and presents a comparative study of the accelerators discussed, based on factors such as power, area, and throughput. Finally, future research and development directions, such as trends in DNN implementation on specialized hardware accelerators, are discussed.
This review article is intended to serve as a guide to hardware architectures for accelerating and improving the effectiveness of deep learning research.
Rewriting History: Repurposing Domain-Specific CGRAs
Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices
promising both the flexibility of FPGAs and the performance of ASICs. However,
with restricted domains comes a danger: designing chips that cannot accelerate
enough current and future software to justify the hardware cost. We introduce
FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to
operations they do not natively support.
FlexC uses dataflow rewriting, replacing unsupported regions of code with
equivalent operations that are supported by the CGRA. We use equality
saturation, a technique enabling efficient exploration of a large space of
rewrite rules, to search the program space efficiently for supported
programs. We applied FlexC to over 2,000 loop kernels, compiling to four
different research CGRAs and 300 generated CGRAs, and demonstrate a 2.2×
increase in the number of loop kernels accelerated, leading to a 3× speedup
compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the
accelerator.
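The idea of rewriting unsupported operations into equivalent supported ones can be sketched as a toy breadth-first search over rewrite rules. This is not FlexC itself, which uses equality saturation over e-graphs, a far more efficient representation; the tuple encoding, the single multiply-by-2-to-shift rule, and the `supported` operator set are assumptions for illustration.

```python
from collections import deque

def ops_of(expr):
    """Yield every operator used in a tuple-encoded expression tree."""
    if isinstance(expr, tuple):
        yield expr[0]
        for arg in expr[1:]:
            yield from ops_of(arg)

def mul2_to_shl(expr):
    """One rewrite rule: (mul, x, 2) -> (shl, x, 1), applied anywhere in the tree."""
    results = []
    if isinstance(expr, tuple):
        if expr[0] == "mul" and expr[2] == 2:
            results.append(("shl", expr[1], 1))
        for i, arg in enumerate(expr[1:], start=1):
            for rewritten in mul2_to_shl(arg):
                results.append(expr[:i] + (rewritten,) + expr[i + 1:])
    return results

def rewrite_to_supported(expr, rules, supported):
    """BFS through rewrites until every operator is natively supported."""
    seen, queue = {expr}, deque([expr])
    while queue:
        e = queue.popleft()
        if all(op in supported for op in ops_of(e)):
            return e
        for rule in rules:
            for r in rule(e):
                if r not in seen:
                    seen.add(r)
                    queue.append(r)
    return None  # no supported form found within the rule set
```

A CGRA with adders and shifters but no multiplier can thus still accept a kernel containing `x * 2`, which is the flexibility FlexC provides at scale with a much larger rule set.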
Acceleration for the many, not the few
Although specialized hardware promises orders-of-magnitude performance gains, its
uptake has been limited by how challenging it is to program. Hardware accelerators
present challenges programmers are not used to, exposing details of the hardware that
are often hidden and requiring new programming styles to use them effectively.
Existing programming models often involve learning complex and hardware-specific
APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a
significant challenge to uptake: a steep, unforgiving, and untransferable learning curve.
However, programming hardware accelerators using traditional programming models
presents a challenge: mapping code not written with hardware accelerators in mind to
accelerators with restricted behaviour.
This thesis frames these challenges in terms of the acceleration equation, and
it presents solutions in three different contexts: for regular-expression accelerators,
for API-programmable accelerators (with Fourier transforms as a key case study), and
for heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows
that automatically morphing software written in traditional manners to fit hardware
accelerators is possible with no programmer effort and that huge potential speedups are
available.
- …