BandMap: Application Mapping with Bandwidth Allocation for Coarse-Grained Reconfigurable Array
This paper proposes BandMap, an application mapping algorithm for
coarse-grained reconfigurable arrays (CGRAs) that allocates bandwidth in the PE
array according to the transfer demands of the data, especially data with
high spatial reuse, in order to reduce the number of routing PEs. To incorporate
bandwidth allocation, BandMap maps the data flow graphs (DFGs) abstracted from
applications by solving the maximum independent set (MIS) problem on a conflict
graph of mixed tuple and quadruple resource occupations. Compared to the
state-of-the-art BusMap work, BandMap achieves fewer routing PEs with the same
or even smaller initiation interval (II).
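The abstract's core idea, selecting a conflict-free set of mapping candidates via a maximum independent set, can be illustrated with a small sketch. This is a simplified greedy MIS heuristic over a generic conflict graph, not BandMap's actual algorithm; the placement names and conflict rules are invented for illustration.

```python
# Illustrative sketch (not BandMap's algorithm): greedy maximum
# independent set on a resource-occupation conflict graph. Nodes are
# hypothetical (operation, PE) placement candidates; edges mark pairs
# that cannot coexist (e.g. same PE in the same cycle, or two
# placements of the same operation).
def greedy_mis(nodes, conflicts):
    """Pick a conflict-free subset of placements, lowest-degree first."""
    degree = {n: 0 for n in nodes}
    for a, b in conflicts:
        degree[a] += 1
        degree[b] += 1
    chosen, excluded = [], set()
    for n in sorted(nodes, key=lambda n: degree[n]):
        if n in excluded:
            continue
        chosen.append(n)
        for a, b in conflicts:
            if a == n:
                excluded.add(b)
            elif b == n:
                excluded.add(a)
    return chosen

placements = ["op0@PE0", "op0@PE1", "op1@PE1", "op2@PE2"]
conflicts = [("op0@PE1", "op1@PE1"),   # same PE, same cycle
             ("op0@PE0", "op0@PE1")]   # two placements of one op
print(greedy_mis(placements, conflicts))
```

A real CGRA mapper would build the conflict graph from tuple/quadruple resource occupations (PE, cycle, routing, bandwidth) and solve MIS exactly or with stronger heuristics; the greedy pass above only conveys the structure of the problem.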
WindMill: A Parameterized and Pluggable CGRA Implemented by DIAG Design Flow
With the cross-fertilization of applications and the ever-increasing scale of
models, the efficiency and productivity of hardware computing architectures
have become inadequate. This inadequacy further exacerbates issues in design
flexibility, design complexity, development cycle, and development cost (4-D
problems) across divergent scenarios. To address these challenges, this paper
proposes a flexible design flow called DIAG based on plugin techniques. The
proposed flow guides hardware development through four layers: definition (D),
implementation (I), application (A), and generation (G). Furthermore, a
versatile CGRA generator called WindMill is implemented, allowing agile
generation of customized hardware accelerators based on specific application
demands. Applications and algorithm tasks from three aspects are evaluated. In
the case of a reinforcement learning algorithm, a significant performance
improvement compared to a GPU is achieved.
Comment: 7 pages, 10 figures
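The four-layer, plugin-driven flow the abstract describes can be sketched as a staged pipeline where each layer is a list of pluggable transformations over a design specification. All names here are hypothetical; this is not WindMill's real API, just a minimal model of a DIAG-style flow.

```python
# Hypothetical sketch of a DIAG-style four-layer flow: definition (D),
# implementation (I), application (A), generation (G). Each layer holds
# plugins that transform a design spec; swapping plugins changes the
# generated hardware without changing the flow itself.
class Flow:
    def __init__(self):
        self.stages = {"D": [], "I": [], "A": [], "G": []}

    def register(self, layer, plugin):
        self.stages[layer].append(plugin)

    def run(self, spec):
        for layer in "DIAG":          # fixed layer order
            for plugin in self.stages[layer]:
                spec = plugin(spec)
        return spec

flow = Flow()
flow.register("D", lambda s: {**s, "defined": True})
flow.register("I", lambda s: {**s, "impl": "cgra_4x4"})          # invented name
flow.register("A", lambda s: {**s, "app": "reinforcement_learning"})
flow.register("G", lambda s: {**s, "rtl": f"{s['impl']}.v"})
print(flow.run({"name": "WindMill"}))
```

The plugin registry is what gives such a flow its flexibility: a new accelerator variant only requires registering different I-layer and G-layer plugins.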
An Intermediate Representation for Composable Typed Streaming Dataflow Designs
Tydi is an open specification for streaming dataflow designs in digital
circuits, allowing designers to express how composite and variable-length data
structures are transferred over streams using clear, data-centric types. These
data types are extensively used in a many application domains, such as big data
and SQL applications. This way, Tydi provides a higher-level method for
defining interfaces between components as opposed to existing bit and
byte-based interface specifications. In this paper, we introduce an open-source
intermediate representation (IR) which allows for the declaration of Tydi's
types. The IR enables creating and connecting components with Tydi Streams as
interfaces, called Streamlets. It also lets backends for synthesis and
simulation retain high-level information, such as documentation. Types and
Streamlets can be easily reused between multiple projects, and Tydi's streams
and type hierarchy can be used to define interface contracts, which aid
collaboration when designing a larger system. The IR codifies the rules and
properties established in the Tydi specification and serves to complement
computation-oriented hardware design tools with a data-centric view on
interfaces. To support different backends and targets, the IR is focused on
expressing interfaces, and complements behavior described by hardware
description languages and other IRs. Additionally, a testing syntax for the
verification of inputs and outputs against abstract streams of data, and for
substituting interdependent components, is presented which allows for the
specification of behavior. To demonstrate this IR, we have created a grammar,
parser, and query system, and paired these with a backend targeting VHDL.
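The data-centric typing idea, declaring composite, variable-length stream types and using them as interface contracts between components, can be modeled in a few lines. This is an illustrative model in Python, not the Tydi IR's actual syntax; the class and port names are invented.

```python
# Illustrative model (not the Tydi IR syntax) of data-centric stream
# interfaces: a Stream carries elements of some data type, with a
# dimension for variable-length nesting; Streamlets declare named
# stream ports, and connections are type-checked like contracts.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Bits:
    width: int

@dataclass(frozen=True)
class Stream:
    element: object
    dimension: int = 0      # nesting depth for variable-length sequences

@dataclass
class Streamlet:
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)

def connect(src, out_port, dst, in_port):
    """Connect two streamlets; stream types must match (the contract)."""
    if src.outputs[out_port] != dst.inputs[in_port]:
        raise TypeError("incompatible stream types")
    return (src.name, out_port, dst.name, in_port)

utf8 = Stream(element=Bits(8), dimension=1)   # variable-length byte string
parser = Streamlet("Parser", outputs={"text": utf8})
counter = Streamlet("Counter", inputs={"text": utf8})
print(connect(parser, "text", counter, "text"))
```

The point of such contracts is exactly what the abstract claims: two teams can build `Parser` and `Counter` independently, and the type check at the connection catches interface mismatches before synthesis.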
A General Framework for Accelerator Management Based on ISA Extension
Thanks to the promised improvements in performance and energy efficiency, hardware accelerators are gaining momentum in many computing contexts, both in variety and in their relative weight in the silicon area of many chips. Commonly, the way an application interacts with these hardware modules has many accelerator-specific traits and requires ad-hoc drivers that usually rely on potentially expensive system calls to manage accelerator resources and access orchestration. As a consequence, driver-based interfacing is far from uniform and can expose high latency, limiting the set of tasks suitable for acceleration. In this paper, we propose a uniform and low-latency interface based on Instruction Set Architecture (ISA) extension. All previous studies that proposed extensions were deeply tailored to a single accelerator. One of the biggest disadvantages of those methods is their inability to scale: adding more of these accelerators to one System-on-Chip (SoC) would result in ISA bloat, increasing power consumption and complicating the decoding phase proportionally. Our proposed framework consists of a six-instruction ISA extension and the corresponding architectural support that implements the interface abstraction and the reservation logic at the hardware level. Our proposal allows controlling a broad class of integrated accelerators directly from the CPU. The proposed framework is ISA-independent, meaning it is applicable to all existing ISAs. We implement it on the gem5 simulator by extending the RISC-V ISA. We evaluate it by simulating three compute-intensive accelerators and comparing our interfacing with a conventional driver-based one. The benchmarks highlight the performance benefits brought by our framework, with up to 10.38x speedup, as well as the ability to seamlessly support different accelerators with the same interface.
The speedup advantage of our technique diminishes as the granularity of the workloads increases and the overhead of driver-based accelerators becomes less significant. We also show that the impact of its hardware components on chip area and power consumption is limited.
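The abstract describes a uniform reserve/configure/start/read/release interface backed by hardware reservation logic. The sketch below models that interaction in software. The instruction names and the accelerator table are our invention, since the paper's six instructions are not spelled out here; the point is only the uniform, driver-free calling pattern.

```python
# Hypothetical software model of a uniform ISA-level accelerator
# interface with hardware-side reservation logic. Instruction names
# (reserve/configure/start/read/release) are invented for illustration.
class AcceleratorTable:
    def __init__(self, accels):
        self.accels = {a: {"owner": None, "cfg": {}, "result": None}
                       for a in accels}

    def reserve(self, accel, hart):
        slot = self.accels[accel]
        if slot["owner"] is None:       # reservation logic: first come wins
            slot["owner"] = hart
            return True
        return False                    # busy: caller retries or backs off

    def configure(self, accel, key, val):
        self.accels[accel]["cfg"][key] = val

    def start(self, accel, data):
        cfg = self.accels[accel]["cfg"]
        # stand-in for the accelerator's actual computation
        self.accels[accel]["result"] = sum(data) * cfg.get("scale", 1)

    def read(self, accel):
        return self.accels[accel]["result"]

    def release(self, accel):
        self.accels[accel]["owner"] = None

tbl = AcceleratorTable(["mac0"])
assert tbl.reserve("mac0", hart=0)      # no syscall, no driver
tbl.configure("mac0", "scale", 2)
tbl.start("mac0", [1, 2, 3])
print(tbl.read("mac0"))                 # 12
tbl.release("mac0")
```

Because every accelerator is driven through the same small set of operations, adding a new accelerator extends only the table, not the ISA, which is the scaling property the paper argues for.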
Revet: A Language and Compiler for Dataflow Threads
Spatial dataflow architectures such as reconfigurable dataflow accelerators
(RDA) can provide much higher performance and efficiency than CPUs and GPUs. In
particular, vectorized reconfigurable dataflow accelerators (vRDA) in recent
literature represent a design point that enhances the efficiency of dataflow
architectures with vectorization. Today, vRDAs can be exploited using either
hardcoded kernels or MapReduce languages like Spatial, which cannot vectorize
data-dependent control flow. In contrast, CPUs and GPUs can be programmed using
general-purpose threaded abstractions.
The ideal combination would be the generality of a threaded programming model
coupled with the efficient execution model of a vRDA. We introduce Revet: a
programming model, compiler, and execution model that lets threaded
applications run efficiently on vRDAs. The Revet programming language uses
threads to support a broader range of applications than Spatial's parallel
patterns, and our MLIR-based compiler lowers this language to a generic
dataflow backend that operates on streaming tensors. Finally, we show that
mapping threads to dataflow outperforms GPUs, the current state-of-the-art for
threaded accelerators, by 3.8x.
Comment: To appear in HPCA 202
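The central idea, running threaded code with data-dependent control flow as streaming dataflow, can be conveyed with a small sketch. The lowering scheme below is our illustration, not Revet's compiler: each "thread" becomes a stream element, and a divergent while-loop becomes repeated filtering of the not-yet-done subset.

```python
# Illustrative sketch of mapping threads to streaming dataflow (our
# scheme, not Revet's): per-thread data-dependent control flow is
# replaced by stream partitioning, so there is no branch divergence.

def threaded(x):
    # per-thread view: loop until the value exceeds a threshold
    steps = 0
    while x < 10:
        x *= 2
        steps += 1
    return steps

def dataflow(xs):
    # stream view of the same computation: each element carries
    # (thread id, value, step count); live elements recirculate,
    # finished ones are retired to the output
    live = [(i, x, 0) for i, x in enumerate(xs)]
    done = {}
    while live:
        nxt = []
        for i, x, s in live:
            if x < 10:
                nxt.append((i, x * 2, s + 1))   # still live: recirculate
            else:
                done[i] = s                     # retired: emit result
        live = nxt
    return [done[i] for i in range(len(xs))]

xs = [1, 5, 12, 3]
print(dataflow(xs))   # matches the threaded version element-wise
```

On a vRDA the recirculating stream maps to hardware queues and vector lanes stay full regardless of how many iterations each thread needs, which is what a SIMT-style divergent loop cannot guarantee.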
Flip: Data-Centric Edge CGRA Accelerator
Coarse-Grained Reconfigurable Arrays (CGRA) are promising edge accelerators
due to the outstanding balance in flexibility, performance, and energy
efficiency. Classic CGRAs statically map compute operations onto the processing
elements (PE) and route the data dependencies among the operations through the
Network-on-Chip. However, CGRAs are designed for fine-grained static
instruction-level parallelism and struggle to accelerate applications with
dynamic and irregular data-level parallelism, such as graph processing. To
address this limitation, we present Flip, a novel accelerator that enhances
traditional CGRA architectures to boost the performance of graph applications.
Flip retains the classic CGRA execution model while introducing a special
data-centric mode for efficient graph processing. Specifically, it exploits the
natural data parallelism of graph algorithms by mapping graph vertices onto
processing elements (PEs) rather than the operations, and supporting dynamic
routing of temporary data according to the runtime evolution of the graph
frontier. Experimental results demonstrate that Flip achieves up to 36x
speedup with merely 19% more area compared to classic CGRAs. Compared to
state-of-the-art large-scale graph processors, Flip has similar energy
efficiency and 2.2x better area efficiency at a much-reduced power/area
budget.
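The data-centric mode the abstract describes, mapping vertices rather than operations onto PEs and driving execution from the evolving frontier, is the pattern of frontier-based graph traversal. The sketch below shows that pattern with a simple hash-based vertex-to-PE assignment; the partitioning rule and PE count are our illustration, not Flip's hardware.

```python
# Minimal sketch of the data-centric idea: vertices (not operations)
# are spread across PEs, and each round processes the current frontier,
# as in frontier-based BFS. The vertex % num_pes partitioning is an
# invented stand-in for Flip's actual vertex-to-PE mapping.
def frontier_bfs(adj, src, num_pes=4):
    dist = {src: 0}
    frontier = [src]
    while frontier:
        # assign frontier vertices to PEs by a simple hash
        by_pe = {p: [v for v in frontier if v % num_pes == p]
                 for p in range(num_pes)}
        nxt = []
        for pe, verts in by_pe.items():
            for v in verts:                     # each PE expands its vertices
                for w in adj.get(v, []):
                    if w not in dist:           # first visit sets distance
                        dist[w] = dist[v] + 1
                        nxt.append(w)
        frontier = nxt                          # runtime-evolving frontier
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(frontier_bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The frontier's size and membership are only known at runtime, which is exactly the dynamic, irregular data-level parallelism that a statically scheduled operation-to-PE mapping struggles with.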
A Quantitative Analysis and Guideline of Data Streaming Accelerator in Intel 4th Gen Xeon Scalable Processors
As semiconductor power density no longer remains constant as the technology
process scales down, modern CPUs are integrating capable data accelerators on
chip, aiming to improve performance and efficiency for a wide range of
applications and usages. One such accelerator is the Intel Data Streaming
Accelerator (DSA) introduced in Intel 4th Generation Xeon Scalable CPUs
(Sapphire Rapids). DSA targets data movement operations in memory that are
common sources of overhead in datacenter workloads and infrastructure. In
addition, DSA is much more versatile, supporting a wider range of
operations on streaming data, such as CRC32 calculations, delta record
creation/merging, and data integrity field (DIF) operations. This paper sets
out to introduce the latest features supported by DSA, deep-dive into its
versatility, and analyze its throughput benefits through a comprehensive
evaluation. Along with the analysis of its characteristics and the rich
software ecosystem of DSA, we summarize several insights and guidelines for
programmers to make the most of DSA, and use an in-depth case study of DPDK
Vhost to demonstrate how these guidelines benefit a real application.
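Among the streaming operations the abstract lists, delta record creation/merging is easy to illustrate in software: record which fixed-size words differ between two buffers, then apply that delta later. This is our functional model of the concept, not DSA's descriptor format or its actual record layout.

```python
# Illustrative software model of delta record creation and merging (a
# sketch of the concept, not DSA's descriptor or record format): record
# which 8-byte words differ between a source and a reference buffer,
# then apply the delta to reconstruct the source.
WORD = 8

def create_delta(src, ref):
    """List of (offset, word) pairs where src differs from ref."""
    assert len(src) == len(ref) and len(src) % WORD == 0
    return [(off, src[off:off + WORD])
            for off in range(0, len(src), WORD)
            if src[off:off + WORD] != ref[off:off + WORD]]

def apply_delta(ref, delta):
    """Merge the delta back into the reference buffer."""
    out = bytearray(ref)
    for off, word in delta:
        out[off:off + WORD] = word
    return bytes(out)

old = bytes(32)                               # 4 words of zeros
new = bytes(8) + b"ABCDEFGH" + bytes(16)      # one word changed
delta = create_delta(new, old)
print(len(delta), apply_delta(old, delta) == new)   # 1 True
```

The appeal of offloading this to DSA is that the scan and the merge are pure data-movement-style operations: the CPU submits a descriptor and is free to do other work while the accelerator streams through the buffers.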